Research Article

User Requirement Solicitation for an Information Retrieval System Applied to Dutch Grey Literature in the Archaeology Domain

Authors:

Alex Brandsen,

Leiden University, NL
PhD candidate in Digital Archaeology at Leiden University, using text mining to access information in grey literature.

Karsten Lambers,

Leiden University, NL

Suzan Verberne,

Leiden University, NL

Milco Wansleeben

Leiden University, NL

Abstract

In this paper, we present the results of user requirement solicitation for a search system for grey literature in archaeology, specifically Dutch excavation reports. This search system uses Named Entity Recognition and Information Retrieval techniques to create an effective and effortless search experience. Specifically, we used Conditional Random Fields to identify entities, with an average F1 score of 56%. This is a baseline result, and we identified many possibilities for improvement. These entities were indexed in ElasticSearch and a user interface was developed on top of the index. This proof of concept was used in user requirement solicitation and evaluation with a group of end users. Feedback from this group indicated that there is a dire need for such a system, and that the first results are promising.

How to Cite: Brandsen, A., Lambers, K., Verberne, S. and Wansleeben, M., 2019. User Requirement Solicitation for an Information Retrieval System Applied to Dutch Grey Literature in the Archaeology Domain. Journal of Computer Applications in Archaeology, 2(1), pp.21–30. DOI: http://doi.org/10.5334/jcaa.33
Submitted on 14 Jan 2019
Accepted on 27 Feb 2019
Published on 18 Mar 2019

1 Introduction

The archaeological world creates huge amounts of text in different formats, from books and scholarly articles to unpublished fieldwork reports. These reports are also known as grey literature. Easy access to the information hidden in these texts is a substantial problem for the archaeological field. Making these documents searchable and analysing them is a time consuming task when done by hand, and will often lack consistency. Text mining and Information Retrieval (IR) provide methods for disclosing information in large text collections, allowing researchers to locate (parts of) texts relevant to their research questions, as well as being able to identify patterns of past behaviour in these reports (Richards, Tudhope & Vlachidis 2015).

The Malta convention (or Valletta Treaty) is a European treaty, signed on 16 January 1992. It came into effect on 25 May 1995, and its aim is to protect archaeological remains by making ‘the conservation and enhancement of the archaeological heritage one of the goals of urban and regional planning policies’ (Council of Europe 1992, Art. 1). The convention was implemented in the Netherlands via the Archaeological Heritage Management Act in 2007 (Ministerie van Onderwijs, Cultuur en Wetenschap 2007). Preferably, preserving these remains is done by keeping them in situ, but when this is not possible, the developer disturbing the ground record is required by law to pay for the archaeological research. This research is generally performed by commercial archaeology units.

This archaeological research has created a collection of texts that is too large to be completely read by humans. The number of reports created in the last 20 years is currently estimated at just under 60,000, and is growing by approximately 4,000 per year (RCE 2017). Most of these reports are categorised as ‘grey literature’ (Evans 2015), and are likely to end up in a proverbial ‘graveyard’, unread and unknown, unless they are properly archived, indexed and disclosed.

In the Netherlands, the SIKB (Stichting Infrastructuur Kwaliteitsborging Bodembeheer) creates and maintains the standards for activities relating to soil management. As stipulated in its BRL 4000 guidelines, a report has to be deposited into an e-depot within 2 months of completing the project (BRL 4000 2016: Art. 2.6.2). While some companies and municipalities are still reluctant to deposit their reports into national e-depots (instead opting to deposit in small local depots), most reports and the associated metadata do end up in one of the three main e-depots of the Netherlands: the Data Archiving and Networked Services (DANS) repository, the Document Management System of the Rijksdienst voor Cultureel Erfgoed (RCE) or the Koninklijke Bibliotheek (KB) e-Depot. There is considerable overlap between the DANS, RCE and KB datasets, and altogether they are estimated to hold around 70 percent of all so-called Malta reports. This means that a large portion of the reports is currently available, and access to the files is not a major problem at the moment. What is currently lacking, however, is access to the full texts of the reports. Full-text search would allow researchers to access the content of the documents and enable more precise and detailed information retrieval.

This paper describes the work carried out in the first year of a PhD project on Text Mining in the archaeological domain. This project is in association with both the Faculty of Archaeology and the Data Science Research Programme (DSRP) at the University of Leiden, combining archaeological knowledge with the technical skills available in the Data Science department.

The work carried out in this project is motivated by the need of researchers in the archaeological field to be able to efficiently and effectively find information related to their research questions in the available grey literature. This requirement has been well documented in previous work (e.g. Richards, Tudhope & Vlachidis 2015; Van den Dries 2016) and some studies have investigated different applications of text mining from archaeological reports in English (Vlachidis and Tudhope 2016; Amrani, Abajian & Kodratoff 2008; Byrne and Klein 2010; Vlachidis and Tudhope 2015) and Dutch (Paijmans and Brandsen 2010; Vlachidis et al. 2017).

However, no system is currently available that allows full-text access to a substantial part of the Dutch archaeological document collection. As a result, relevant and valuable information is not being utilised by researchers, especially by those who are not yet experts in their (sub)field. Information like a single Bronze Age find in an otherwise Medieval site is unlikely to be mentioned in the metadata, and is thus nearly impossible to find. This is a problem from a theoretical point of view, as key information that could change archaeological interpretations may currently be overlooked. It also devalues the monumental effort that has gone into collecting, digitising, archiving and publishing these documents, as well as the legislation that has been drawn up surrounding the archiving of these documents.

More and more text mining, data mining and IR tools and techniques have become available over the last years, which could potentially provide a way to access and extract information from this wealth of data currently hidden in these reports. This, combined with the relatively easy access to higher computer processing power, makes a systematic implementation of text mining techniques for Dutch archaeological reports not only desirable, but also feasible.

In this project we are developing AGNES (Archaeological Grey-literature Named Entity Search), a search system that aims to make archaeological grey literature more accessible and searchable by applying IR techniques to this big dataset.

The goals of this paper are (1) to give an overview of previous work on text mining in archaeology, (2) to show the need for a search system by interviewing the user group, (3) to solicit user requirements for such a system, (4) to present the results of the initial experiments with Named Entity Recognition (NER) and (5) to present the indexing and front end software of the developed system.

Section 2 places the research in context, while section 3 provides some technical background information on the developed system. Section 4 discusses the user study and requirement solicitation, and section 5 outlines future work.

1.1 Data

The data used in this research is a dump of all Dutch archaeology reports from the DANS repository that were available as PDF files in 2016. These include both scanned paper documents that have been OCRed and born-digital documents. It is expected that the OCRed documents contain errors to various degrees, which will complicate any efforts to apply text mining to them. We estimate that only about 15% of the dataset are OCRed PDFs, and all new documents will be born-digital, so this percentage will decrease over time.

2 Prior Work

Some experiments have been carried out in text mining in archaeology, across multiple countries and languages. This includes work by Epure et al. (2015), describing how to mine process models from natural text, and also recent works by Øyvind and Martin-Rodilla presented at EAA 2018, describing automatic information extraction from reports and semiautomatic analysis on heritage related legal texts, respectively (currently unpublished). Related work is also carried out in other disciplines such as history, a notable example being ALCIDE, a system that extracts and visualises content from large document collections in the history domain (Moretti et al. 2016).

More specifically, some work has been carried out on NER: the finding and classifying of concepts in text. In English, one of the earliest contributions is the work by Amrani, Abajian & Kodratoff (2008), which helped experts to extract information from archaeological literature. Byrne and Klein (2010) also investigated the extraction of information, but focused solely on event information. The OPTIMA system, described by Vlachidis (2012), used a rules-based approach to semantic indexing, including NER. Another notable project is Archaeotools in the UK, which combined databases with information extracted from reports in a faceted browser interface (Jeffrey et al. 2009). A more recent paper is that by Kintigh (2015), which provides a detailed overview of the problems and possible solutions, but does not include the development of a search system.

For Dutch language reports, most of the previous research has been carried out by Paijmans with several collaborators, including extracting monument names from free text fields (Paijmans and Brandsen 2009) and the OpenBoek system, which used memory-based learning to perform NER (Paijmans and Wubben 2008; Paijmans and Brandsen 2010). Like the work by Byrne and Klein (2010), this project focused mainly on time periods, but also applied some rules-based NER to detect place names. The OpenBoek system included an online search interface during the CATCH (Continuous Access To Cultural Heritage) project, but unfortunately this is not available anymore.

Our project builds most directly upon the text mining experiments performed by researchers of the University of South Wales in the European ARIADNE project between 2013 and 2017. They applied a rules-based technique to the problem, utilising the GATE (General Architecture for Text Engineering) framework (Cunningham et al. 2013). As a proof of concept, eight Dutch reports were analysed and compared to manually tagged ‘gold-standard’ documents, alongside English, Swedish and German reports. In the same project, the ADS (Archaeology Data Service) in the UK applied machine learning techniques to English grey literature, and developed an API that can automatically create metadata based on entered text (Vlachidis et al. 2017).

The contributions of this paper compared to previous work are twofold: (1) it includes a user study, which has not previously been undertaken, to collect the user needs for text mining in the archaeological domain; and (2) it combines the results of the NER with a full-text index in an effective search interface.

More broadly, this project is carried out in cooperation with the DSRP, which gives us access to a high-performance computing cluster, allowing for the use of more computationally expensive techniques on bigger document sets. The length of this project is also an important asset: most previous experiments were performed over a short period of time, making it difficult to create a finished system, while this project takes place over four years with the specific aim of creating a user-friendly web application.

3 Introducing AGNES

AGNES stands for Archaeological Grey-literature Named Entity Search, and is the name of the search system currently under development in this project, including both the front end of the web application, as well as the indexing software responsible for finding and indexing archaeological concepts. The logo of the system can be seen in Figure 1. The current version of the system (v0.2) is available at https://agnessearch.nl/index.php/search/. (Please note, free registration is needed to access the system.) The source code will be made available later in the project.

Figure 1 

AGNES Logo.

3.1 Named Entity Recognition

A standard full-text index, allowing researchers to search through all of the text instead of just the metadata, would already be an improvement on the current situation. However, such a full-text search would not account for synonymy (multiple words that have the same meaning) or polysemy (one word that has multiple meanings). See Table 1 for two non-exhaustive examples, where a full-text search would either not return all results, or return possibly wrong results. This is why NER is needed to accurately index these documents.

Table 1

Synonymy and Polysemy examples.

| Synonymy | | Polysemy | |
| --- | --- | --- | --- |
| Main Term | Neolithic | Main Term | Swifterbant |
| Synonyms | Late Stone Age | Meanings | Time Period |
| | 3000 BC | | Excavation |
| | 5000 BP | | Pottery Type |
| | 4th Millennium BC | | Location |

NER is a method that aims to identify and classify specific entities in natural language, also known as unstructured written text (Marrero et al. 2013). In this project, the entities are archaeological concepts and the natural language texts are excavation reports. For example, the sentence ‘We found pottery dating from the Neolithic inside a rubbish pit’ contains three entities: ‘pottery’ (an artefact), ‘Neolithic’ (a time period) and ‘rubbish pit’ (a feature/context).

In the current version of the system, we used Conditional Random Fields (CRF) to train the named entity recogniser (Okazaki 2007). This is a form of machine learning specifically designed to label sequence data (Lafferty et al. 2001), and a common choice for NER tasks as words in a sentence are sequential. We used the scikit-learn Python package (Pedregosa et al. 2011) with the default algorithm (gradient descent using the L-BFGS method). The input for this algorithm consisted of manually tagged Dutch reports created in the ARIADNE project (Vlachidis et al. 2017), specifically selected to be a good sample of the corpus. In total, this training set consists of roughly 500,000 words, containing 11,000 tagged entities. Some issues with these documents are discussed later in this section.

The annotated .docx files were tokenised and Part Of Speech (POS) tagged[1] using Frog (Van den Bosch et al. 2007) and then converted to the FoLiA XML format (Van Gompel and Reynaert 2013). These steps are needed as CRF requires the input to be tokenised and POS tagged. Subsequently, the documents were converted to the format scikit-learn requires: a list of tokens including each token’s POS and category (or concept) tag. At the moment, only three archaeological categories are used, as these have the most training data available: artefact, time period and material. More categories will be added in later versions. For each token, the following features were extracted for the word itself, as well as for the words directly before and after it:

  • Word in lowercase
  • Word starts with uppercase character
  • Word is all uppercase
  • Word is all numbers
  • Part of speech tag
  • Word exists in materials wordlist
  • Word exists in periods wordlist
  • Word exists in artefacts wordlist
  • Word is the beginning of a sentence
  • Word is the end of a sentence
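The feature list above can be sketched as a function that builds one feature dictionary per token. This is a minimal illustration, not the project's code: the wordlists, the feature names, the POS tags and the example sentence are invented for the sketch, and the real wordlists come from the ABR thesaurus.

```python
# Hypothetical mini wordlists; the paper takes these from the ABR thesaurus.
WORDLISTS = {
    "material": {"vuursteen", "brons"},
    "period": {"neolithicum", "bronstijd"},
    "artefact": {"bijl", "schrabber"},
}


def single_token_features(token, pos, prefix=""):
    """Features for one (word, POS tag) pair; prefix marks neighbouring tokens."""
    lower = token.lower()
    return {
        prefix + "lower": lower,
        prefix + "starts_upper": token[:1].isupper(),
        prefix + "all_upper": token.isupper(),
        prefix + "all_digits": token.isdigit(),
        prefix + "pos": pos,
        prefix + "in_materials": lower in WORDLISTS["material"],
        prefix + "in_periods": lower in WORDLISTS["period"],
        prefix + "in_artefacts": lower in WORDLISTS["artefact"],
    }


def token_features(sentence, i):
    """Feature dict for token i of a sentence given as [(word, pos), ...]."""
    word, pos = sentence[i]
    features = single_token_features(word, pos)
    features["BOS"] = i == 0                   # beginning of sentence
    features["EOS"] = i == len(sentence) - 1   # end of sentence
    if i > 0:  # word before the current one
        features.update(single_token_features(*sentence[i - 1], prefix="-1:"))
    if i < len(sentence) - 1:  # word after the current one
        features.update(single_token_features(*sentence[i + 1], prefix="+1:"))
    return features


# Illustrative sentence ("an axe of flint") with simplified POS tags.
sentence = [("Een", "LID"), ("bijl", "N"), ("van", "VZ"), ("vuursteen", "N")]
features = [token_features(sentence, i) for i in range(len(sentence))]
```

A list of such dictionaries per sentence, paired with a list of category tags, is the kind of input that CRF toolkits with a scikit-learn style interface typically expect.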

This default feature set is meant to provide a baseline result. To evaluate the NER, we performed leave-one-out cross-validation over the eight annotated documents: the algorithm is run eight times, each time training on seven of the documents and testing the model on the remaining one. After rotating through all eight combinations, the scores are averaged. The overall averaged F1 score is 56%, with the results for the different categories presented in Table 2. As can be seen from this table, the average precision[2] is fairly high at 71%, but the recall[3] is much lower at only 48%. This means that 71% of the automatically labelled entities are correct, but only 48% of all entities present were found by the automatic labelling.
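The rotation through the documents can be sketched as follows. This is only an illustration of the evaluation scheme: the document identifiers are placeholders, and `train_and_score` is a hypothetical callback standing in for training the CRF on the training documents and scoring it on the held-out one.

```python
def leave_one_out_folds(documents):
    """Yield (training_docs, test_doc) pairs, holding out each document once."""
    for held_out in range(len(documents)):
        train = documents[:held_out] + documents[held_out + 1:]
        yield train, documents[held_out]


def cross_validate(documents, train_and_score):
    """Average the score returned by train_and_score(train, test) over all folds."""
    scores = [train_and_score(train, test)
              for train, test in leave_one_out_folds(documents)]
    return sum(scores) / len(scores)


# Eight placeholder identifiers stand in for the annotated reports.
docs = [f"report_{n}" for n in range(1, 9)]
folds = list(leave_one_out_folds(docs))
```

With eight documents this yields eight folds of seven training documents each, and every document is used exactly once as the test document.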

Table 2

Precision, recall and F1-scores for the 3 targeted entities, on a scale of 0 to 1.

| | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| Artefact | 0.76 | 0.40 | 0.53 |
| Time Period | 0.65 | 0.58 | 0.61 |
| Material | 0.72 | 0.46 | 0.56 |
| Average | 0.71 | 0.48 | 0.56 |
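As a reminder of how these three measures relate, the sketch below computes precision, recall and F1 from entity counts. The counts in the example are invented for illustration and are not taken from the experiments. Because the table values are rounded, recomputing F1 from them can differ slightly in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Entity-level scores from counts of correct, spurious and missed entities."""
    predicted = true_positives + false_positives   # entities the model labelled
    actual = true_positives + false_negatives      # entities in the gold standard
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall, f1_score(precision, recall)


# Invented counts: 60 correct, 20 spurious and 40 missed entities.
p, r, f1 = precision_recall_f1(60, 20, 40)

# The rounded precision and recall of the time period row in Table 2.
time_period_f1 = f1_score(0.65, 0.58)
```

In the invented example the model is right about 75% of what it labels but finds only 60% of what is there, the same pattern of precision exceeding recall that Table 2 shows.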

When assessing the results of the NER, we discovered some issues with the gold standard documents that could affect the accuracy. Some tagging decisions appear to have expanded entities to the left or right. For example, wherever the word ‘before’ or ‘after’ occurs before a time period, it is included in the tag, while ideally it should not be, as it is not part of the time period itself (e.g. na de 3e eeuw ‘after the 3rd century’). If the NER then fails to classify these prefixes as part of the entity, the recall will be lower than the precision, which can also be seen in our results.

The artefact, time period and material wordlists were taken from the Archeologisch Basis Register (ABR), a thesaurus for Dutch archaeology maintained by the RCE. It contains phrases that are written in such a way that they do not match the way we would find these phrases in natural language. For example, the entry for ‘doorboorde bijl’ (perforated axe) is ‘bijl, doorboord’ (axe, perforated) in the thesaurus, making it difficult to match the two. These two issues are further discussed in section 5.
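The mismatch can be illustrated with a small sketch that reorders inverted thesaurus entries and then matches them loosely against running text. This is an assumed workaround for illustration only, not the approach taken in the project: the function names are hypothetical, and the prefix test is a crude stand-in for proper lemmatisation.

```python
def natural_order(entry):
    """Turn an inverted thesaurus entry 'head, modifier' into 'modifier head'."""
    if "," not in entry:
        return entry.strip()
    head, modifier = (part.strip() for part in entry.split(",", 1))
    return f"{modifier} {head}"


def matches_entry(phrase, entry):
    """Loose match: modifiers may carry an inflectional suffix in running text."""
    expected = natural_order(entry).lower().split()
    found = phrase.lower().split()
    if not expected or len(expected) != len(found):
        return False
    # All words but the last are modifiers; accept 'doorboorde' for 'doorboord'.
    *modifiers, head = zip(expected, found)
    return head[0] == head[1] and all(f.startswith(e) for e, f in modifiers)


reordered = natural_order("bijl, doorboord")
```

Reordering alone gives 'doorboord bijl', which still differs from the inflected 'doorboorde bijl' found in reports, so some form of lemma or prefix matching remains necessary.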

The code described in this section is available at DOI: 10.5281/zenodo.1238861.

3.2 Indexing & Front End

For this version of AGNES, 100 randomly selected reports from the DANS repository were indexed. For each page in these documents, the trained CRF model is used to extract the named entities. These are combined with the full text of the page and converted into a JSON structure, which can then be indexed directly by ElasticSearch (Gormley and Tong 2015), an open source search engine running on a web server. ElasticSearch uses JSON over HTTP to index and retrieve information, making it suitable for integration with other systems. The other advantage of using ElasticSearch is that it includes a number of features by default that are very useful for these kinds of search systems, including a result ranking system.
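A per-page JSON document of the kind described above might look as follows. The field names (`report_id`, `page`, `fulltext`, `artefact`, `period`, `material`) and the example values are assumptions made for this sketch; the actual index mapping used in AGNES is not given in the text.

```python
import json

ENTITY_FIELDS = ("artefact", "period", "material")


def page_document(report_id, page_number, text, entities):
    """Combine a page's full text with its recognised entities.

    entities is a list of (category, surface_form) pairs from the NER step.
    """
    doc = {"report_id": report_id, "page": page_number, "fulltext": text}
    for field in ENTITY_FIELDS:
        doc[field] = []
    for category, surface in entities:
        value = surface.lower()
        if category in ENTITY_FIELDS and value not in doc[category]:
            doc[category].append(value)   # keep each entity once per page
    return doc


# Invented example page and NER output.
entities = [("artefact", "Bijl"), ("material", "vuursteen"),
            ("period", "Neolithicum"), ("artefact", "bijl")]
doc = page_document("report-001", 12,
                    "Bijl van vuursteen uit het Neolithicum.", entities)
body = json.dumps(doc, ensure_ascii=False)   # JSON body for an index request
```

Keeping the entities in separate fields next to the full text is what lets a query target a specific entity type or fall back on plain full-text search.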

To query the index, a front end user interface has been developed. As a framework for the web application, the free and open source content management system Concrete5 was used (concrete5 2018).

To create a query, the user can use a query builder (Sorel 2018) that allows for Boolean AND/OR logic. Users can specify exactly which entity type they are looking for in each part of the query, or select a general full-text search (see Figure 2). This allows for complex queries such as:

artefact:scraper AND (period:neolithic OR period:mesolithic) AND fulltext:burnt

which returns results on scrapers from the Neolithic or Mesolithic that also mention ‘burnt’.

Figure 2

AGNES front end screenshot showing a query for flint flakes from the Neolithic or Bronze Age.

The query is then converted to a JSON format by the front-end application, and the ElasticSearch index is queried using the ElasticSearch-PHP client (Tong 2018), resulting in a list of matching results. It is useful to rank and sort these results by relevance, so the documents that are most likely to be relevant to a query are at the top of the list. To do this, ElasticSearch calculates a score for each result, which is based on the importance of each query term that appears in that document (ElasticSearch 2018).
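The conversion of the example query into ElasticSearch's JSON query language could look roughly like the sketch below, which uses the standard `bool` query with `must` (AND) and `should` (OR) clauses. The AGNES front end does this in PHP; Python is used here only for illustration, and the helper functions and field names are assumptions.

```python
import json


def term(field, value):
    """A single field:value condition."""
    return {"match": {field: value}}


def all_of(*clauses):
    """Boolean AND: every clause must match."""
    return {"bool": {"must": list(clauses)}}


def any_of(*clauses):
    """Boolean OR: at least one clause must match."""
    return {"bool": {"should": list(clauses), "minimum_should_match": 1}}


# artefact:scraper AND (period:neolithic OR period:mesolithic) AND fulltext:burnt
query = {
    "query": all_of(
        term("artefact", "scraper"),
        any_of(term("period", "neolithic"), term("period", "mesolithic")),
        term("fulltext", "burnt"),
    )
}
body = json.dumps(query, indent=2)   # JSON sent to the search endpoint
```

Nesting the OR group inside the AND group mirrors the brackets in the query builder, and each matching clause contributes to the relevance score of a result.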

Once the results are displayed, the user can view a snippet of the text surrounding the keywords, preview the page of the report or go directly to the project archive in the DANS repository to download the PDF document. To respect the copyright of these files, no PDFs are made available on the AGNES server. A graphical representation of the full workflow of AGNES can be found in Figure 3, which also displays the split between pre-processing of the documents on a high-performance cluster, and the indexing and querying that takes place on a standard web server.

Figure 3 

AGNES Workflow.

4 User Study

Part of this research includes a user study, to ensure the needs of the potential users are met. The focus group, as well as the methods and results of the first workshop, are detailed below.

4.1 Definition of target audience

To build an effective search system, the expected users of the system must first be defined. As the main goal of this system is to make information available for research, the expected user is a researcher working in Dutch archaeology. These researchers can work in a variety of organisational settings, including academia, commercial archaeology and regional/national government.

One of the main user groups expected to use this system consists of academics and people in higher education. However, this group is not homogeneous: a professor, for example, will have much more in-depth knowledge and will already be aware of most of the literature and field reports related to their field, in stark contrast to a bachelor or PhD student who will still be exploring the literature and information available. Because of this difference in knowledge, these users will ask different questions of the dataset and in different ways. However, regardless of their knowledge level, it is expected that academic researchers will generally be asking thematic questions of the dataset: questions about a particular time period, artefact type, context and/or location.

Another main user group is researchers in Dutch commercial archaeology. While this group will also be interested in the documents, it is likely that they will mainly want to use the system to find all information about a particular geographic area. This is because the main use of these reports for commercial archaeologists is to create desk assessments (bureau-onderzoeken) and archaeological prediction/expectation maps (archeologische verwachtingskaarten) about a specific area, generally because the area surrounds a potential building site. As some maps are also created by period, combined queries of place and time are also expected. There are three types of commercial archaeology, each of which is expected to have slightly different needs and requirements. These three types are inventorisation (investigating existing research), exploration or prospection (e.g. surveys and coring) and excavation (generally after the previous two types have been completed).

A third expected user group is municipal and regional (or provincial) archaeologists. Regarding their requirements, these will most probably fall in between academic and commercial archaeologists. While generally they will research a certain timespan in a particular area, it is likely that they will also want to research broader themes. However, generally they will be aware of all the available literature in their area already, so perhaps a search system is less useful for this group.

Researchers at the RCE are a fourth user group, and will probably have similar needs to municipal archaeologists, except they are working on a country wide geographical scale. These researchers will commonly work on nation-wide synthesising research, combining the information from a large number of reports into a larger picture.

Outside of the archaeological sphere, it is possible that the system will also be used by historians researching specific time periods such as the Middle Ages, where there is an overlap between archaeology and history. It is expected these scholars will have similar requirements to archaeological academics.

Lastly, it is possible that this system might be used by amateur archaeologists, amateur historians, metal detectorists and other enthusiasts, for a variety of reasons.

4.2 Focus group

In order to collect the requirements of archaeologists in the Netherlands, a focus group was set up. Members of the focus group participated on a voluntary basis. This group’s function at the start of the project is to express their needs and wishes for a system like this, while in further stages of the project they can provide feedback on the developed features. The size and make-up of this group is flexible, and can be changed during the project to fit with the current goals and/or address issues of representativeness.

The focus group has been selected to be as representative as possible of the Dutch archaeological landscape, taking into account the target audience definition from section 4.1. The group consists of 5 academics, 2 commercial professionals and 2 archaeologists working on different levels in government. See Table 3 for a more detailed breakdown of the participants.

Table 3

Overview of participants in focus group per category.

| Group | Situation | Count |
| --- | --- | --- |
| Academia | PhD Student | 3 |
| Academia | Assistant Professor | 1 |
| Academia | Lecturer | 1 |
| Commercial Archaeology | Excavation | 1 |
| Commercial Archaeology | Prospection | 1 |
| Government | Municipal | 1 |
| Government | National | 1 |

No amateur researchers were selected for the focus group, mainly because they are not an intended user of the system, but also because their approaches to research are so wide ranging, it would be virtually impossible to assemble a representative group of people.

4.3 Prototype for discussion

From personal experience in commercial software development, as well as experiences from IR researchers in other fields (e.g. Verberne, Boves & van den Bosch 2016), it seems that users in general, but users from the humanities specifically, find it difficult to express their requirements, oftentimes resulting in broad requirements that are too vague to interpret and implement. This can be further compounded by a lack of understanding of what is technically possible, leading to overly optimistic or very cautious expectations. We therefore first created a prototype with limited functionality (as discussed in section 3) as a starting point for discussions, in order to elicit feedback that is more detailed and can be implemented properly.

4.4 Workshops

The focus group will gather once a year during the project, for a total of 4 workshops. The initial workshop has been conducted, with the main aim of soliciting the requirements of the users. Later workshops will focus more on assessing the system and its results. Minutes will be taken at each session to record the comments and feedback of the group, and these will be made public after anonymisation.

The first workshop started with an introduction to the problem, as well as some background information on IR and NER (see also section 3.1). The group was then asked what their current search behaviour is, and what problems they encounter, before being shown a prototype of the system (v0.2) and asked to provide feedback on both the functionality and the relevance of the results.

Finally, specific user requirements were discussed. A suggested list of features was provided to the participants, who then discussed in groups of 2 which features they would find most useful, rating each on a scale of 0 to 3, with 0 being not useful or relevant at all and 3 being very useful and high priority. The participants were also asked to think of features not currently on the list.

4.5 Results

From the comments of the group, it was clear that the grey literature problem is very familiar to everyone present. Feedback on their current search behaviour showed that most people use the DANS search functionality (found at DANS 2019) and find it insufficient for their search needs, with most people having to manually search through individual documents to find information. Some participants, instead of using DANS, usually ask experts in the field to provide them with references. The Archis[4] system is used to a lesser degree, again mainly because the search functionality is not sufficient. Some people explained that they create their own literature lists with keywords to be able to find materials previously accessed.

Initial feedback on the prototype indicates that the users find the returned results relevant to their queries; however, much improvement is needed on the front end, as further discussed in the next paragraph.

The results from the feature elicitation were interesting: everyone agreed that indexing by chapter and section would be more useful than indexing by page or document, and that this should be a high priority in the further development of AGNES. Another high priority feature across the group was searching by drawing a polygon on a map as well as plotting results on a map, an indication that archaeologists have a strong need for geographical search. Another interesting result is that, in general, everyone preferred to get many results including some irrelevant documents, rather than a smaller set of documents that are all relevant but with the risk of missing some documents. This means that the recall of the system is more important than the precision, which needs to be taken into account in assessing the results of the NER as well as in the overall system evaluation. For a full overview of the averaged result for each feature, please see Table 4.

Table 4

Features and average scores (0–3) across focus group (n = 9), in decreasing order of average score. Facets mean the option for users to refine results by selecting metadata categories, as often found on online shopping websites. An asterisk (*) indicates a feature suggested by a user.

| Feature | Average |
| --- | --- |
| Search on map – plot results on map | 2.78 |
| Search on map – draw polygon | 2.56 |
| High recall over high precision | 2.56 |
| Search on map – morphology/expectation overlay* | 2.44 |
| Index by chapter/section | 2.33 |
| Facets – time/artefact/place | 2.22 |
| Facets – research type | 2.11 |
| Personalise – alert if new docs in saved search* | 2.11 |
| Related documents – by area | 1.89 |
| Facets – timeline | 1.78 |
| Personalise – save search* | 1.78 |
| Related documents – by time | 1.78 |
| Ordering – by relevance | 1.78 |
| Personalise – mark documents as ‘seen’* | 1.78 |
| Ordering – by distance | 1.67 |
| Related documents – by artefact | 1.67 |
| Related documents – general | 1.56 |
| Plot terms in document | 1.56 |
| Ordering – by date added | 1.11 |

5 Future Work

One of the goals of the project is to expand the corpus from just the DANS documents to also include documents from the RCE and the Koninklijke Bibliotheek, and to create a pipeline or API that allows new documents added to these 3 repositories to be automatically added to the index. The work discussed in this paper is the result of the first year of a 4 year project. Each year, a new version will be developed, tested, and assessed by the focus group. Part of this evaluation will include a threats-to-validity study.

The first issue that needs to be resolved is the training data for the NER. It seems that entities have been tagged sub-optimally for the NER task, and it is expected that improving the annotations will increase the accuracy of the model. We are currently enlisting the help of a group of archaeology students to re-tag these documents, and possibly tag new documents as well. We will have multiple people tag the same documents, so we can calculate the inter-rater agreement (Cohen 1960): a measure of the extent to which two humans agree on the annotation, which is an indirect indication of the difficulty of the task.
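Cohen's kappa corrects the raw agreement between two annotators for the agreement expected by chance. The sketch below shows the calculation on token-level labels. The label names and the two annotation sequences are invented for illustration, and the planned study may compute agreement differently, for example on entity spans.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: based on how often each annotator uses each label.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0   # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented token labels: ART = artefact, PER = time period, MAT = material, O = none.
annotator_a = ["ART", "PER", "O", "O", "MAT", "O", "PER", "O"]
annotator_b = ["ART", "O", "O", "O", "MAT", "O", "PER", "ART"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this invented example the annotators agree on 6 of 8 tokens (0.75), but because chance agreement is about 0.33 the kappa is roughly 0.63.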

The other problem that will be addressed is the shortcomings of the ABR wordlists (see the example of ‘perforated axe’ in section 3.1). We are currently in discussion with the Rijksdienst voor het Cultureel Erfgoed (RCE), who manage these lists, to see if it is possible to add a new field for either the lemma of the word or multiple spellings of a word. After these two tasks have been completed, we will train the model again to see what difference these adjustments make.

Once that baseline has been established, we will integrate word embeddings as features, using word2vec (Mikolov et al. 2013) and fastText (Bojanowski et al. 2016). These are both unsupervised machine learning techniques that place words into a high-dimensional vector space based on their context in the text. The words can then be clustered using e.g. k-means clustering, with the idea that similar words are clustered together. See Figure 4 for a two-dimensional (instead of high-dimensional) representation of this idea, where group 1 contains artefact types and group 2 contains materials. The advantage over using a word list is that related concepts not in the list, as well as misspellings of the concept, will also generally be assigned to the same cluster. We hope that this will increase the accuracy of the NER.

Figure 4 

2D representation of clustered word embeddings.
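The clustering idea described above can be sketched as follows. This is a toy example under several assumptions: the two-dimensional vectors are invented for illustration (real word2vec or fastText vectors would have hundreds of dimensions and would be trained on the corpus), the k-means routine is a plain standard-library implementation with fixed starting centroids, and the feature name `word.cluster` is hypothetical rather than the one used in the project.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) with caller-supplied starting centroids."""
    centroids = [list(c) for c in centroids]  # copy, so the caller's data is untouched
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


# Invented 2D "embeddings"; "byl" stands in for a misspelling of "bijl" (axe).
embeddings = {
    "bijl": [0.90, 0.10],
    "byl": [0.85, 0.15],
    "mes": [0.80, 0.20],
    "vuursteen": [0.10, 0.90],
    "brons": [0.20, 0.80],
    "ijzer": [0.15, 0.85],
}
words = list(embeddings)
clusters = kmeans(
    [embeddings[w] for w in words],
    centroids=[embeddings["bijl"], embeddings["vuursteen"]],
)
cluster_of = dict(zip(words, clusters))


def token_features(token):
    """Feature dict for one token; unknown words get cluster id -1."""
    return {"word.lower": token.lower(), "word.cluster": str(cluster_of.get(token.lower(), -1))}


print(token_features("byl"))
```

Because the misspelling lands in the same cluster as the correctly spelled word, both tokens receive the same `word.cluster` value, which is the property that a fixed word list cannot provide.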

Regarding new features of the front end, the focus group rated map functionality as the highest priority, including searching on a map and displaying results on a map. We are in the early stages of implementing this functionality and hope to present it in a future paper. Integration with common GIS software is another avenue of research. Another feature with high priority is to index the documents by chapter or section, instead of by page as is currently the case.

To further evaluate the system, we will apply future versions to archaeological case studies. The plan is to find a specific archaeological information need, e.g. find all Iron Age cremations in the Netherlands and their geographical positions. We will then compare the results from AGNES with what experts currently know about this topic, and see if a significant increase in knowledge can be detected, probably by calculating the difference and overlap in numbers.
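A minimal sketch of how the difference and overlap in numbers could be calculated is given below, assuming that both the search results and the expert knowledge can be reduced to sets of site or report identifiers. The identifiers and the function name `compare_with_expert_list` are hypothetical.

```python
def compare_with_expert_list(retrieved, known):
    """Split identifiers into overlap, newly found, and missed groups."""
    retrieved, known = set(retrieved), set(known)
    return {
        "overlap": retrieved & known,  # found by the system and already known
        "new": retrieved - known,      # found by the system, not yet known to experts
        "missed": known - retrieved,   # known to experts, not returned by the system
    }


# Hypothetical identifiers for Iron Age cremation sites.
system_results = ["site-001", "site-002", "site-005", "site-008"]
expert_list = ["site-001", "site-002", "site-003"]

comparison = compare_with_expert_list(system_results, expert_list)
for group, ids in comparison.items():
    print(group, len(ids), sorted(ids))
```

The size of the `new` group would indicate a possible increase in knowledge, although each such result would still need to be checked by an expert before being counted as a genuine new find.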

Currently, the system is focused on reports in Dutch, but as this problem is prevalent across the world, we will attempt to make the system multilingual, or at least provide ways of easily adapting the system to other languages.

6 Conclusions

From the user study, it is clear that a system such as AGNES is highly desirable for Dutch archaeology. The features assigned highest priority by the focus group are fairly uniform, which makes planning a roadmap of features straightforward. The first tentative feedback from the focus group is that results in AGNES are relevant to the queries, but more needs to be done to improve the functionality of the system.

From a technical viewpoint, the NER using CRF with a basic feature list resulted in an overall accuracy of 56%, which is a good baseline to build on. Fixing the problems with the gold standard and wordlists, as well as introducing word embeddings as features, should increase the accuracy.

Overall, it seems that AGNES can address the problem of grey literature in Dutch archaeology, although this needs to be evaluated more thoroughly by comparing the results to expert knowledge. The systems developed should also be easy to adapt to other languages and areas. We are hopeful that AGNES will help archaeologists to answer their research questions more effectively and efficiently, leading to a more coherent narrative of the past.

Data Accessibility Statement

The data used in this research (excavation reports) cannot be made publicly available as they are under embargo, under copyright, only available to certain groups, or need to be requested from the authors. Some files are available open access, and these files can be accessed via the DANS EASY portal (DANS 2019). The metadata for all of the reports can be accessed there as well.

Notes

1. Tokenisation is the process of converting a character sequence (text) to individual tokens (words and punctuation). POS tagging is assigning a grammatical part of speech to each token, such as noun, verb, and so on.

2. A measure that indicates how often the algorithm is correct when it classifies a word as an entity.

3. A measure that indicates how many entities are found, out of all the actual entities in the text.

4. Archis is a national database of archaeological sites in the Netherlands, maintained by the RCE, in Dutch (Rijksdienst voor het Cultureel Erfgoed 2019).

Competing Interests

The authors have no competing interests to declare.

References

  1. Amrani, A, Abajian, V and Kodratoff, Y. 2008. A chain of text-mining to extract information in archaeology. In: Annual IEEE Computer Conference, International Conference on Information and Communication Technologies: From Theory to Applications, and ICTTA (eds.), 3rd International Conference on Information and Communication Technologies: from Theory to Applications, 2008 ICTTA 2008; 7–11 April 2008, [Damascus, Syria], 1–5. IEEE. DOI: https://doi.org/10.1109/ICTTA.2008.4529905 

  2. Bojanowski, P, Grave, E, Joulin, A and Mikolov, T. 2016. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5(1): 135–146. 

  3. Byrne, K and Klein, E. 2010. Automatic extraction of archaeological events from text. In: Frischer, B, Crawford, J and Koller, D (eds.), Making History Interactive: Computer Applications and Quantitative Methods in Archaeology 2009, 48–56. Oxford: Archaeopress. 

  4. Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1): 37–46. DOI: https://doi.org/10.1177/001316446002000104 

  5. Concrete5. 2018. Concrete5 is a free CMS Open Source Content Management System. Available at: https://www.concrete5.org/ [Last accessed 27 June 2018]. 

  6. Council of Europe. 1992. European Convention on the Protection of the Archaeological Heritage (Revised). Available at: https://www.coe.int/en/web/conventions/full-list/-/conventions/treaty/143 [Last accessed 27 June 2018]. 

  7. Cunningham, H, Tablan, V, Roberts, A and Bontcheva, K. 2013. Getting More Out of Biomedical Documents with GATE’s Full Lifecycle Open Source Text Analytics. PLoS Comput Biol 9(2): e1002854. DOI: https://doi.org/10.1371/journal.pcbi.1002854 

  8. DANS. 2019. Easy. Available at: https://easy.dans.knaw.nl [Last accessed 27 June 2018]. 

  9. ElasticSearch. 2018. Theory Behind Relevance Scoring. Available at: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html [Last accessed 27 June 2018]. 

  10. Epure, EV, Martín-Rodilla, P, Hug, C, Deneckère, R and Salinesi, C. 2015. Automatic process model discovery from textual methodologies. In: 2015 IEEE 9th International Conference on Research Challenges in Information Science (RCIS 2015), 19–30. IEEE. DOI: https://doi.org/10.1109/RCIS.2015.7128860 

  11. Evans, TN. 2015. A Reassessment of Archaeological Grey Literature: Semantics and paradoxes. Internet Archaeology, 40. DOI: https://doi.org/10.11141/ia.40.6 

  12. Gormley, C and Tong, Z. 2015. Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine. Sebastopol: O’Reilly Media. 

  13. Jeffrey, S, Richards, J, Ciravegna, F, Waller, S, Chapman, S and Zhang, Z. 2009. The Archaeotools project: Faceted classification and natural language processing in an archaeological context. Philosophical transactions. Series A, Mathematical, physical, and engineering sciences, 367(1897): 2507–19. DOI: https://doi.org/10.1098/rsta.2009.0038

  14. Kintigh, KW. 2015. Extracting Information from Archaeological Texts. Open Archaeology, 1(1): 96–101. DOI: https://doi.org/10.1515/opar-2015-0004 

  15. Lafferty, J, McCallum, A and Pereira, FCN. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Brodley, CE and Pohoreckyj, DA (eds.), ICML ’01 Proceedings of the Eighteenth International Conference on Machine Learning June 28 – July 01, 2001, 282–289. San Francisco: Morgan Kaufmann Publishers Inc.

  16. Marrero, M, Urbano, J, Sánchez-Cuadrado, S, Morato, J and Gómez-Berbís, JM. 2013. Named Entity Recognition: Fallacies, challenges and opportunities. Computer Standards & Interfaces, 35(5): 482–489. DOI: https://doi.org/10.1016/j.csi.2012.09.004 

  17. Mikolov, T, Chen, K, Corrado, G and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. Poster presented at the International Conference on Learning Representations 2013. Available at: https://arxiv.org/abs/1301.3781 [Last accessed 27 June 2018]. 

  18. Ministerie van Onderwijs, Cultuur en Wetenschap. 2007. Wet op de archeologische monumentenzorg. In: Staatsblad 42, Den Haag: Ministerie van Onderwijs, Cultuur en Wetenschap. Available at: https://wetten.overheid.nl/BWBR0021162/2008-01-01 [Last accessed 27 June 2018]. 

  19. Moretti, M, Sprugnoli, R, Menini, S and Tonelli, S. 2016. ALCIDE: Extracting and visualising content from large document collections to support Humanities studies. Knowledge-Based Systems, 111: 100–112. DOI: https://doi.org/10.1016/j.knosys.2016.08.003 

  20. Okazaki, N. 2007. CRFsuite: a fast implementation of Conditional Random Fields (CRFs). Available at: https://www.chokkan.org/software/crfsuite/ [Last accessed 27 June 2018]. 

  21. Paijmans, H and Brandsen, A. 2009. What is in a Name: Recognizing Monument Names from Free-Text Monument Descriptions. In: Van Erp, M, Stehouwerm, J and Van Zaanen, M (eds.), Proceedings of the 18th Annual Belgian-Dutch Conference on Machine Learning (Benelearn), 2–6. Tilburg: Tilburg Center for Creative Computing. 

  22. Paijmans, H and Brandsen, A. 2010. Searching in archaeological texts: Problems and solutions using an artificial intelligence approach. PalArch’s Journal of Vertebrate Palaeontology, 7(2): 1–6. https://www.persistent-identifier.nl/urn:nbn:nl:ui:12-4578352. 

  23. Paijmans, H and Wubben, H. 2008. Preparing archeological reports for intelligent retrieval. In: Posluschny, A, Lambers, K and Herzog, I (eds.), Layers of Perception. Proceedings of the 35th International Conference on Computer Applications and Quantitative Methods in Archaeology (CAA), 212–217. Bonn: Rudolf Habelt Verlag. 

  24. Pedregosa, F, Varoquaux, G, Gramfort, A, Michel, V, Thirion, B, Grisel, O, Blondel, M, Prettenhofer, P, Weiss, R, Dubourg, V, Vanderplas, J, Passos, A, Cournapeau, D, Brucher, M, Perrot, M and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct): 2825–2830. 

  25. RCE. 2017. De Erfgoedmonitor. Available at: https://erfgoedmonitor.nl/indicatoren/archeologisch-onderzoek-aantal-onderzoeksmeldingen [Last accessed 27 June 2018]. 

  26. Richards, J, Tudhope, D and Vlachidis, A. 2015. Text Mining in Archaeology: Extracting Information from Archaeological Reports. In: Barcelo, JA and Bogdanovic, I (eds.), Mathematics and Archaeology, 240–254. Boca Raton: CRC Press. DOI: https://doi.org/10.1201/b18530-15 

  27. Rijksdienst voor het Cultureel Erfgoed. 2019. Archis Invoer. Available at: https://archis.cultureelerfgoed.nl [Last accessed 27 February 2019].

  28. Sorel, D. 2018. jQuery QueryBuilder. Available at: https://querybuilder.js.org/ [Last accessed 27 June 2018]. 

  29. Stichting Infrastructuur Kwaliteitsborging Bodembeheer. 2016. BRL 4000. Available at: https://www.sikb.nl/archeologie/richtlijnen/brl-4000 [Last accessed 27 June 2018]. 

  30. Tong, Z. 2018. elasticsearch-php. Available at: https://github.com/elastic/elasticsearch-php [Last accessed 27 June 2018]. 

  31. Van den Bosch, A, Busser, B, Canisius, S and Daelemans, W. 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. In: Van Eynde, F et al. (eds.). Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, 99–114. Leuven. 

  32. Van den Dries, M. 2016. Is everybody happy? User satisfaction after ten years of quality management in European archaeological heritage management. In: Florjanowicz, P (ed.), When Valletta meets Faro, the reality of European archaeology in the 21st century, proceedings of the International Conference, 126–135. Lisbon: Archaeolingua. 

  33. Van Gompel, M and Reynaert, M. 2013. FoLiA: A practical XML format for linguistic annotation – a descriptive and comparative study. Computational Linguistics in the Netherlands Journal, 3: 63–81. 

  34. Verberne, S, Boves, L and Van den Bosch, A. 2016. Information access in the art history domain: Evaluating a federated search engine for Rembrandt research. Digital Humanities Quarterly, 10(4). 

  35. Vlachidis, A. 2012. Semantic Indexing via Knowledge Organization Systems: Applying the CIDOC-CRM to Archaeological Grey Literature. Unpublished Thesis (PhD), University of South Wales. 

  36. Vlachidis, A and Tudhope, D. 2015. Negation detection and word sense disambiguation in digital archaeology reports for the purposes of semantic annotation. Program, 49(2): 118–134. DOI: https://doi.org/10.1108/PROG-10-2014-0076 

  37. Vlachidis, A and Tudhope, D. 2016. A knowledge-based approach to Information Extraction for semantic interoperability in the archaeology domain. Journal of the Association for Information Science and Technology, 67(5): 1138–1152. DOI: https://doi.org/10.1002/asi.23485 

  38. Vlachidis, A, Tudhope, D, Wansleeben, M, Azzopardi, J, Green, K, Xia, L and Wright, H. 2017. D16.4: Final Report on Natural Language Processing. Technical report, ARIADNE.