User Requirement Solicitation for an Information Retrieval System Applied to Dutch Grey Literature in the Archaeology Domain

In this paper, we present the results of user requirement solicitation for a search system for grey literature in archaeology, specifically Dutch excavation reports. This search system uses Named Entity Recognition and Information Retrieval techniques to create an effective and effortless search experience. Specifically, we used Conditional Random Fields to identify entities, with an average accuracy of 56%. This is a baseline result.

This paper describes the work carried out in the first year of a PhD project on Text Mining in the archaeological domain. This project is in association with both the Faculty of Archaeology and the Data Science Research Programme (DSRP) at the University of Leiden, combining archaeological knowledge with the technical skills available in the Data Science department.
The work carried out in this project is motivated by the need of researchers in the archaeological field to be able to efficiently and effectively find information related to their research questions in the available grey literature. This requirement has been well documented in previous work (e.g. Richards, Tudhope & Vlachidis 2015; Van den Dries 2016) and some studies have investigated different applications of text mining from archaeological reports in English (Vlachidis and Tudhope 2016; Amrani, Abajian & Kodratoff 2008; Byrne and Klein 2010; Vlachidis and Tudhope 2015) and Dutch (Paijmans and Brandsen 2010; Vlachidis et al. 2017).
However, no system is currently available that allows full-text access to a substantial part of the Dutch archaeological document collection. As a result, relevant and valuable information is not being utilised by researchers, mainly by those who are not yet experts in their (sub)field. Information like a single Bronze Age find in an otherwise Medieval site is unlikely to be mentioned in the metadata, and is thus nearly impossible to find. This is a problem from a theoretical point of view, as key information that could change archaeological interpretations may currently be overlooked. It also devalues the monumental effort that has gone into collecting, digitising, archiving and publishing these documents, as well as the legislation that has been drawn up surrounding the archiving of these documents.
More and more text mining, data mining and IR tools and techniques have become available in recent years, which could potentially provide a way to access and extract information from the wealth of data currently hidden in these reports. This, combined with the relatively easy access to higher computer processing power, makes a systematic implementation of text mining techniques for Dutch archaeological reports not only desirable, but also feasible.
In this project we are developing AGNES (Archaeological Grey-literature Named Entity Search), a search system that aims to make archaeological grey literature more accessible and searchable by applying IR techniques to this big dataset.
The goals of this paper are (1) to give an overview of previous work on text mining in archaeology, (2) to show the need for a search system by interviewing the user group, (3) to solicit user requirements for such a system, (4) to present the results of the initial experiments with Named Entity Recognition (NER) and (5) to present the indexing and front end software of the developed system.
Section 2 places the research in context, while section 3 provides some technical background information on the developed system. Section 4 discusses the user study and requirement solicitation.

Data
The data used in this research is a dump of all Dutch archaeology reports from the DANS repository that were available as PDF files in 2016. These include both scanned paper documents that have been OCRed and born-digital documents. It is expected that the OCRed documents contain errors to various degrees, which will complicate any efforts to apply text mining to them. We estimate that only about 15% of the dataset consists of OCRed PDFs, and all new documents will be born-digital, so this percentage will decrease over time.

Prior Work
Some experiments have been carried out in text mining in archaeology, across multiple countries and languages. This includes work by Epure et al. (2015), describing how to mine process models from natural text, and also recent works by Øyvind and Martin-Rodilla presented at EAA 2018, describing automatic information extraction from reports and semiautomatic analysis of heritage-related legal texts, respectively (currently unpublished). Related work is also carried out in other disciplines such as history, a notable example being ALCIDE, a system that extracts and visualises content from large document collections in the history domain (Moretti et al. 2016).
More specifically, some work has been carried out on NER: the finding and classifying of concepts in text. In English, one of the earliest contributions is the work by Amrani, Abajian & Kodratoff (2008), which helped experts to extract information from archaeological literature. Byrne and Klein (2010) also investigated the extraction of information, but focused solely on event information. The OPTIMA system, described by Vlachidis (2012), used a rules-based approach to semantic indexing, including NER. Another notable project is Archaeotools in the UK, which combined databases with information extracted from reports in an interesting faceted browser interface (Jeffrey et al. 2009). A more recent paper is that by Kintigh (2015), which provides a detailed overview of the problems and possible solutions, but does not include the development of a search system.
For Dutch language reports, most of the previous research has been carried out by Paijmans with several collaborators, including extracting monument names from free text fields (Paijmans and Brandsen 2009) and the OpenBoek system, which used memory-based learning to perform NER (Paijmans and Wubben 2008; Paijmans and Brandsen 2010). Like the work by Byrne and Klein (2010), this project focused mainly on time periods, but also applied some rules-based NER to detect place names. The OpenBoek system included an online search interface during the CATCH (Continuous Access To Cultural Heritage) project, but unfortunately this is not available anymore.
Our project builds most directly upon the text mining experiments performed by researchers of the University of South Wales in the European ARIADNE project between 2013 and 2017. They applied a rules-based technique to the problem, utilising the GATE (General Architecture for Text Engineering) framework (Cunningham et al. 2013). As a proof of concept, eight Dutch reports were analysed and compared to manually tagged 'gold-standard' documents, alongside English, Swedish and German reports. In the same project, the ADS (Archaeological Data Service) in the UK applied machine learning techniques to English grey literature, and developed an API that can automatically create metadata based on entered text (Vlachidis et al. 2017).
The contributions of this paper compared to previous work are twofold: (1) it includes a user study, not previously undertaken, to collect the user needs for text mining in the archaeological domain; and (2) it combines the results of the NER with a full-text index in an effective search interface.
More broadly, this project is in cooperation with the DSRP, which gives us access to a high-performance computing cluster, allowing for the use of more computationally expensive techniques on bigger document sets. The length of this project is also an important asset: most previous experiments were performed over a short amount of time, making it difficult to create a finished system, while this project takes place over four years with the specific aim of creating a user-friendly web application.
Introducing AGNES
AGNES stands for Archaeological Grey-literature Named Entity Search, and is the name of the search system currently under development in this project. It comprises both the front end of the web application and the indexing software responsible for finding and indexing archaeological concepts. The logo of the system can be seen in Figure 1. The current version of the system (v0.2) is available at https://agnessearch.nl/index.php/search/. (Please note, free registration is needed to access the system.) The source code will be made available later in the project.

Named Entity Recognition
A standard full-text index, allowing researchers to search through all of the text instead of just the metadata, would already be an improvement on the current situation. However, such a full-text search would not account for synonymy and polysemy: multiple words that have the same meaning and one word having multiple meanings, respectively. See Table 1 for two non-exhaustive examples, where a full-text search would either not return all results, or return possibly wrong results. This is why NER is needed to accurately index these documents.
NER is a method that aims to identify and classify specific entities in natural language, also known as unstructured written text (Marrero et al. 2013). In this project, the entities are archaeological concepts, and the natural language text consists of excavation reports.
To give an example, in the sentence 'We found pottery dating from the Neolithic inside a rubbish pit', the entities are 'pottery', 'Neolithic' and 'rubbish pit': an artefact, a time period and a feature/context, respectively.
In the current version of the system, we used Conditional Random Fields (CRF) to train the named entity recogniser (Okazaki 2007). This is a form of machine learning specifically designed to label sequence data (Lafferty et al. 2001), a common choice for NER tasks as words in a sentence are sequential. We used the scikit-learn Python package (Pedregosa et al. 2011), with the default algorithm (gradient descent using the L-BFGS method). The input for this algorithm was a set of manually tagged Dutch reports created in the ARIADNE project (Vlachidis et al. 2017), specifically selected to be a good sample of the corpus. In total, this training set consists of roughly 500,000 words, containing 11,000 tagged entities. Some issues with these documents are discussed later in this section.
The annotated .docx files were tokenised and Part Of Speech (POS) tagged using Frog (Van den Bosch et al. 2007) and then converted to the FoLiA XML format (Van Gompel and Reynaert 2013). These steps are needed as CRF requires the input to be tokenised and POS tagged. Subsequently, the documents were converted to the format scikit-learn requires: a list of tokens including each token's POS and category (or concept) tag. At the moment, only three archaeological categories are used, as these have the most training data available: artefact, time period and material. More categories will be added in later versions. For each token, a set of features was extracted for the word itself, as well as for the word before and after it.
When assessing the results of the NER, we discovered some issues with the gold standard documents that could affect the accuracy. Some tagging decisions expand entities to the left or right. For example, wherever the word 'before' or 'after' occurs before a time period, it is included in the tag, while ideally it should not be, as it is not part of the time period itself (e.g. na de 3e eeuw, 'after the 3rd century'). If the NER then fails to classify these prefixes as part of the entity, the recall will be lower than the precision, which can also be seen in our results.
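The windowed feature extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the individual features (lower-cased word, capitalisation, digit check, three-letter suffix, POS tag), the function names and the example POS tags are all hypothetical and are not taken from the AGNES code. The output is one feature dictionary per token, a shape that CRF toolkits commonly accept.

```python
def token_features(sentence, i):
    """Build a feature dict for token i, including its left and right neighbour.

    `sentence` is a list of (word, pos_tag) pairs. The features are
    illustrative assumptions, not the feature set used in AGNES.
    """
    features = {}
    # Offsets -1, 0 and +1 cover the previous, current and next word.
    for offset in (-1, 0, 1):
        j = i + offset
        prefix = f"{offset:+d}:"
        if j < 0:
            features[prefix + "BOS"] = True  # position before the sentence start
            continue
        if j >= len(sentence):
            features[prefix + "EOS"] = True  # position past the sentence end
            continue
        word, pos = sentence[j]
        features[prefix + "lower"] = word.lower()
        features[prefix + "is_title"] = word.istitle()
        features[prefix + "is_digit"] = word.isdigit()
        features[prefix + "suffix3"] = word[-3:].lower()
        features[prefix + "pos"] = pos
    return features


def sentence_features(sentence):
    """Return one feature dict per token in the sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Hypothetical tokenised and POS-tagged fragment ('after the 3rd century').
example = [("na", "VZ"), ("de", "LID"), ("3e", "TW"), ("eeuw", "N")]
feats = sentence_features(example)
```

Each dictionary carries the neighbouring words' features under the `-1:` and `+1:` prefixes, so the model can learn, for instance, that a preposition directly before a numeral often precedes a time period.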
The artefact, time period and material wordlists were taken from the Archeologisch Basis Register (ABR), a thesaurus for Dutch archaeology maintained by the RCE. It contains phrases written in a form that does not match how they appear in natural language. For example, the entry for 'doorboorde bijl' (perforated axe) is 'bijl, doorboord' (axe, perforated) in the thesaurus, making it difficult to match the two. These two issues are further discussed in section 5.
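To make the mismatch between thesaurus entries and running text concrete, the sketch below reorders an inverted entry and compares it loosely with a phrase from a report. It is a hypothetical workaround for illustration only: the function names are invented, and the rule that a word in the text may carry an extra inflectional '-e' is a deliberately naive assumption about Dutch adjectives, not the approach taken in the project.

```python
def natural_order(entry):
    """Turn an inverted thesaurus entry 'head, modifier' into 'modifier head'."""
    if "," not in entry:
        return entry.strip()
    head, modifier = (part.strip() for part in entry.split(",", 1))
    return f"{modifier} {head}"


def loose_match(entry, phrase):
    """Compare word by word, allowing the text word to end in an extra '-e'.

    Naive assumption about Dutch inflection; a lemmatiser would be more robust.
    """
    entry_words = natural_order(entry).lower().split()
    phrase_words = phrase.lower().split()
    if len(entry_words) != len(phrase_words):
        return False
    return all(p == e or p == e + "e" for e, p in zip(entry_words, phrase_words))


reordered = natural_order("bijl, doorboord")
matches = loose_match("bijl, doorboord", "doorboorde bijl")
```

A rule like this would miss irregular inflections and multi-word modifiers, which is one reason a lemma field in the thesaurus itself would be the cleaner solution.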
The code described in this section is available at DOI: 10.5281/zenodo.1238861.

Indexing & Front End
For this version of AGNES, 100 randomly selected reports from the DANS repository were indexed. For each page in these documents, the trained CRF model is used to extract the named entities. These are combined with the full text of the page and converted into a JSON structure, which can then be indexed directly by ElasticSearch (Gormley and Tong 2015), an open source search engine running on a web server. ElasticSearch uses JSON over HTTP to index and retrieve information, making it suitable for integration with other systems. Another advantage of ElasticSearch is that it includes a number of default features that are very useful for this kind of search system, including result ranking.
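As an illustration of the per-page JSON structure described above, the following sketch combines a page's full text with its recognised entities. All field names (`report_id`, `page`, `fulltext`, `artefact`, `period`, `material`), the function name and the report identifier are assumptions made for this example; the actual AGNES index mapping is not specified here.

```python
import json


def page_document(report_id, page_number, text, entities):
    """Combine a page's full text with its recognised entities.

    `entities` is a list of (category, surface_text) pairs as a NER model
    might return them. All field names are hypothetical.
    """
    doc = {
        "report_id": report_id,
        "page": page_number,
        "fulltext": text,
        "artefact": [],
        "period": [],
        "material": [],
    }
    for category, surface in entities:
        # Ignore categories that have no field in this sketch.
        if category in ("artefact", "period", "material"):
            doc[category].append(surface.lower())
    return doc


doc = page_document(
    "example-report-001",  # made-up identifier
    12,
    "We found pottery dating from the Neolithic inside a rubbish pit.",
    [("artefact", "pottery"), ("period", "Neolithic")],
)
payload = json.dumps(doc, ensure_ascii=False)
```

Keeping each entity category in its own field is what would allow a query to target, say, only time periods rather than any occurrence of a word in the full text.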
To query the index, a front end user interface has been developed. As a framework for the web application, the free and open source content management system Concrete5 was used (concrete5 2018).
To create a query, the user can use a query builder (Sorel 2018) that allows for boolean AND/OR logic. They can specify exactly which entity they are looking for in each part of the query, or select a general full-text search (see Figure 2). This allows for complex queries such as: artefact:scraper AND (period:neolithic OR period:mesolithic) AND fulltext:burnt which returns results on scrapers from the Neolithic or Mesolithic that also mention 'burnt'.
The query is then converted to a JSON format by the front-end application, and the ElasticSearch index is queried using the ElasticSearch-PHP client (Tong 2018), resulting in a list of matching results. It is useful to rank and sort these results by relevance, so the documents that are most likely to be relevant to a query are at the top of the list. To do this, ElasticSearch calculates a score for each result, which is based on the importance of each query term that appears in that document (ElasticSearch 2018).
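The conversion of a boolean query into JSON can be sketched with ElasticSearch's `bool` query, where `must` expresses AND and `should` with `minimum_should_match` expresses OR. This is a minimal sketch, written in Python rather than the PHP used by the front end, and the field names (`artefact`, `period`, `fulltext`) and helper functions are assumptions; the JSON actually generated by AGNES may differ.

```python
import json


def term(field, value):
    """A single 'field:value' clause expressed as a match query."""
    return {"match": {field: value}}


def all_of(*clauses):
    """Boolean AND: every clause must match."""
    return {"bool": {"must": list(clauses)}}


def any_of(*clauses):
    """Boolean OR: at least one clause must match."""
    return {"bool": {"should": list(clauses), "minimum_should_match": 1}}


# artefact:scraper AND (period:neolithic OR period:mesolithic) AND fulltext:burnt
query = {
    "query": all_of(
        term("artefact", "scraper"),
        any_of(term("period", "neolithic"), term("period", "mesolithic")),
        term("fulltext", "burnt"),
    )
}
body = json.dumps(query)
```

Because the clauses nest, the parenthesised OR group becomes an inner `bool` query inside the outer `must` list, mirroring the structure the user assembles in the query builder.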
Once the results are displayed, the user can view a snippet of the text surrounding the keywords, preview the page of the report or go directly to the project archive in the DANS repository to download the PDF document. No PDFs are hosted on the AGNES server, in order to respect the copyright on these files. A graphical representation of the full workflow of AGNES can be found in Figure 3, which also displays the split between pre-processing of the documents on a high-performance cluster, and the indexing and querying that takes place on a standard web server.

User Study
Part of this research includes a user study, to ensure the needs of the potential users are met. The focus group, as well as the methods and results of the first workshop, are detailed below.

Definition of target audience
To be able to make an effective search system, it is necessary to define the expected users of the system. As the main goal of this system is to make information available for research, the expected user is a researcher working in Dutch archaeology. These researchers can work in a variety of organisations, including academia, commercial archaeology and regional/national government. One of the main user groups expected to use this system consists of academics and people in higher education. However, this group is not homogeneous: a professor, for example, will have much more in-depth knowledge and will already be aware of most of the literature and field reports related to their field, in stark contrast to a bachelor or PhD student who will still be exploring the literature and information available. Because of this difference in knowledge, these users will ask different questions of the dataset and in different ways. However, regardless of their knowledge level it is expected that academic researchers will generally be asking thematic questions of the dataset: questions about a particular time period, artefact type, context and/or location.
Another main user group is researchers in Dutch commercial archaeology. While this group will also be interested in the documents, it is likely that they will mainly want to use the system to find all information about a particular geographic area. This is because the main use of these reports for commercial archaeologists is to create desk assessments (bureau-onderzoeken) and archaeological prediction/expectation maps (archeologische verwachtingskaarten) about a specific area, generally because the area surrounds a potential building site. As some maps are also created by period, combined queries of place and time are also expected. There are three types of commercial archaeology, each of which is expected to have slightly different needs and requirements. These three types are inventorisation (investigating existing research), exploration or prospection (e.g. surveys and coring) and excavation (generally after the previous two types have been completed).
A third expected user group is municipal and regional (or provincial) archaeologists. Their requirements will most probably fall in between those of academic and commercial archaeologists. While generally they will research a certain timespan in a particular area, it is likely that they will also want to research broader themes. However, they will generally be aware of all the available literature in their area already, so perhaps a search system is less useful for this group.
Researchers at the RCE are a fourth user group, and will probably have similar needs to municipal archaeologists, except that they are working on a countrywide geographical scale. These researchers will commonly work on nationwide synthesising research, combining the information from a large number of reports into a larger picture.
Outside of the archaeological sphere, it is possible that the system will also be used by historians researching specific time periods such as the Middle Ages, where there is an overlap between archaeology and history. It is expected these scholars will have similar requirements to archaeological academics.
Lastly, it is possible that this system might be used by amateur archaeologists, amateur historians, metal detectorists and other enthusiasts, for a variety of reasons.

Focus group
In order to collect the requirements of archaeologists in the Netherlands, a focus group was set up. Members of the focus group participated on a voluntary basis. This group's function at the start of the project is to provide their needs and wishes for a system like this, while in further stages of the project they can provide feedback on the developed features. The size and make-up of this group is flexible, and can be changed during the project to fit with the current goals and/or address issues of representativeness.
The focus group has been selected to be as representative as possible for the Dutch archaeological landscape, taking into account the target audience definition from section 4.1. The group consists of 5 academics, 2 commercial professionals and 2 archaeologists working on different levels in government. See Table 3 for a more detailed breakdown of the participants.
No amateur researchers were selected for the focus group, mainly because they are not an intended user of the system, but also because their approaches to research are so wide-ranging that it would be virtually impossible to assemble a representative group of people.

Prototype for discussion
From personal experience in commercial software development, as well as experiences from IR researchers in other fields (e.g. Verberne, Boves & van den Bosch 2016), it seems that users in general, and users from the humanities specifically, find it difficult to express their requirements, oftentimes resulting in broad requirements that are too vague to interpret and implement. This can be further compounded by a lack of understanding of what is technically possible, leading to overly optimistic or very cautious expectations. We therefore first created a prototype with limited functionality (as discussed in section 3) as a starting point for discussions, in order to elicit feedback that is more detailed and can be implemented properly.

Workshops
The focus group will gather once a year during the project, for a total of 4 workshops. The initial workshop has been conducted, with the main aim of soliciting the requirements of the users. Later workshops will focus more on assessing the system and its results. Minutes will be taken at each session to record the comments and feedback of the group, and these will be made public after anonymisation.
The first workshop started with an introduction to the problem, as well as some background information on IR and NER (see also section 3.1). The group was then asked what their current search behaviour is, and what problems they encounter, before being shown a prototype of the system (v0.2) and asked to provide feedback on both the functionality and the relevance of the results.
Finally, specific user requirements were discussed. A suggested list of features was provided to the participants, who then discussed amongst themselves in groups of 2 which features they would find most useful, rating each on a scale of 0 to 3, with 0 being not useful or relevant at all, and 3 being very useful and high priority. The participants were also asked to think of features not currently on the list.

Results
From comments of the group, it was clear that the grey literature problem is very familiar to everyone present. Feedback on their current search behaviour showed that most people use the DANS search functionality (found at DANS 2019) and find it insufficient for their search needs, with most people having to manually search through individual documents to find information. Some participants, instead of using DANS, usually ask experts in the field to provide them with references. The Archis 4 system is used to a lesser degree, again mainly because the search functionality is not sufficient. Some people explained that they create their own literature lists with keywords to be able to find materials previously accessed. Initial feedback on the prototype indicates that the users find the returned results relevant to their queries; however, much improvement is needed on the front end, as further discussed in the next paragraph.
The feature elicitation produced clear results: participants unanimously agreed that indexing by chapter and section would be more useful than indexing by page or document, and that this should be a high priority in the further development of AGNES. Another high priority feature across the group was searching by drawing a polygon on a map, as well as plotting results on a map, an indication that archaeologists have a strong need for geographical search. Another interesting result is that, in general, participants preferred getting many results, including some irrelevant documents, over getting a smaller set of documents that are all relevant but with the risk of missing some documents. This means that the recall of the system is more important than the precision, which needs to be taken into account in assessing the results of the NER as well as in the overall system evaluation. For a full overview of the averaged result for each feature, please see Table 4.

Future Work
One of the goals of the project is to expand the corpus from just the DANS documents to also include documents from the RCE and the Koninklijke Bibliotheek, and to create a pipeline or API that allows new documents added to these 3 repositories to be automatically added to the index. The work discussed in this paper is the result of the first year of a four-year project. Each year, a new version will be developed, tested, and assessed by the focus group. Part of this evaluation will include a threats-to-validity study.
The first issue that needs to be resolved is the training data for the NER. It seems that entities have been tagged sub-optimally for the NER task, and it is expected that improving the annotations will increase the accuracy of the model. We are currently enlisting the help of a group of archaeology students to re-tag these documents, and possibly tag new documents as well. We will have multiple people tag the same documents, so we can calculate the inter-rater agreement (Cohen 1960): a measure of the extent to which two humans agree on the annotation, which is an indirect indication of the difficulty of the task.
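For two annotators, the agreement measure of Cohen (1960) is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of items on which the annotators agree and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below computes it over token-level labels; the label names and the two example annotations are invented for illustration, and treating each token as one item is an assumption about how agreement would be measured.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented token-level annotations from two hypothetical annotators.
annotator_a = ["O", "O", "PERIOD", "PERIOD", "ARTEFACT", "O", "MATERIAL", "O"]
annotator_b = ["O", "PERIOD", "PERIOD", "PERIOD", "ARTEFACT", "O", "O", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

In this example the annotators agree on 6 of 8 tokens (p_o = 0.75), chance agreement is 23/64, and kappa is therefore 25/41, or roughly 0.61.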
The other problem that will be addressed is the shortcomings of the ABR wordlists (see the example of 'perforated axe' in section 3.1). We are currently in discussion with the Rijksdienst voor Cultureel Erfgoed (RCE), who manage these lists, to see if it is possible to add a new field for either the lemma of the word or multiple spellings of a word. After these two tasks have been completed, we will train the model again to see what difference these adjustments make.
Once that baseline has been established, we will integrate word embeddings as features, using word2vec (Mikolov et al. 2013) and fastText (Bojanowski et al. 2016). These are both unsupervised machine learning techniques that place words into a high-dimensional vector space based on their context in the text. The words can then be clustered using e.g. k-means clustering, with the idea that similar words are clustered together. See Figure 4 for a two-dimensional (instead of high-dimensional) representation of this idea, where group 1 contains artefact types and group 2 contains materials. The advantage over using a word list is that related concepts not in the list, as well as misspellings of the concept, will also generally be assigned to the same cluster. Hopefully, this will increase the accuracy of the NER.
Regarding new features of the front end, according to the focus group the map functionality is the most needed, including searching on a map and displaying results on a map. We are in the early stages of implementing this functionality and will hopefully present this in a future paper. Integration with common GIS systems is another avenue of research. Another feature with high priority is to index the documents by chapter or section, instead of by page as is currently the case.
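Returning to the word embeddings discussed above, the clustering step can be illustrated with a small k-means sketch in two dimensions. The vectors below are invented toy values, not output from word2vec or fastText, and the initial centroids are fixed by hand so that the example is deterministic; real embeddings would have hundreds of dimensions and would normally be clustered with an established library.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: returns the cluster index assigned to each point."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return assignment


# Invented 2D "embeddings": three artefact types and three materials.
vectors = {
    "bijl": (1.0, 1.2), "schraper": (1.2, 0.9), "kling": (0.8, 1.0),
    "vuursteen": (4.0, 4.2), "brons": (4.3, 3.9), "ijzer": (3.9, 4.1),
}
words = list(vectors)
points = [vectors[w] for w in words]
labels = kmeans(points, centroids=[points[0], points[-1]])

clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
```

The resulting cluster index of a word could then be used as an additional categorical feature for the CRF, in place of or alongside a wordlist lookup.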
To further evaluate the system, we will apply future versions to archaeological case studies. The plan is to find a specific archaeological information need, e.g. find all Iron Age cremations in the Netherlands and their geographical positions. We will then compare the results from AGNES with what experts currently know about this topic, and see if a significant increase in knowledge can be detected, probably by calculating the difference and overlap in numbers.
Currently, the system is focused on reports in Dutch, but as this problem is prevalent across the world, we will attempt to make the system multi-lingual, or at least provide ways of easily adapting the system to other languages.

Conclusions
From the user study, it is clear that a system such as AGNES is highly desirable for Dutch archaeology. The features assigned highest priority by the focus group are fairly uniform, which makes planning a roadmap of features straightforward. The first tentative feedback from the focus group is that results in AGNES are relevant to the queries, but more needs to be done to improve the functionality of the system.
From a technical viewpoint, the NER using CRF with a basic feature list resulted in an overall accuracy of 56%: a good baseline to build on. Fixing the problems with the gold standard and wordlists, as well as introducing word embeddings as features, should increase the accuracy.
Overall, it seems that AGNES can address the problem of grey literature in Dutch archaeology, although this needs to be evaluated more thoroughly by comparing the results to expert knowledge. The systems developed should also be easy to adapt to other languages and areas. We are hopeful that AGNES will help archaeologists to answer their research questions more effectively and efficiently, leading to a more coherent narrative of the past.

Figure 2: AGNES front end screenshot showing a query for flint flakes from the Neolithic or Bronze Age.

As can be seen from Table 2, the average precision is fairly high at 71%, but the recall is much lower at only 48%. This means that 71% of the automatically labeled entities are correct, but only 48% of all entities present were found by the automatic labeling.
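To make these measures concrete, the sketch below computes precision, recall and F1 from a set of gold-standard entities and a set of predicted entities. It assumes exact-match scoring, where an entity counts as correct only if its start, end and category all agree with the gold standard; whether the reported figures were computed this way is an assumption, and the example spans are invented.

```python
def evaluate(gold, predicted):
    """Exact-match entity evaluation.

    Both arguments are sets of (start, end, category) tuples.
    Returns (precision, recall, f1).
    """
    true_positives = len(gold & predicted)
    # Precision: share of predicted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold entities that were found.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: four gold entities, three predictions, two of them exact.
gold = {(0, 1, "artefact"), (4, 5, "period"), (8, 10, "material"), (12, 13, "artefact")}
predicted = {(0, 1, "artefact"), (4, 5, "period"), (9, 10, "material")}
precision, recall, f1 = evaluate(gold, predicted)
```

Note that the third prediction overlaps a gold entity but has a different start offset, so under exact matching it counts as an error; this is the kind of boundary disagreement that inconsistent tagging of entity edges can cause.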

Table 2: Precision, recall and F1-scores for the 3 targeted entities, on a scale of 0 to 1.

Table 3: Overview of participants in focus group per category.

Table 4: Features and average scores (0-3) across focus group (n = 9), in decreasing order of average score. Facets mean the option for users to refine results by selecting metadata categories, as often found on online shopping websites. An asterisk (*) indicates a feature suggested by a user.