Combined Detection and Segmentation of Archeological Structures from LiDAR Data Using a Deep Learning Approach

Until recently, archeological prospection using LiDAR data relied mainly on expert-based, time-consuming visual analyses. Currently, deep learning convolutional neural networks (deep CNN) are showing potential for automatic detection of objects in many fields of application, including cultural heritage. However, these computer-vision-based algorithms remain strongly restricted by the large number of samples required to train models and the need to define target classes before using the models. Moreover, the methods used to date for archaeological prospection are limited to detecting objects and cannot (semi-)automatically characterize the structures of interest. In this study, we assess the contribution of deep learning methods for detecting and characterizing archeological structures by performing object segmentation using a deep CNN approach with transfer learning. The approach was applied to a terrain visualization image derived from airborne LiDAR data within a 200 km² area in Brittany, France. Our study reveals that the approach can accurately (semi-)automatically detect, delineate, and characterize topographic anomalies, and thus provides an effective tool to inventory many archaeological structures. These results provide new perspectives for large-scale archaeological mapping.
CORRESPONDING AUTHOR: Alexandre Guyot, Laboratoire LETG – UMR 6554, Université Rennes 2, FR, alexandre.guyot@univrennes2.fr


INTRODUCTION
The past decade has seen an increasing interest in remote-sensing technologies and methods for monitoring cultural heritage. One of the most relevant changes is the development of airborne light detection and ranging (LiDAR) systems (ALS). With the ability to measure topography accurately and penetrate the canopy, ALS has been a key tool for important archaeological discoveries and a better understanding of past human activities by analyzing the landscape (Bewley, Crutchley and Shell 2005; Chase et al. 2011; Evans et al. 2013; Inomata et al. 2020) in challenging environments.
Most archaeological mapping programs based on ALS do not use LiDAR 3D point clouds directly, but use instead derived elevation models that represent bare soil in the topographic landscape. Perception of the terrain is usually enhanced by specific visualization techniques (VT) (Bennett et al. 2012; Devereux, Amable and Crow 2008; Doneus 2013; Hesse 2010; Štular et al. 2012) that are used to visually interpret landforms and archaeological structures (Kokalj and Hesse 2017). These visualizations have resulted in better understanding of the human past in different periods and different regions of the world. For example, LiDAR-derived terrain combined with VT has been used to provide new insights into a prehistoric hillfort under a woodland canopy in England (Devereux et al. 2005), discover a pre-colonial capital in South Africa (Sadr 2019), supplement large-scale analysis of a human-modified landscape in a Mayan archaeological site in Belize (Chase et al. 2011) and explore long-term human-environment interactions within the former Khmer Empire in Cambodia (Evans 2016). However, these expert-based and time-consuming approaches are difficult to replicate in large-scale archaeological prospection projects.
A variety of (semi-)automatic feature-extraction methods have been developed to assist or supplement these visual interpretation approaches. Object-based image analysis (Freeland et al. 2016) and template-matching (Trier and Pilø 2012) methods, which rely on prior definition of purpose-built spatial descriptors or prototypical patterns, respectively, are difficult to generalize because they cannot include the high morphological diversity and heterogeneous backgrounds of archaeological structures (Opitz and Herrmann 2018). Supervised machine-learning methods have been assessed to address these limitations (Lambers, Verschoof-van der Vaart and Bourgeois 2019). Data-driven classifiers (e.g. random forest, support vector machine) applied to multi-scale topographic or morphometric variables have provided interesting results for detecting archeological structures (Guyot, Hubert-Moy and Lorho 2018; Niculiță 2020). However, detection was either performed at the pixel level without considering the target as an entire object (archaeological structure) with spatial aggregation and internal complexities, or was based on previous image segmentation, which prevents these methods from being applied to complex structures.
In recent years, deep learning Convolutional Neural Networks (deep CNNs) have resulted in a new paradigm in image analysis and provided groundbreaking results in image classification (Krizhevsky, Sutskever and Hinton 2012) or object detection (Girshick 2015). Deep CNNs are composed of multiple processing layers that can learn representations of data with multiple levels of abstraction (LeCun, Bengio and Hinton 2015). In the context of LiDAR-based archaeological prospection, they were first applied in 2016 (Due Trier, Salberg and Holger Pilø 2016) to detect charcoal kilns and were further evaluated in different archaeological contexts and configurations (Caspari and Crespo 2019; Gallwey et al. 2019; Kazimi et al. 2018; Trier, Cowley and Waldeland 2018; Verschoof-van der Vaart et al. 2020; Verschoof-van der Vaart and Lambers 2019). These studies focused on image classification (predicting a label/class associated with an image) (Figure 1a) or object detection (predicting the location (i.e. bounding box (BBOX)) of one or several objects of interest within the image) (Figure 1b). While these deep CNN methods have detected archaeological structures adequately, they could not provide information that (semi-)automatically characterized them because structures must be delineated to move from detection to characterization. Recent deep CNN methods, such as Mask R-CNN (He et al. 2017), have object-segmentation abilities (Figure 1c) that delineate objects.
These deep CNN methods remain strongly restricted by the large number of samples required to train models and the need to define target classes before using the models. While the lack of ground-truth samples (reference data) is a known constraint in remote-sensing archaeological prospection, two strategies can address this limitation: transfer learning and data augmentation. The first strategy applies a pre-trained source domain model to initialize a targeted domain model (Weiss, Khoshgoftaar and Wang 2016), while the second strategy uses transformations that modify input data for training. These strategies are known to improve model performance for small datasets and to increase model generalization (Shorten and Khoshgoftaar 2019). The need to define target classes before using a model can be relaxed with one-class approaches, which define only a generic "archeological structure" class without dividing it into several sub-classes, on the assumption that object characterization can subsequently identify types of archaeological structures.
Using deep CNN for archaeological prospection of LiDAR-derived terrain (Caspari and Crespo 2019; Gallwey et al. 2019; Küçükdemirci and Sarris 2020; Soroush et al. 2020; Trier, Cowley and Waldeland 2018; Verschoof-van der Vaart et al. 2020; Verschoof-van der Vaart and Lambers 2019) is in its infancy, and to our knowledge, these studies have not evaluated the object-segmentation abilities of the CNN, except for the evaluation of Mask R-CNN for simple circular-based landforms (Kazimi, Thiemann and Sester 2019; Kazimi, Thiemann and Sester 2020). In the present study, we assess the contribution of deep CNN to the combined detection and segmentation of archeological structures for further (semi-)automatic characterization.
More specifically, we aim to provide new insights into object segmentation using deep CNN for archaeological prospection to address two key issues: i) the extent to which the approach is sensitive to the amount of sample data, since data are a sparse resource in archaeology, and ii) after object detection, the utility of object segmentation for characterizing archaeological structures.

STUDY AREA
The study area (Figure 2) is located in southern Morbihan (Brittany, France) and covers an area between the Ria of Etel and the Rhuys Peninsula on the Atlantic coast. The region is a complex and fragmented mosaic of landscapes. The hinterland is composed of woodlands, moorlands and farmland that form a rural environment oriented to agriculture. The coastal area is also diverse, with estuaries and small islands near the intricate Gulf of Morbihan and large open, sandy areas in the Bay of Quiberon, which concentrates most of the economic activities of tourism and fisheries.
The area is home to a unique megalithic heritage. Erected between the 5th and 3rd millennia BC, the Neolithic architecture (standing stones and megalithic tombs) represents an exceptional corpus of archaeological sites that are candidates for the UNESCO World Heritage List. Beyond this emblematic heritage, the coast of Morbihan includes a wide variety of archaeological sites that marked the gentle topography of the area and encompass different prehistorical and historical periods.

LiDAR-derived visualization image
The workflow for processing LiDAR data consisted of several steps (Figure 3). The image dataset was derived from a LiDAR point cloud collected over the area in 2016 (200 km², excluding water area). The raw point cloud was collected from a bispectral (1064 and 532 nm) Optech Titan LiDAR sensor operated from a fixed-wing aircraft 1300 m above ground level at a pulse repetition frequency of 300 kHz per channel and a 26° field of view to obtain a nominal point density of 14 points/m². The recorded 3D point cloud was processed with LasTools (rapidlasso GmbH, Gilching, Germany) to perform ground-filtering and gridding operations to create a Digital Terrain Model (DTM) at a spatial resolution of 50 cm (Guyot, Hubert-Moy and Lorho 2018). The terrain model was then used to perform two VTs.
First, a multiscale topographic position (MSTP) image (Lindsay, Cockburn and Russell 2015) was created based on a previous archaeological prospection study (Guyot, Hubert-Moy and Lorho 2018). The MSTP image was generated from a hyperscale datacube (30 bands corresponding to 30 window sizes) of the topographic metric DEV (deviation from mean elevation) (Wilson and Gallant 2000) and reduced to three dimensions by extracting the absolute maximum value from micro, meso, and macro scale ranges, which had window sizes of 3-21, 23-203 and 223-2023 px, respectively. Second, a morphological VT was created by combining a red-toned elevation gradient (slope) and a greyscale positive/negative topographic openness based on Chiba, Kaneta and Suzuki (2008). Finally, MSTP and morphological VT were blended into a single composite image using a soft-light blending mode with 100% and 70% opacity, respectively.
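The reduction from a multi-window DEV stack to three MSTP channels can be illustrated with a minimal pure-Python sketch. It assumes the usual definition of DEV (centre elevation minus the window mean, divided by the window standard deviation) and reads "absolute maximum" as the largest |DEV| within each scale range. The toy window sizes and the helper names (`dev`, `mstp_pixel`) are hypothetical and are not the 30 windows used in the study.

```python
from statistics import mean, pstdev


def dev(dtm, row, col, window):
    """DEV at (row, col): (z - window mean) / window standard deviation.

    `window` is an odd window size in pixels; windows are clipped at the edges.
    """
    half = window // 2
    rows, cols = len(dtm), len(dtm[0])
    values = [
        dtm[r][c]
        for r in range(max(0, row - half), min(rows, row + half + 1))
        for c in range(max(0, col - half), min(cols, col + half + 1))
    ]
    sd = pstdev(values)
    return 0.0 if sd == 0 else (dtm[row][col] - mean(values)) / sd


def mstp_pixel(dtm, row, col, scale_ranges):
    """Largest |DEV| per scale range (micro, meso, macro) for one pixel."""
    return tuple(
        max(abs(dev(dtm, row, col, w)) for w in windows)
        for windows in scale_ranges
    )


def toy_dtm(centre_value, size=7):
    """Flat terrain with a single anomalous cell in the middle."""
    grid = [[0.0] * size for _ in range(size)]
    grid[size // 2][size // 2] = centre_value
    return grid


# Hypothetical scale ranges for the 7 x 7 toy grid (not the study's windows).
TOY_RANGES = ((3,), (5,), (7,))

bump = mstp_pixel(toy_dtm(1.0), 3, 3, TOY_RANGES)   # small mound
pit = mstp_pixel(toy_dtm(-1.0), 3, 3, TOY_RANGES)   # small depression
```

Under this reading, a mound and a depression of the same magnitude yield the same three values, because the sign of DEV is discarded when the absolute maximum is taken.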
The resulting enhanced multiscale topographic position (eMSTP) image (Figure 4) was proposed as an optimal VT for this study. It provided effective and informative multiscale visualization of archaeological structures (Guyot, Hubert-Moy and Lorho 2018) and enhanced perception of local morphological characteristics of the terrain (a known limitation of MSTP (Guyot, Hubert-Moy and Lorho 2018)). A 3-channel image was used as input to the network to facilitate transfer learning from models trained on natural RGB images. eMSTP images were cropped from the overall mosaic as 150 images, 512 px × 512 px in size, to be input into the deep CNN architecture and cover the annotated archaeological sites.
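The soft-light blending step that produces the eMSTP composite can be sketched per pixel as follows. The source does not say which soft-light formula or software was used, so this sketch assumes the W3C compositing definition of soft light, with the MSTP image as the fully opaque base layer and the morphological VT as the 70% opacity top layer. The function names and the example pixel values are hypothetical.

```python
import math


def soft_light(base, blend):
    """W3C soft-light blend of two channel values in [0, 1] (assumed formula)."""
    if blend <= 0.5:
        return base - (1 - 2 * blend) * base * (1 - base)
    if base <= 0.25:
        d = ((16 * base - 12) * base + 4) * base
    else:
        d = math.sqrt(base)
    return base + (2 * blend - 1) * (d - base)


def composite(base, blend, opacity):
    """Mix the blended value with the base according to the top-layer opacity."""
    return (1 - opacity) * base + opacity * soft_light(base, blend)


def blend_pixel(mstp_rgb, morpho_rgb, opacity=0.7):
    """Blend one RGB pixel; assumes MSTP is the base and the morphological VT the top layer."""
    return tuple(composite(b, s, opacity) for b, s in zip(mstp_rgb, morpho_rgb))


# Hypothetical pixel values, only to show the behaviour of each branch.
example = blend_pixel((0.5, 0.5, 0.25), (0.0, 0.5, 1.0))
```

A top-layer value of 0.5 leaves the base unchanged, darker values darken it and lighter values lighten it, which is why the morphological layer adds local relief without replacing the MSTP colours.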

Archaeological annotated reference data
The reference dataset consisted of 195 georeferenced polygons that represented footprints of known archaeological sites in the study area. The sites were selected from the regional archaeological reference dataset provided by the Service Régional de l'Archéologie (SRA Bretagne). Only archaeological structures whose topographic characteristics could be perceived on the LiDAR-derived DTM were kept (thus excluding sites related to small-object deposits, such as potsherds, and sites considered as above-ground structures with no influence on the bare-earth topography, such as standing stones).

Figure 3
Image dataset's workflow from DTM to enhanced Multiscale Topographic Position (eMSTP) image.
The selected archaeological sites had diverse chronologies, morphologies, and landscape contexts. Their state of conservation also varied greatly, from long-known restored monuments to unexcavated little-documented structures. The reference dataset included 195 archaeological structures, including 176 funeral structures attributed to the Neolithic, 10 funeral structures attributed to protohistoric periods, 1 motte, 3 promontory forts and 5 ruined windmills.
Given the highly imbalanced dataset (overrepresentation of Neolithic structures) and the tasks to evaluate (object detection and segmentation), the annotations were intentionally grouped into a single "archaeological structure" class with no further distinction. The reference dataset was converted from a geospatial format to an annotation format (COCO JSON) in which each annotation was associated with its corresponding eMSTP tile to be input into the deep CNN architecture. Due to spatial proximity between some archaeological sites, 150 eMSTP images covered the 195 annotations (a mean of 1.3 annotations per image).
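As an illustration of this conversion, the sketch below turns one georeferenced footprint into a COCO-style annotation for a north-up tile at 50 cm resolution. The tile origin, the coordinates and the helper names are hypothetical, and the source does not describe the actual conversion code; only the COCO field names (`segmentation`, `bbox`, `area`, `iscrowd`, `category_id`) follow the published format.

```python
def world_to_pixel(x, y, tile_left, tile_top, resolution=0.5):
    """Map projected coordinates (m) to pixel (column, row) in a north-up tile."""
    return ((x - tile_left) / resolution, (tile_top - y) / resolution)


def polygon_area(points):
    """Shoelace formula; works for any simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def to_coco_annotation(polygon_xy, tile_left, tile_top, image_id, annotation_id):
    """Build one COCO annotation for the single 'archaeological structure' class."""
    pixels = [world_to_pixel(x, y, tile_left, tile_top) for x, y in polygon_xy]
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": 1,  # only one class is used in this setting
        "segmentation": [[value for point in pixels for value in point]],
        "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
        "area": polygon_area(pixels),
        "iscrowd": 0,
    }


# Hypothetical 10 m x 10 m footprint inside a tile whose top-left corner is (990, 2020).
footprint = [(1000, 2000), (1010, 2000), (1010, 2010), (1000, 2010)]
annotation = to_coco_annotation(footprint, 990, 2020, image_id=1, annotation_id=1)
```

At 50 cm resolution the 10 m square becomes a 20 px × 20 px box, so the pixel area stored in the annotation is 400.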

Overall workflow
From the eMSTP images input, the overall workflow (Figure 5) of the approach consisted of two main parts:
• Object detection and segmentation
• Object characterization

(Semi-)automatic object detection and segmentation
We used the open-source implementation of Mask R-CNN developed by Matterport (Abdulla 2017). The feature-extraction phase (backbone) was performed using the ResNet-101 deep CNN initialized with weights pre-trained on the COCO dataset (Lin et al. 2014) for the transfer-learning strategy.
To limit overfitting due to the small training dataset, data augmentation (DA) was activated in the Mask R-CNN workflow using the imgaug library (Jung et al. 2020). For each epoch, input images were randomly augmented with affine transformations (scaling: 80-120% of the original image size; translation: -20% to 20% of the original image position; rotation: -25° to 25° of the original image orientation). These transformations were defined within limited ranges of scaling, translation and rotation to avoid unrealistic versions of the eMSTP images. The augmentation process was applied 50% of the time to ensure that the deep CNN received both augmented and non-augmented versions of the training dataset.
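The augmentation policy can be sketched without any image library by sampling the affine parameters only. This is a minimal illustration under stated assumptions: each parameter is drawn uniformly from the ranges given above, horizontal and vertical translations are drawn independently, and `sample_augmentation` is a hypothetical helper rather than part of imgaug. In imgaug, the same idea would presumably be expressed by wrapping an affine augmenter in a 50% "sometimes" wrapper, but the exact configuration used is not given in the source.

```python
import random


def sample_augmentation(rng, probability=0.5):
    """Return affine parameters for one image, or None to keep it unchanged."""
    if rng.random() >= probability:
        return None  # the image is passed through without augmentation
    return {
        "scale": rng.uniform(0.8, 1.2),                 # 80-120% of original size
        "translate": (rng.uniform(-0.2, 0.2),           # fraction of image width
                      rng.uniform(-0.2, 0.2)),          # fraction of image height
        "rotate_deg": rng.uniform(-25.0, 25.0),         # rotation in degrees
    }


rng = random.Random(42)
draws = [sample_augmentation(rng) for _ in range(10_000)]
augmented = [d for d in draws if d is not None]
augmented_share = len(augmented) / len(draws)
```

Because new parameters are drawn at every epoch, the network sees a different variant of each image over training while roughly half of the presentations remain unmodified.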
A specific sampling strategy was used to assess the model's stability (varying training/validation/test draws) and sensitivity to the number of training samples (varying training size). The initial dataset of 150 images was randomly split into 110, 20 and 20 images for training, validation and testing, respectively. This random split was performed 10 times to create 10 different experimental datasets (different draws). For each experimental dataset, the training dataset was divided into 11 sub-training datasets with 10-110 images, with an increment of 10. Given the number of experimental datasets and sub-training datasets, a total of 110 experimental configurations were available (see Appendix A.1). Each experimental configuration was checked to ensure that no leaks occurred between validation, test and training datasets.
Many hyperparameters can be calibrated in Mask R-CNN. To reduce specific effects and focus on the generalized behavior of the model, only a few hyperparameters were configured. The Region Proposal Network (RPN) was configured to consider the size and aspect ratios of objects of interest by setting RPN_ANCHOR_SCALES = [16, 32, 64, 128] (in px) and RPN_ANCHOR_RATIOS = [0.5, 1, 2] (width/height ratio). The training was performed over 60 epochs with a decaying learning rate (LR) schedule (training stage 1: 20 epochs at LR 10⁻³; training stage 2: 20 epochs at LR 10⁻⁴; training stage 3: 20 epochs at LR 10⁻⁵). To account for the variability in training size (10-110 images, depending on the experiment), the number of iterations per epoch (STEPS_PER_EPOCH parameter) was dynamically adjusted to the number of training images available at the beginning of each experiment (assuming a batch size of 1, and 1 image per GPU). This configuration ensured that the deep CNN observed each image (or its augmented version) only once per epoch.
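The experimental design can be sketched as follows. Several details are assumptions made for illustration: the draws are labelled A to J, the sub-training datasets are nested (each one is the first N images of the 110-image training draw), and `make_configurations` and `learning_rate` are hypothetical helpers, not functions of the Mask R-CNN implementation.

```python
import random
import string

# (first epoch after the stage, learning rate) for the three 20-epoch stages.
LR_STAGES = [(20, 1e-3), (40, 1e-4), (60, 1e-5)]


def learning_rate(epoch):
    """Learning rate for a zero-based epoch index in the 60-epoch schedule."""
    for end_epoch, rate in LR_STAGES:
        if epoch < end_epoch:
            return rate
    raise ValueError("epoch outside the 60-epoch schedule")


def make_configurations(image_ids, n_draws=10, sizes=(110, 20, 20), step=10, seed=0):
    """Build draw x training-size configurations and check for leaks."""
    rng = random.Random(seed)
    n_train, n_val, n_test = sizes
    configurations = {}
    for draw in range(n_draws):
        shuffled = rng.sample(image_ids, len(image_ids))
        train = shuffled[:n_train]
        val = shuffled[n_train:n_train + n_val]
        test = shuffled[n_train + n_val:n_train + n_val + n_test]
        for size in range(step, n_train + 1, step):
            subset = train[:size]  # assumption: nested sub-training datasets
            if set(subset) & (set(val) | set(test)):
                raise RuntimeError("leak between training and validation/test")
            name = f"{string.ascii_uppercase[draw]}_train{size}"
            configurations[name] = {
                "train": subset,
                "val": val,
                "test": test,
                # batch size 1: one step per training image and per epoch
                "steps_per_epoch": len(subset),
            }
    return configurations


configurations = make_configurations(list(range(150)))
```

With 10 draws and 11 training sizes this yields the 110 configurations described above, and the steps-per-epoch value grows with the training size so that each image is presented once per epoch.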
The training process was set to fine-tune only the head layers of the network (RPN, classifier and mask), while the other layers were frozen, to maximize use of transfer learning within the backbone network. The validation dataset was used to monitor the loss at the end of each epoch. For each experimental configuration, the model was run in inference mode to predict results from the test dataset (20 images). The inference returned a BBOX, confidence score and binary mask (or segment) for each object detected in the images of the test dataset.
Model performance was evaluated both statistically and visually. Predictions were assessed statistically per experiment by using metrics adapted to object detection and segmentation. The AP (average precision) for an IoU (intersection over union) threshold of 0.5 was used to assess each image and averaged as mAP to assess each dataset of the experimental configurations.
IoU refers to the overlapping score of the predicted mask compared to the reference data:
\[ \mathrm{IoU} = \frac{\text{area of intersection}}{\text{area of union}} \tag{1} \]
AP refers to the area under the precision-recall curve, with:
\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \]
with TP, FP and FN the true positives, false positives and false negatives, respectively. mAP@IoU_v refers to the mean of the APs at an IoU threshold v for a given dataset, with:
\[ \mathrm{mAP@IoU}_v = \frac{1}{n} \sum_{i=1}^{n} \mathrm{AP@IoU}_{v,i} \]
with n the number of images i for a given dataset.
Visual analysis was then performed to compare reference data and model predictions for each image for three case studies.
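A minimal sketch of these metrics is given below, with masks represented as sets of pixel coordinates. The matching rule (predictions ranked by confidence, each matched to the best unmatched reference with IoU at or above the threshold) and the monotone precision envelope are one common convention and are assumptions here; the evaluation code used in the study may differ in detail. The helper names and the toy masks are hypothetical.

```python
def square(row, col, size):
    """Toy mask: set of (row, col) pixels covering a size x size square."""
    return {(r, c) for r in range(row, row + size) for c in range(col, col + size)}


def iou(mask_a, mask_b):
    """Intersection over union of two pixel sets."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def average_precision(predictions, references, threshold=0.5):
    """AP for one image; predictions are (confidence, mask) pairs."""
    if not references:
        return 0.0
    ranked = sorted(predictions, key=lambda item: item[0], reverse=True)
    matched, hits = set(), []
    for _, mask in ranked:
        best_index, best_score = None, threshold
        for index, reference in enumerate(references):
            score = iou(mask, reference)
            if index not in matched and score >= best_score:
                best_index, best_score = index, score
        if best_index is not None:
            matched.add(best_index)
        hits.append(best_index is not None)
    precisions, recalls, true_positives = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_positives += hit
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(references))
    # Make precision non-increasing before integrating (envelope convention).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    area, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


def mean_average_precision(per_image_ap):
    """mAP: mean of the per-image AP values of a dataset."""
    return sum(per_image_ap) / len(per_image_ap)


references = [square(0, 0, 4), square(10, 10, 4)]
predictions = [
    (0.9, square(0, 0, 4)),     # exact match
    (0.8, square(30, 30, 4)),   # no overlap with any reference
    (0.7, square(10, 11, 4)),   # shifted by one column
]
shifted_iou = iou(square(10, 11, 4), square(10, 10, 4))
image_ap = average_precision(predictions, references)
dataset_map = mean_average_precision([image_ap, 1.0])
```

In this toy image the shifted prediction still counts as a match at the 0.5 threshold, while the unmatched middle prediction lowers the precision reached at full recall.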
To assess the approach within an archaeological prospection scheme, we trained an additional deep CNN model (the deployment model) using all possible reference data (i.e. 150 images). The deployment model was applied to an independent set of images of the study area that did not contain any known archaeological structures that are topographically visible. The model was evaluated through human interpretation and field survey.

Characterization of segmented objects
The results of (semi-)automatic detection and segmentation (i.e., predicted masks) were used to evaluate object characterization (morphological and contextual characterization). Predicted masks (polygons) were used as base units to calculate simple morphometric descriptors (Table 1) and extract hyperscale topographic position signatures of the segmented objects (see the LiDAR-derived visualization image section for details on the hyperscale datacube).
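As an illustration of what simple morphometric descriptors of a mask polygon can look like, the sketch below computes area, perimeter and a compactness index. The actual descriptors of Table 1 are not reproduced in this section, so this particular choice (including the Polsby-Popper compactness, 4πA/P²) is an assumption, and the rectangle used as input is hypothetical.

```python
import math


def morphometric_descriptors(polygon):
    """Area (m²), perimeter (m) and compactness of a simple polygon in metres."""
    edges = list(zip(polygon, polygon[1:] + polygon[:1]))
    area = abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in edges)) / 2
    perimeter = sum(math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in edges)
    # Compactness equals 1 for a circle and decreases for elongated shapes.
    compactness = 4 * math.pi * area / perimeter ** 2
    return {"area": area, "perimeter": perimeter, "compactness": compactness}


# Hypothetical 20 m x 10 m rectangular footprint.
descriptors = morphometric_descriptors([(0, 0), (20, 0), (20, 10), (0, 10)])
```

Descriptors of this kind are computed from the predicted mask alone, which is what makes segmentation, rather than a bounding box, necessary for characterization.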

OBJECT DETECTION AND SEGMENTATION PERFORMANCES
Sensitivity of deep CNN to the amount of sample data
The overall performances of the deep CNN approach applied to 110 experimental datasets (i.e. 10 datasets × 11 training sizes) were measured using the mean average precision (mAP) metric. The creation of the experimental datasets from 150 images and the evaluation metric (mAP@IoU.5) used to assess performance are described in the Materials and methods section.
The mAP@IoU.5 ranged from 0.29 (experiment F_train10) to 0.77 (experiment A_train80), with a mean of 0.50 and standard deviation of 0.10 (Figure 6a and 6b). The sensitivity analysis of the number of training images available (Figure 6b) showed that mean mAP@IoU.5 increased from 0.37 to 0.55 as the number of training images increased from 10 to 110, respectively. Mean mAP@IoU.5 varied greatly among datasets (Figure 6c), with the mean mAP@IoU.5 ranging from 0.40 (dataset E) to 0.69 (dataset A).

Detailed analysis of three case studies
Predictions for object detection and segmentation compared to the reference dataset from a per-image analysis are illustrated (Figure 7) for three areas (Area 1, Area 2, Area 3). Models A_train110 (maximum training size) and A_train10 (minimum training size) were used as contrasting examples.
The low-trained model (A_train10) and high-trained model (A_train110) performed well in this area, with 3/3 matches (AP@IoU.5 = 0.92 and 1.0, respectively) (Figure 8). A_train10 predicted five objects (Figure 7c) that corresponded to three known archaeological structures. However, for the two objects with the lowest IoU values (obj. 3 (0.66) and 5 (0.31)) the predicted BBOXs influenced the predicted mask. While obj. 3 converged to a correctly adjusted segment by leveraging the segmentation phase within a BBOX larger than the target, obj. 5 resulted in an excessively small segment bounded by an excessively small predicted BBOX. A_train110 also predicted five objects (Figure 7d); the three with the highest confidence scores corresponded to the three known archaeological structures. The other two objects (obj. 4 and 5), which had lower confidence scores (0.85 and 0.74 respectively), were local topographic anomalies assumed to be due to recent (contemporary period) forestry operations. The quality of the predicted segments was confirmed using available archaeological documentation and in-situ photos (Figure 9).
Both the low-trained model (A_train10) and high-trained model (A_train110) were able to predict the presence of the archaeological structure (AP@IoU.5 = 1.0). A_train10 predicted two objects (Figure 7g); the BBOX with the highest confidence score (0.86) corresponded to the motte's location. The second BBOX (confidence score 0.74) was a false positive most likely due to an irregularity in the interpolated DTM that was visible on the enhanced multiscale topographic position (eMSTP) image on the surface of a lake.
A_train110 predicted a single object with a confidence score of 1.00 at the motte's location (Figure 7h). While the predicted mask (770 m²) was slightly larger than the object that had been drawn manually based on the reference data (690 m²), it represented the compact ovoid shape (Figure 10a) of the archaeological structure better. Topographic analysis across the predicted mask identified a visible external ditch and an internal embankment (Figure 10c and 10d).

Area 3: Le Net, St Gildas de Rhuys
Area 3, located at Le Net (Saint Gildas de Rhuys, France), has a Neolithic passage grave 21 m long registered as a National Historic Monument since 1923 (Figure 11a). The site, located in an agricultural field and covered by vegetation and bushes (Figure 11b, 11c), is identified as Clos Er Bé 1 (56 214 0004) on the national archaeological map.
A_train10 predicted that the monument was contained in one (obj. 1) of the three objects detected (Figure 7k). However, visual analysis revealed that obj. 1 was a large (>3 ha) irregular stain that covered most of the image.
The commission error associated with this single object was 99%.
A_train110 also predicted three objects (Figure 7l). The passage grave was predicted (obj. 3) with a confidence of 0.93 and an IoU of 0.79, indicating that it corresponded to the footprint of the archaeological structure provided by the reference dataset. The other two objects (obj. 1 and 2), which had higher

OBJECT CHARACTERIZATION: INITIAL RESULTS
As mentioned, the (semi-)automatic process of the deep CNN provided two levels of information: (i) the location of the objects of interest (BBOX, associated with a confidence score) and (ii) a mask that describes the shape of each predicted object. The latter information was used to characterize the context and morphology of the detected and segmented objects.
This approach was applied to the archaeological site of Park Er Guren (Figure 12), which is located east of the Bay of Saint Jean in the commune of Crac'h. The site contains two dolmens separated by 25 m in a north-south orientation that were registered as National Historic Monuments in 1926. The model predicted the presence of two objects (Figure 13). Hyperscale topographic position signatures (Figure 14) and morphometric descriptors (Table 2) were calculated for the masks of both objects. The hyperscale topographic position signatures and morphometric descriptors were then used to provide a data-driven description of the predicted objects, which was then compared to the archaeological reference dataset and additional archaeological documentation (Gouezin 2017; Le Rouzic 1933) as follows:
- Object 1 was a pseudo-circular element 16-20 m in diameter composed of two main topographical units (groups of signatures). The first unit largely dominated its environment at the mesoscale (10-100 m) and, to a lesser extent, macroscale (100-1000 m). The second unit, with only a few pixels, had a negative value of topographic position at the micro-/meso-scale, indicating the presence of a pit or trench. This object corresponded to Dolmen A and described its visible inner structures (e.g. corridor, central position of the chamber). The dolmen's topographically dominant position is characteristic of other Neolithic funeral monuments in the area.
- Object 2 was a large piriform element 64 m long and 37 m wide that varied greatly in topographic positions. Its mean topographic position became progressively dominant at the meso- and macro-scales, while not being the most dominating element within windows wider than 500 m. A few signatures were highly negative at the microscale, indicating the presence of local depressions within the object's footprint. The complex combination of signatures reflects the multiple topographical units in this piriform mound. The reference data did not describe this complex structure (thus making it statistically a false positive or commission error), but the object suggested an elongated tumulus associated with the dolmens. In addition to the mound, analysis of the hyperscale topographic position signatures suggested topographically visible pits that may correspond to (i) the chamber of Dolmen B and (ii) modern excavation areas visible on the western flank of the mound (Figure 13). Locally (micro- and meso-scales), dominating signatures highlighted the presence of the north-south oriented embankment on top of the mound.
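The scale vocabulary used in these descriptions follows from the window ranges of the hyperscale datacube (3-21, 23-203 and 223-2023 px) and the 50 cm DTM resolution. The sketch below groups a signature by scale range and averages it, as one simple way to summarize dominance per scale; the summary statistic, the helper names and the signature values are hypothetical and are not taken from the study.

```python
SCALE_RANGES_PX = {"micro": (3, 21), "meso": (23, 203), "macro": (223, 2023)}
RESOLUTION_M = 0.5  # DTM resolution stated for the study


def scale_of(window_px):
    """Name of the scale range that contains a window size in pixels."""
    for name, (low, high) in SCALE_RANGES_PX.items():
        if low <= window_px <= high:
            return name
    raise ValueError(f"window of {window_px} px is outside the datacube ranges")


def summarize_signature(signature):
    """Mean topographic position per scale range.

    `signature` is a list of (window size in px, topographic position value).
    """
    groups = {}
    for window_px, value in signature:
        groups.setdefault(scale_of(window_px), []).append(value)
    return {name: sum(values) / len(values) for name, values in groups.items()}


# Hypothetical signature of a pixel lying in a small pit on top of a mound.
signature = [(3, -0.8), (21, -0.4), (23, 0.5), (203, 1.5), (223, 0.6), (2023, 0.2)]
summary = summarize_signature(signature)
extents_m = {name: (low * RESOLUTION_M, high * RESOLUTION_M)
             for name, (low, high) in SCALE_RANGES_PX.items()}
```

The resulting extents (about 1.5-10.5 m, 11.5-101.5 m and 111.5-1011.5 m) match the orders of magnitude quoted in the text for the micro-, meso- and macro-scales.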

EVALUATION OF THE APPROACH WITHIN AN ARCHAEOLOGICAL PROSPECTION SCHEME
The results of the deployment model showed predicted potential structures with confidence scores ranging from 0.5 to 1. These prediction results highlighted the pixel-to-object aggregation capability of the deep CNN approach, and the predicted objects shared shape and size characteristics with the reference data used to train the model. The predicted objects were visually interpreted on the eMSTP image for two additional study sites that were not included in model training, validation or testing.

Analysis on the area of Goah Leron, Carnac (France)
Objects A and B were considered as interesting structures for further field verification. Object A, with a circular shape (16 m diameter) and low positive elevation (less than 0.3 m above the surrounding terrain), showed a rough texture on the eMSTP image, typical of undergrowth vegetation under dense canopy (Figure 15). Object B, with a pseudo-circular shape (36 m diameter) and a positive elevation of 0.8 m above the surrounding terrain, shared the same eMSTP characteristics. Notably, standing stones (not visible on the LiDAR-derived DTM) are attested between objects A and B, which supports the possible presence of Neolithic burial mounds nearby. Object C was considered a false positive. This object corresponded to a north-south oriented terrain depression 12 m wide, 46 m long and 40 cm deep that shared similarities with the representation of some elongated tumuli in the eMSTP image. This was mostly due to the conversion of the topographic metric DEV from relative to absolute values during the calculation of the eMSTP image.
Object D was also considered as a false positive. This object, which corresponded to a horse training arena with flat elevation and surrounding embankments, shared shape characteristics with reference data, but not topographic or texture characteristics.
The model did not predict any potential structure on the hill located north-east of the area (point E). While the yellow-reddish color in the eMSTP image (associated with the meso- and macro-scale dominating topographic signature) corresponded to the specific position of many tumuli in the study area, the model did not predict any object, which was probably due to the absence of local morphological anomalies.

Figure 15
Example of prediction results outside the reference dataset, Goah Lêron area, Carnac (France). Objects A and B were considered as interesting structures for further field investigation based on human interpretation of the eMSTP image. Objects C and D were considered as false positives. Point E highlighted the fact that no potential structure was predicted on the hill. The remaining objects (small isolated mounds) would require further investigation.
The remaining predicted objects were isolated small mounds (4 to 6 m in diameter) less than 1 m high, most of them located in open agricultural areas. Their nature could not be determined from the interpretation of the eMSTP image alone, and further investigation would be required to identify them.

Analysis on the area of Brahen, Carnac (France)
Objects A and B were identified as archaeological entities. Object A was a circular mound (26 m diameter) with positive elevation of 0.8 m above the surrounding terrain (Figure 16). The field verification confirmed the probable archaeological nature of this structure as a tumulus, with a possible attribution to the Bronze Age based on its morphology. Object C corresponded to a dominating terrain covered by dense vegetation with a morphological anomaly on its highest position (Object B). In the field, remaining elements of a possible megalithic stone alignment were identified at this position.
Object D was considered a false positive. This object corresponded to a narrow ditch with east-west orientation that shared similarities with the representation of some elongated tumuli in the eMSTP image. This detection error could be due to the conversion of the topographic metric DEV from relative to absolute values during the calculation of the eMSTP image.
The remaining predicted objects corresponded to local morphological anomalies that would require further investigation.

Figure 16
Example of prediction results outside the reference dataset, Brahen area, Carnac (France). Objects A and B corresponded to archaeological structures confirmed by field verification, object C being a dominating terrain including object B. Object D was considered a false positive. Remaining objects were local morphological anomalies that would require further investigation.

SENSITIVITY OF THE APPROACH AND GENERALIZATION ABILITY
The deep CNN approach resulted in high detection and segmentation performances (mAP up to 0.77) with relatively small training datasets. The largest training dataset contained 110 images, which is small training set for deep learning. This confirms the approach's ability to perform well in archaeological contexts in which sparse reference data are a common limitation. Nonetheless, the model's sensitivity to the images selected for the training and test datasets (with mAP@ IoU.5 varying from 0.29 (model E train110 ) to 0.77 (model A train110 ) for the same number of training images) raises some concerns. A previous study that focused on (semi-) automatic archaeological mapping (Verschoof-van der Vaart et al. 2020) also mentioned this sensitivity. Some of the variability is related to the metrics used to evaluate detection and segmentation performances, but the main sources of variability seem to be the images selected for model evaluation (the complexity of the test dataset) and training (whether the training dataset is representative and comprehensive) (Soroush et al. 2020).
The deep CNN approach showed adaptability in detecting and segmenting different archaeological structures within the region. However, model training and evaluation were limited to a region that has particular topographic and archaeological characteristics. Most of the archaeological structures contained in the reference dataset have a topographically dominant position (burial mounds, hillforts, windmills), but their local dominance is highly variable in magnitude and scale. While the trained models detected most above-mean elevations (e.g. roundabout), they differed from local maximum detectors in their ability to consider the following archaeological landscape characteristics: the multiscale topographic position of the sites (maxima at specific local neighborhood or scale) and the local morphological patterns of archaeological structures. As confirmed by the results obtained by applying the deployment model to an independent set of images of the study area, these characteristics were learned during the training phase and used for prediction. This demonstrated the generalization capabilities of the approach in the geoarchaeological context of the study area.
The limits of the deep CNN approach were also identified. Besides prediction errors that were expected (e.g. roundabout), errors were also observed for objects sharing few or no similarities with the reference dataset (e.g. horse training area, large ditch). Such undesired behavior of the deep CNN models raised the question of negative training (i.e. providing the model with negative examples during training). While this was not implemented in the Mask R-CNN framework used in this study, it should be addressed in future work to improve prediction performances, for example using software frameworks that handle negative training for instance segmentation, such as Detectron2 (Wu et al. 2019).
More generally, results showed that particular attention should be paid to the selection of training examples. The sample selection strategy remains a challenging concern, especially given the hidden and non-intuitive phenomena related to deep CNN. Tools that facilitate insights into model successes and failures, such as Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al. 2020), could be used to address this concern.
Further investigation of the multiple hyper-parameters and model configurations of deep CNN architectures would be helpful to assess the scope and limits of the approach. As an example, data augmentation (DA) was empirically used to improve model performances and generalization capabilities (Shorten and Khoshgoftaar 2019). The evaluation of DA was not included in this study, because a comprehensive assessment would involve a full-fledged study (evaluation of performances with and without DA, and with multiple DA configurations involving various combinations of DA techniques). Although we did not perform this comprehensive evaluation, we evaluated the effect of DA on a single model (A train110) trained without and with data augmentation using a performance test. Results showed an increase in mAP@IoU.5 from 0.64 to 0.75.
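As an illustration of what a DA configuration can look like, the following minimal sketch applies simple geometric transforms (flips and a 90-degree rotation) to an image and its segmentation mask. The choice of transforms and all function names are assumptions made for illustration; they do not describe the DA configuration actually used in this study.

```python
def flip_horizontal(grid):
    """Mirror a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]


def flip_vertical(grid):
    """Mirror a 2D grid top to bottom."""
    return [row[:] for row in grid[::-1]]


def rotate_90(grid):
    """Rotate a 2D grid clockwise by 90 degrees."""
    return [list(row) for row in zip(*grid[::-1])]


def identity(grid):
    """Return an unchanged copy of the grid."""
    return [row[:] for row in grid]


def augment(image, mask):
    """Return (image, mask) pairs, applying the same transform to both.

    The mask must follow the image exactly, otherwise the segmentation
    labels no longer align with the terrain they describe.
    """
    transforms = [identity, flip_horizontal, flip_vertical, rotate_90]
    return [(t(image), t(mask)) for t in transforms]


# Toy 2 x 2 example: the mask marks the pixel whose image value is 2.
image = [[1, 2], [3, 4]]
mask = [[0, 1], [0, 0]]
pairs = augment(image, mask)
```

Each training image then yields several variants, and the pixel flagged by the mask keeps the same image value in every variant.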
Assessing the overall generalization ability at a larger geographical scale (spatial generalization) and for more types of archaeological structures (typological generalization) would require further experiments. First, to assess spatial generalization, a pre-trained model could be used to identify topographical anomalies that have characteristics similar to those on the coast of Morbihan using the LiDAR datasets of relevant regions in the world. Second, to assess typological generalization, the model could be retrained to include new types of structures to increase the diversity of archaeological contexts assimilated by the deep CNN. These strategies would benefit from a public benchmark dataset dedicated to the detection of archaeological sites from remotely sensed data.

EVALUATION METRICS FOR AMBIGUOUS REFERENCE DATA
The results indicate that statistical assessment of the models provided an objective metric of the quality of predictions, but it did not completely capture the approach's performance because the overall mAP hides local discrepancies that could be identified only through case-by-case visual analysis of model predictions. The metrics used for object detection and segmentation were based on an overlap measurement (i.e., IoU) to which a threshold was applied to determine a match or non-match. However, the complex relation between remotely sensed archaeological information and comprehensive archaeological information (e.g. excavation and field reports, archives) is not considered, regardless of the threshold value (i.e. one or more values). The definition of reference data frequently raises issues in archaeological mapping, such as how remote sensing perceives the footprint of a known archaeological structure or diffuse footprints, such as large artificial mounds that have been eroding for thousands of years.
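The effect of thresholding the overlap measurement can be illustrated with a minimal sketch in which masks are represented as sets of pixel coordinates. The helper names, the toy rectangles and the 0.5 threshold are assumptions chosen for illustration, not the evaluation code of this study.

```python
def mask_iou(pred, ref):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = pred | ref
    if not union:
        return 0.0
    return len(pred & ref) / len(union)


def is_match(pred, ref, threshold=0.5):
    """A prediction counts as a match when its IoU reaches the threshold."""
    return mask_iou(pred, ref) >= threshold


def rectangle(row0, col0, height, width):
    """Hypothetical helper: filled rectangular mask as a set of pixels."""
    return {(r, c)
            for r in range(row0, row0 + height)
            for c in range(col0, col0 + width)}


reference = rectangle(0, 0, 10, 10)    # reference footprint, 100 pixels
shifted = rectangle(0, 3, 10, 10)      # same shape, offset by 3 columns
diffuse = rectangle(-5, -5, 20, 20)    # wider footprint enclosing the reference

iou_shifted = mask_iou(shifted, reference)   # 70 / 130
iou_diffuse = mask_iou(diffuse, reference)   # 100 / 400
```

In this toy case the shifted prediction is accepted at a threshold of 0.5, whereas the wider prediction, which fully contains the reference footprint, is rejected. This mirrors the difficulty raised by diffuse footprints: the binary outcome depends on how the reference boundary was drawn.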
Similar concerns also arise for detecting undiscovered archaeological structures. A false detection by machine learning could become a true positive after in-situ verification. Therefore, a liberal strategy (rather than a conservative strategy) is required to define the detection thresholds (related to the confidence score and overlap measurement), which allows for a certain number of false positives. This study's examples of false-positive detections (Figure 7d and 7l) are representative of this intentionally liberal strategy, with topographical structures detected (i) correctly because they share characteristics with known archaeological structures and (ii) incorrectly because they are ultimately interpreted as contemporary human earthworks that are not considered of archaeological importance. Such a strategy can be justified to detect a maximum number of potential structures, as long as the prediction corresponds to a relevant response from the deep CNN considering the input examples it was trained on. Then, potential structures are interpreted based on human expertise.
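The difference between a liberal and a conservative strategy can be sketched as a simple filter on the confidence score. The `Detection` type, the labels, the scores and the two threshold values below are hypothetical and serve only to illustrate the principle.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    label: str     # hypothetical identifier of the predicted object
    score: float   # confidence score in [0, 1]


def keep_detections(detections, min_score):
    """Keep the detections whose confidence score reaches min_score."""
    return [d for d in detections if d.score >= min_score]


# Illustrative scores only, not results from the study.
candidates = [
    Detection("a", 0.95),
    Detection("b", 0.72),
    Detection("c", 0.41),
    Detection("d", 0.18),
]

liberal = keep_detections(candidates, min_score=0.3)        # keeps a, b, c
conservative = keep_detections(candidates, min_score=0.9)   # keeps a only
```

A lower threshold passes more candidates on to expert interpretation, at the cost of more detections that may later be discarded.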
These issues highlight that the current evaluation metrics, which originated from computer-vision and image-analysis domains, are only partially adapted to archaeological mapping. Future studies could address this limitation, for example by using fuzzy approaches.

ONE-CLASS APPROACH AND POST-DETECTION CHARACTERIZATION: POTENTIAL FOR A NEW PARADIGM FOR (SEMI-) AUTOMATIC MAPPING IN ARCHAEOLOGY
Most approaches in machine learning-based archaeological mapping use a pre-defined nomenclature (e.g. barrows, charcoal kilns, Celtic fields, burial mounds, mining pits) to consider local archaeological characteristics (e.g. site morphology, chrono-typological relation, spatial relationship). However, a standard and consensual typology appropriate for remotely sensed archaeological structures that span time and space remains a concern (Tarolli et al. 2019). Moreover, classes are often distributed unequally (i.e. datasets of archaeological structures with a lack of samples for certain classes).
We used a one-class rather than multi-class approach to address these two issues because we assumed that the deep CNN would have higher generalization abilities (i.e. depend less on target type and variety) with a one-class approach. This was confirmed by the results obtained for the Er Castellic motte, whose structure type was not included in the training dataset. Although this artificially elevated terrain monument was the only example of its type in the study area, it was sufficiently similar to a tumulus for the model to detect it as an object of interest. These topographical and morphological similarities with certain tumuli were mentioned in an archaeological prospection report (Brochard 1994) and reinforced our assumption. Indeed, from a LiDAR perspective, archaeological sites of different chronologies and typologies share patterns that the deep CNN can discover and extract.
The characterization phase, based on the object-segmented mask and data-driven description, provides information that can help to identify the nature of the archaeological structures. For example, characterization of the objects detected and segmented at the Park Er Guren site made it possible to identify a tumulus and related dolmens. Although more examples are required to confirm this assumption, this approach provides new perspectives by inverting the common conceptual model in remote-sensing archaeological mapping in which a typology of target options must be defined before (semi-)automatic detection.
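A data-driven description of a segmented object can be sketched as follows, with a mask represented as a set of pixel coordinates. The descriptors (area, bounding-box extent, elongation), the function name and the 0.5 m cell size are assumptions for illustration; they are not the descriptors computed in this study.

```python
def describe_mask(pixels, cell_size=1.0):
    """Return simple morphometric descriptors of a segmented mask.

    pixels: set of (row, col) coordinates belonging to the object.
    cell_size: ground size of one pixel, in metres (assumed value).
    """
    if not pixels:
        raise ValueError("empty mask")
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    height = (max(rows) - min(rows) + 1) * cell_size
    width = (max(cols) - min(cols) + 1) * cell_size
    return {
        "area": len(pixels) * cell_size ** 2,                   # square metres
        "height": height,                                       # north-south extent
        "width": width,                                         # east-west extent
        "elongation": max(height, width) / min(height, width),  # 1.0 = compact
    }


# Toy elongated object: 4 rows by 10 columns at an assumed 0.5 m resolution.
elongated = {(r, c) for r in range(4) for c in range(10)}
summary = describe_mask(elongated, cell_size=0.5)
```

Descriptors of this kind could then be compared between detected objects to support interpretation, for example to separate compact mounds from elongated ones.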

CONCLUSION AND OUTLOOK
We demonstrated the potential of methods that detect and characterize archeological structures by performing object segmentation using a deep CNN approach combined with transfer learning. Our study reveals that the approach developed can be used to (semi-)automatically detect, delineate and characterize topographic anomalies. The results, compared to archaeological reference data collected from archaeological documentation, showed detection accuracy (mAP@IoU.5) up to 0.77 and provided new perspectives for archaeological documentation and interpretation through morphometric and contextual characterization via object segmentation. The one-class detection method combined with a characterization-interpretation strategy provides a new paradigm for prospecting archaeological structures in varying states of conservation or with conflicting typologies. The application of such a deep CNN approach to large-scale archaeological mapping in wider geographical and archaeological contexts still needs to be extended and assessed. Besides the necessary addition of a new set of reference data covering various geo-archaeological situations, this would also involve the development of methods for the optimal selection of training samples. It would also involve further investigation of the effectiveness of the LiDAR-derived VT as input to the automatic detection and segmentation processes. In this regard, the objective evaluation metrics provided by the deep CNN approach could be used for the benchmarking of new and existing VTs.