Automatic Extraction and Labelling of Memorial Objects From 3D Point Clouds

This research addresses the problem of automatic extraction of memorial objects from cultural heritage sites represented as scenes of 3D point clouds. Point clouds provide a fine spatial resolution and an accurate proxy of the real world. However, how to use them directly is not always obvious. This is especially true for applications where extensive training data or computational resources are not available. In this paper, we present a methodology for automatic segmentation and labelling of cultural heritage objects from 3D point cloud scenes. The proposed methodology is based on machine learning techniques and, in particular, makes use of the concept of transfer learning. Memorial objects are segmented from the scene based on their geometric shape characteristics through a conditional multi-scale partitioning scheme. Then, high-level latent feature descriptors are extracted by a convolutional neural network pre-trained on different 3D object models from a standard dataset (e.g., ModelNet40).


INTRODUCTION
Historic, cultural heritage and archaeological sites can be interpreted as hierarchical organisations of objects. The process of mapping and keeping an inventory of physical objects is fundamental to site conservation, management and analysis. Traditionally, objects are observed physically and recorded manually by an operator. Recent advances in remote sensing technologies, such as light detection and ranging (LiDAR) and digital photogrammetry, make it possible to instead create digital representations in the form of 3D point clouds. Indeed, 3D scanning technologies are becoming both more affordable and more versatile (Chase, Chase and Chase 2017; Favorskaya and Jain 2017; Royo and Ballesta-Garcia 2019). LiDAR-based sensor hardware is appearing in both wearable systems and handheld devices. Additionally, photogrammetry software allows 2D images to be stitched together into a 3D point cloud scene; with this there is the potential to turn any camera into a proxy 3D sensor. These offer non-invasive, fine-resolution alternatives to manual recording. As a result, 3D point cloud data are becoming a valuable resource for the fields of archaeology and cultural heritage. However, the question is then how to design an automated methodology for extracting, labelling and organising objects from these point clouds, especially one that is suitable for the real-world context.
Despite the adoption of digital technology, it remains a time-consuming task for an operator to find and label each object. Machine learning techniques that seek to automate object detection in point cloud data have been proposed recently; the most notable of these are built around convolutional neural networks (CNNs) (Bello, Yu and Wang 2020). These supervised networks are composed of sequential layers wherein increasingly complex features are extracted. Specifically, the convolutional layers systematically slide a learnable convolution matrix, or kernel, across an input. This aggregates information from adjacent entities into features that are then passed to the next layer. Provided with a large training set of labelled data, CNNs are capable of generating discriminative high-level features (Bello, Yu and Wang 2020). The problem is that point clouds are an unusual data type. They are an unordered set of points in space and represent the external surface of the sampled object or scene. Each point is a vector denoting its x, y, z coordinates, and, depending on the specific acquisition technology, the points may contain additional observed information such as colour or intensity. Moreover, CNNs cannot easily take unstructured point clouds as input.
Many practical problems, and, in particular, cultural heritage and archaeological applications, often have limited access to the labelled data necessary for training CNNs, and in some cases the required data may be entirely non-existent. Training of CNNs is also computationally demanding. Moreover, the addition of new validation data requires a complete retrain of the network. Therefore, it is not immediately clear how to take advantage of point clouds in real-world applications such as these. To this end, we present a methodology suitable for the automatic extraction and identification of objects from cultural heritage sites. We highlight how point cloud data can be used directly to map and extract objects from archaeological and cultural heritage contexts without the need to first rasterise or convert to another representation (e.g., digital elevation, surface or terrain models). We apply and validate the proposed methodology for the task of locating, extracting and labelling grave markers from cultural heritage sites.
Grave marker detection is a relatively unexplored area. Notably, to the best of our knowledge, this is the first research of its kind on the extraction of grave marker objects directly from 3D point cloud data. Grave markers can be made from many different material components and take on a multitude of different sizes and shapes depending on their location, environment, age, condition and cultural background. Therefore, a highly generalisable methodology is necessary for their detection. While not related directly to the proposed methodology, LiDAR data have been used previously to aid in cemetery surveys (Weitman 2012). Additionally, point clouds have been used to represent memorial object models; for example, Jaklič et al. 2015 reconstructed sarcophagi from a point cloud data representation of a sunken Roman shipwreck. Zacharek et al. 2017 presented a low-cost approach to the collection of 3D grave marker models. To look below ground, Cannell et al. 2018 used ground penetrating radar and geochemical analysis to explore an unmarked graveyard at a medieval church site in Norway.
A recent focus in the literature has been on applying machine learning techniques to automate the detection of archaeological and cultural heritage objects. Initial methods were concerned mainly with 2D representations (Chase, Chase and Chase 2017). However, the growing presence of LiDAR as a digital surveying tool, as well as its integration into platforms such as geographic information system (GIS) software, has made point cloud analysis an important subject of research (Chase, Chase and Chase 2017). A natural step has been to ask how point cloud data can be paired with machine learning to benefit the archaeological and cultural heritage fields. Point cloud derived representations, such as surface and terrain models, have been used in conjunction with machine learning algorithms to automate the detection of barrows (Kramer 2015; Sevara et al. 2016) and Neolithic burial mounds (Guyot, Hubert-Moy and Lorho 2018), as well as in the detection of sub-surface archaeological structures (Fryskowska et al. 2017). While more traditional machine learning techniques are often employed, neural networks have been considered as well. Kazimi et al. 2018 demonstrated a CNN using LiDAR-derived digital terrain maps to examine historic mining regions in Germany. However, by not using point clouds directly, these approaches fail to take advantage of the innate dimensional information inherent to 3D.
A limitation of traditional supervised machine learning processes is that they are domain-specific; that is, they make predictions through learned properties determined by the data with which they are trained. In contrast, transfer learning allows a machine learning model trained in one domain to be reapplied to another domain of similar data. Transfer learning is one possible solution for areas where limited training data are available. Within the context of machine learning applied to cultural heritage and archaeology, the transfer learning concept has been applied to images from remote sensing surveys (Trier et al. 2016; Trier, Cowley and Waldeland 2019; Zingman et al. 2016). More recently, Verschoof-van der Vaart and Lambers 2019 explored transfer learning with region-based CNNs to detect barrows, Celtic fields and charcoal kilns from LiDAR-derived 2D images.
This paper presents a novel methodology to address the automatic extraction and identification of object instances within cultural heritage sites represented as 3D point clouds. The contributions of this methodology are: (i) how to operate directly on point clouds in a way suitable for real-world application to cultural heritage and archaeology contexts, (ii) how to use the discriminative power of CNNs while mitigating their limitations and (iii) how to address the inherent challenges associated with point cloud data while benefiting from their 3D nature. Additionally, we propose a conditional multi-scale partitioning scheme within the methodology to ensure ground level objects are detected. In contrast to previous methodologies applied in a cultural heritage and archaeology context, the methodology presented in this paper involves methods applied directly to the 3D point cloud data rather than first transforming them to another structure.
The remainder of this paper is organised as follows. The methods and structure of the proposed methodology are detailed in Section 2; Section 2.1 details the segmentation process and extraction of geometric features and Section 2.2 describes the approach used for classification. Section 3 details the experimental results and Section 4 discusses the findings along with suggestions for future research. Finally, Section 5 concludes the paper.

METHODOLOGY
In the proposed methodology we consider the input point cloud P as a set of 3D points P = {p_i | i = 1, …, n}, where n is the total number of points in P. Each point p_i is a vector of coordinates (x, y, z). The input P represents scenes of cultural heritage sites and is assumed to have been registered and pre-processed to remove outlier and duplicate points. The goal of the methodology is to partition the scene into segments S = {S_1, …, S_h} and provide a label L for each from a set of semantic classes C. To do so, the methodology comprises two steps: segmentation and classification.

1. Segmentation is an unsupervised process that seeks to partition the point cloud scene into regions based on the continuity and homogeneity of their properties. We define the regions as local neighbourhoods and compute features that describe their similarities. We further embed this information into an attributed graph structure and approximate the segments with smooth pre-defined shapes by a generalised minimal partition model, a type of loss function (Landrieu and Simonovsky 2017). This serves two purposes. First, the resulting segments effectively represent the objects contained within the scene, either in parts or as a whole. Second, by considering segments rather than individual points, the classification task is made easier as there are guaranteed to be fewer segments than points. See Figure 1 for an illustration of the segmentation method.

2. For classification, we explore the idea of transfer learning and pre-train the ConvPoint network (Boulch 2019) on generic object models from the ModelNet40 (Wu et al. 2015) dataset. Transfer learning allows us to take advantage of the discriminative power of the CNN and couple it with the flexibility of more classic models for per-class training. We apply each partitioned segment to the pre-trained network to generate a set of high-level abstract features. These features represent a global descriptor for the segments and are input into a multilayer perceptron (MLP) network classifier to predict the class labels. See Figure 4 for an illustration of the classification method.

POINT CLOUD SEGMENTATION
In general, point cloud data are unstructured. That is, there is no defined neighbourhood to connect each point in space. This is in contrast to 2D images, where each pixel sits on a grid and has explicit neighbouring pixels. To extract meaningful features, then, some form of structure must be imposed and designed specifically for point cloud data. The common approach is to avoid processing the 3D data directly, instead rasterising the point cloud into multiple 2D representations (Su et al. 2015). An alternative is for the points to be placed within volumetric containers such as voxels (Qi et al. 2016b). Methods such as Spin Images (Johnson and Hebert 1999), kernel signatures (Aubry, Schlickewei and Cremers 2011; Bronstein and Kokkinos 2010), and inner-distance descriptors (Ling and Jacobs 2007) use a local estimate of the underlying surface around the point. Recent kernel methods build on this.
Recently, many researchers have begun to explore structure applied more directly to 3D point clouds, such as tree-based (Klokov and Lempitsky 2017; Riegler, Ulusoy and Geiger 2017), graph-based (Simonovsky and Komodakis 2017) and set-based approaches (Qi et al. 2016a, 2017). A more traditional, but equally direct, solution is to perform a point-wise search and define a structure based on the points' local neighbourhoods.
The term 'segmentation' in the context of point cloud data means the partitioning of spatial regions within the scene, based on some criteria. We can distinguish between two classes of point cloud segmentation problems commonly found in the literature. The first class of problems is to segment the scene based on some geometric similarity or characteristic and can be seen as the inference of object detection or localisation. The second class of problems is semantic segmentation, a fine-grained instance of classification. This segmentation performs point-wise classification, where individual points are provided with a label. Thereby, the scene is partitioned based on semantic similarity.
A simple form of the first class of segmentation problem is to partition the foreground and background of a scene (Dohan, Matejek and Funkhouser 2015; Golovinskiy, Kim and Funkhouser 2009). Because this type of segmentation represents regions of similarity, it is used regularly as a precursor to object classification (Golovinskiy, Kim and Funkhouser 2009; Shapovalov, Velizhev and Barinova 2010). Spina et al. 2011 demonstrated this type of point cloud segmentation in a cultural heritage context. Similar to Hackel, Wegner and Schindler 2016 and Guinard and Landrieu 2017, our method concerns the use of local point neighbourhoods from which we extract features to represent local regions of the scene. This method was chosen in contrast to semantic segmentation, which would require a network to be trained for specific terrain types as well as objects.
We consider a point-wise search to define local neighbourhoods. One such strategy is to search using a fixed radius r, whereby a spherical (Lee and Schenk 2002) or cylindrical (Filin and Pfeifer 2005) representation is used to define the neighbourhood. Another is to consider the k-nearest neighbours around each point, based on some form of distance metric. This typically involves 2D (Niemeyer, Rottensteiner and Soergel 2014) or 3D (Jonathan et al. 2001) distances. As noted by Weinmann et al. 2015, for this solution to remain practical across varying scene types, search-based solutions require some form of optimisation, either in terms of r or k, respectively. We define the points' local neighbourhoods through a k-nearest neighbour search in Euclidean space and optimise k based on eigenentropy, as advocated by Guinard and Landrieu 2017. This approach is suited to different point densities and gives more precise control over neighbourhood size (Weinmann et al. 2015).

Feature extraction
In this section, we present the features and algorithms used for the segmentation process.
The first stage of the proposed methodology is the segmentation process. Here, features that characterise the local dimensionality of the scene are extracted. For each point p_i, the k-nearest neighbouring points in the point cloud P are selected and the covariance matrix of their positions is calculated. From this we obtain the set of eigenvalues λ_1 ≥ λ_2 ≥ λ_3 and corresponding eigenvectors u_1, u_2, u_3. To determine the optimal size for k, an energy function, the same as in Weinmann et al. 2015, is used to minimise the eigenentropy E of the vector (λ_1/Λ, λ_2/Λ, λ_3/Λ), where Λ = λ_1 + λ_2 + λ_3:

E = −(λ_1/Λ) ln(λ_1/Λ) − (λ_2/Λ) ln(λ_2/Λ) − (λ_3/Λ) ln(λ_3/Λ)

This results in neighbourhoods which have maximum homogeneity, or minimum disorder, of points within the neighbourhood. The size of k is varied between k_min = 10 and k_max = 100 in increments of 1 (i.e., Δk = 1).
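The eigenentropy-driven neighbourhood selection described above can be sketched as follows. This is a minimal illustration assuming NumPy and SciPy are available; function and variable names are our own, not from the original implementation:

```python
import numpy as np
from scipy.spatial import cKDTree

def eigenentropy(eigvals):
    """Shannon entropy of the normalised eigenvalues (Weinmann et al. 2015)."""
    lam = eigvals / eigvals.sum()
    lam = np.clip(lam, 1e-12, None)  # guard against log(0)
    return -np.sum(lam * np.log(lam))

def optimal_neighbourhood(points, query_index, k_min=10, k_max=100):
    """Select the k in [k_min, k_max] minimising the eigenentropy of the
    local covariance eigenvalues around points[query_index]."""
    tree = cKDTree(points)  # in practice, build the tree once for all queries
    # one search for the largest k; smaller neighbourhoods are prefixes of it
    _, idx = tree.query(points[query_index], k=k_max + 1)  # +1 includes the point itself
    best_k, best_e = k_min, np.inf
    for k in range(k_min, k_max + 1):
        neigh = points[idx[:k + 1]]
        cov = np.cov(neigh.T)
        eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # λ1 ≥ λ2 ≥ λ3
        e = eigenentropy(eigvals)
        if e < best_e:
            best_k, best_e = k, e
    return best_k, best_e
```

In practice the k-d tree would be built once and queried per point; the per-point loop shown here is for clarity only.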
Using the eigenvalues, we construct a set of features f_i ∈ R^4, which characterise the neighbourhood's local dimensionality and geometry. We use linearity, planarity, scattering (Demantké et al. 2012) and verticality (Guinard and Landrieu 2017):

Linearity = (λ_1 − λ_2)/λ_1,  Planarity = (λ_2 − λ_3)/λ_1,  Scattering = λ_3/λ_1

with verticality computed from the vertical component of the principal directions weighted by their eigenvalues. The first three features are often referred to as dimensionality. Linearity describes how well the neighbourhood represents a 1-dimensional straight line, while planarity describes how well it fits to a 2-dimensional plane. Similarly, sphericity (also referred to as scattering in the literature) measures how well the neighbourhood resembles a sphere. Verticality indicates the geometric orientation of the neighbourhood; for example, Verticality_min = 0 represents a horizontal orientation whereas Verticality_max = 1 represents a vertical orientation (Guinard and Landrieu 2017). Examples of these features can be seen in Figure 2.
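As an illustration, the four features can be computed from a neighbourhood's covariance eigen-decomposition roughly as follows. The verticality formulation here is our reading of Guinard and Landrieu 2017 (the vertical component of the eigenvalue-weighted principal directions) and should be treated as an assumption:

```python
import numpy as np

def local_features(neigh):
    """Linearity, planarity, scattering and verticality of a neighbourhood.

    `neigh` is an (m, 3) array of the m nearest neighbours of a point.
    """
    cov = np.cov(neigh.T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # λ1 ≥ λ2 ≥ λ3
    lam = np.clip(eigvals[order], 1e-12, None)
    u = eigvecs[:, order]
    linearity = (lam[0] - lam[1]) / lam[0]
    planarity = (lam[1] - lam[2]) / lam[0]
    scattering = lam[2] / lam[0]
    # unary direction: principal directions weighted by their eigenvalues
    uhat = np.abs(u) @ lam
    verticality = (uhat / np.linalg.norm(uhat))[2]   # z component, in [0, 1]
    return np.array([linearity, planarity, scattering, verticality])
```

A flat horizontal patch then scores high on planarity and near zero on verticality, while a vertical pole scores high on both linearity and verticality, matching the behaviour described in the text.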

Adjacency graph structure
A graph structure can be used to capture how different entities are related to one another. The graph nodes (or vertices, as they are sometimes called) represent a singular entity, while edges connecting nodes represent the relationship between entities. The edges may be either directed, such that they can be traversed only in a single direction, or undirected, such that they can be traversed in either direction. Graphs are commonly used in machine learning to represent probabilistic models. For example, Bayesian networks, Markov random fields (MRF) and conditional random fields (CRF) are all graphical models. Additionally, graphical models may also be used as the basis of a graph CNN, a generalisation of convolution operations to arbitrarily structured graphs (Landrieu and Simonovsky 2017). Applied to point clouds, graphical models can be used both as a structure and for data analysis (Bronstein et al. 2017). Niemeyer et al. 2011 proposed graphical models to encode the spatial relationship between points into a graph structure called an adjacency graph. Furthermore, they showed how point cloud density and the number of adjacent points affect this construction. Regarding this, they concluded that a larger neighbourhood has the potential to better represent adjacency, albeit at a significant computational trade-off. Refining this conclusion, Guinard and Landrieu 2017 advocated a graph that represents the adjacency of the 10 nearest points. Note that the neighbourhood of points represented in the adjacency graph is different to the neighbourhood used for feature extraction.
To encode the spatial relationship between points, the point cloud is represented using an undirected adjacency graph G_nn = (V, E_nn). The set of nodes V = {V_1, …, V_n} is constructed from each point in the point cloud, whereby each point p_i is represented by its associated feature vector f_i, and the edges E_nn encode the adjacency relationship of the 10 nearest neighbour points (Niemeyer et al. 2011). Segmentation is then a process in which the graph is split optimally into non-overlapping connected components. These splits are computed using the l_0-cut pursuit algorithm (Landrieu and Obozinski 2017) and defined as the vector g* ∈ R^(4×n) which minimises the following generalised minimum partition model:

g* = arg min_g Σ_{i ∈ V} ||g_i − f_i||² + ρ Σ_{(i,j) ∈ E_nn} w_{i,j} [g_i ≠ g_j]

with g as the variable over which the minimisation is performed. The Iverson bracket [⋅] yields 1 if the internal expression is true, and 0 otherwise. The edge weight w ∈ R_+ is chosen to be linearly decreasing with respect to the edge length, and the factor ρ is the regularisation strength, which determines the coarseness of the resulting partition (Landrieu and Simonovsky 2017). This formulation ensures that the resulting point cloud segments correspond to similar values of f without the need to define a maximum size for the segments. The point cloud segments are represented as the set S = {S_1, …, S_h}, where h is the number of segments returned by the cut adjacency graph. For clarity, the segments are the non-overlapping connected components. The segments are subsets of the original point cloud and the number of points varies per segment; see Figure 3.
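The construction of the 10-nearest-neighbour adjacency graph can be sketched as below, assuming NumPy and SciPy. The normalisation of the linearly decreasing edge weights is an illustrative choice, and the subsequent minimisation itself relies on the l_0-cut pursuit reference implementation of Landrieu and Obozinski 2017, which is not shown here:

```python
import numpy as np
from scipy.spatial import cKDTree

def adjacency_graph(points, k_adj=10):
    """Build the undirected adjacency graph G_nn = (V, E_nn) over a point cloud.

    Returns an (m, 2) array of undirected edges (node index pairs) and an
    (m,) array of edge weights that decrease linearly with edge length,
    rescaled so the longest edge still has a small positive weight.
    """
    tree = cKDTree(points)
    dist, idx = tree.query(points, k=k_adj + 1)     # column 0 is the point itself
    src = np.repeat(np.arange(len(points)), k_adj)
    dst = idx[:, 1:].ravel()
    length = dist[:, 1:].ravel()
    # de-duplicate: keep each undirected edge once
    lo, hi = np.minimum(src, dst), np.maximum(src, dst)
    edges, keep = np.unique(np.c_[lo, hi], axis=0, return_index=True)
    length = length[keep]
    weight = 1.0 - 0.9 * length / length.max()      # linear decrease, in [0.1, 1]
    return edges, weight
```

The edges and weights, together with the per-node feature vectors f_i, form the input to the generalised minimum partition solver.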
To increase the chances of finding smaller objects that may have been missed in the initial segmentation, a conditional multi-scale partitioning scheme is proposed.This secondary conditional partition considers only the largest 10% of planar segments.These are passed through the segmentation process again, with the neighbourhood for feature extraction adjusted to within a radius defined by the point density.If new components are found, then they are added to the set of segments.Otherwise, the segment is assumed to have continuous local shape and considered to be a single segment.This process ensures ground level objects are detected in large segments of ground level points.
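The control flow of the conditional multi-scale scheme can be sketched as follows. This is a high-level illustration only: `planarity_of` and `resegment` are hypothetical stand-ins for the feature and partitioning steps described above, and the planarity threshold is an assumption:

```python
def conditional_repartition(segments, planarity_of, resegment, top_fraction=0.10):
    """Re-segment the largest 10% of planar segments at a finer scale.

    `segments` is a list of point-index lists; `planarity_of` scores a
    segment's planarity; `resegment` re-runs the partitioning on one segment
    with a density-adjusted neighbourhood radius.
    """
    planar = [s for s in segments if planarity_of(s) > 0.5]  # threshold is illustrative
    planar.sort(key=len, reverse=True)
    n_top = max(1, int(top_fraction * len(planar))) if planar else 0
    out = list(segments)
    for seg in planar[:n_top]:
        parts = resegment(seg)
        if len(parts) > 1:          # new components found: replace the segment
            out.remove(seg)
            out.extend(parts)
    return out                      # otherwise the segment is kept as a single segment
```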

CLASSIFICATION
Using the 3D point clouds directly (instead of converted 2D representations) in the classification method is essential to the real-world applicability, efficiency and generalisability of the overall methodology. Conversion to other representations would not only result in information loss but would require an in-between representation, such as mesh models (Su et al. 2015). This conversion alone remains a difficult task when applied to fine-resolution real-world 3D point cloud data. Additionally, 2D multi-view methods are sensitive to viewpoint selection and occlusions within query instances. The real-world extracted objects are likely to contain noise from background objects (i.e., vegetation) and registration artefacts. While more recent multi-view 2D methods achieve leading accuracy scores on benchmarks, they rely on observational colour (RGB) information (Yu, Meng and Yuan 2018). Many fine-resolution point cloud datasets (especially those from LiDAR sensors) do not include this information as it requires specialised equipment and processing to collect and register the colour dimensions to the spatial points. Even if they are included, several factors present during the scanning process (e.g., glare, moisture, motion blur, camera focus, etc.) can contribute to inconsistent RGB values. This is not to say that RGB or other multispectral data should not be used when available, but that to remain generally applicable, the classification method should ingest the point cloud data directly and not be dependent on any additional observed features.
For the classification sequence, the transfer learning paradigm is followed. The ConvPoint network is pre-trained using generic object models from the ModelNet40 benchmark. An example of the adapted ConvPoint network is provided in Figure 5. The ConvPoint CNN was chosen because of its flexibility. It does not require a set input size and is robust to the permutation, scale and translation of the input 3D point cloud (Boulch 2019). As the name suggests, the ModelNet40 dataset is a collection of 3D models from across 40 different object classes. It is important to note that the classes available in ModelNet40 do not directly include any of the target labels for classification (e.g., cultural heritage objects). ConvPoint directly ingests the spatial coordinate points through an adaptation of discrete kernel convolutions to be continuous. A simple MLP learns a dense geometrical weighting function that independently distributes the input points onto a kernel. At each layer, the convolution operation effectively mixes the estimation in the feature space and geometrical space. The derived kernel is then an explicit set of points associated with weights. Normalisation is added according to the input set size (Boulch 2019). The final fully connected layer of the pre-trained network is set aside so that the preceding weighted layers can be leveraged as a fixed feature extractor.
Using the last convolutional layer, we compute a 1 × 512 feature vector x_h, which serves as a global descriptor for each segment S_h. This vector is used to define an abstract feature space which is optimised for the separation of the training objects. Transfer learning, as a concept, assumes that this feature space can also be used to separate the new test objects.
A simple MLP was trained to learn the differences in the feature space and thereby leverage the learned knowledge to classify new data and apply semantic labels L. The MLP itself is formulated as one hidden layer with 100 units and one output layer, and uses the logistic sigmoid activation function σ(x) = 1/(1 + e^(−x)). The features in x are assumed to be normally distributed and as such each is standardised by setting the feature's mean at 0 and scaling to unit variance of 1; i.e., the standard score z = (x − mean(x))/std(x) is computed per feature. In doing so, we found this to increase classification accuracy results by at least 5%. See Figure 5 for an illustration of the transfer learning procedure. We test and compare a variety of supervised classifiers for the classification task, which can be found in Section 3.
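The classifier described above maps directly onto off-the-shelf components; for instance, a scikit-learn pipeline combining per-feature standardisation with a 100-unit, sigmoid-activation MLP. Hyperparameters other than those stated in the text (e.g., `max_iter`) are illustrative:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Standardise each feature to zero mean / unit variance, then classify with an
# MLP of one 100-unit hidden layer and logistic sigmoid activation.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100,), activation="logistic",
                  max_iter=1000, random_state=0),
)
```

In use, `clf.fit(X_train, y_train)` is called on the 512-dimensional training descriptors and `clf.predict(X)` returns the predicted class labels for new segments.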

ModelNet10
The Princeton ModelNet project provides a collection of 3D CAD object models split into two benchmarks: a 40-class subset and a 10-class subset, known as ModelNet40 and ModelNet10, respectively. The ModelNet10 dataset was used to analyse the performance of the proposed transferred ConvPoint global descriptor in the classification process. The dataset was divided into training and validation sets. The CAD models were converted into 3D point clouds by randomly sampling points along the model surfaces. Table 1 shows summary statistics for all 10 classes and their corresponding training and test samples.
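Conversion from a CAD mesh to a point cloud by random surface sampling can be sketched with the standard area-weighted scheme below. The exact sampling used in the experiments is not specified in the text, so this is illustrative:

```python
import numpy as np

def sample_surface(vertices, faces, n_points, seed=0):
    """Convert a triangle mesh to a point cloud by area-weighted random
    sampling of its surface."""
    rng = np.random.default_rng(seed)
    tri = vertices[faces]                             # (F, 3, 3)
    a, b, c = tri[:, 0], tri[:, 1], tri[:, 2]
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    probs = areas / areas.sum()                       # pick triangles by area
    choice = rng.choice(len(faces), size=n_points, p=probs)
    # barycentric coordinates, uniform over each chosen triangle
    r1 = np.sqrt(rng.uniform(size=(n_points, 1)))
    r2 = rng.uniform(size=(n_points, 1))
    return (1 - r1) * a[choice] + r1 * (1 - r2) * b[choice] + r1 * r2 * c[choice]
```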

Cultural Heritage Scenes
Two separate cultural heritage sites represented as 3D point cloud scenes were chosen for the evaluation of the proposed methodology applied to real-world data. The digitised cultural heritage sites were provided by the burial ground management system team at Atlantic Geomatics (UK) Limited. Scene 1 is a burial ground from Adlington civil parish in North West England. The scene is a subset of a much larger scene; the same large scene from which the classification training data were acquired. Scene 2 is a separate dataset: a burial ground located in Staines-upon-Thames in South East England. It is not taken from a larger scene. The scenes were collected by a terrestrial LiDAR sensor platform with a relative accuracy of 2 to 3 cm. In the analysis, four separate semantic classes of objects were targeted: memorial objects (grave markers such as headstones, stone crosses, sarcophagi, etc.), infrastructure (buildings, walls, gates, street poles, etc.), vegetation (tall grasses, shrubs, trees, canopy leaves, etc.) and ground (grass terrain, roads, paths, etc.).

Evaluation metrics
Following the standard convention from the field of machine learning, Precision, Recall and F1-score, along with their macro- and weighted-average variations, were used as evaluation metrics. In a classification context, these metrics are ratios defined with respect to the number of true positives TP, false positives FP and false negatives FN returned per class. The recall, defined as TP/(TP + FN), indicates the classifier's ability to find all positive samples. Likewise, precision is the fraction TP/(TP + FP) that reflects the ability to return more relevant results than irrelevant ones. The F1-score is a measure of the classifier's accuracy. Formulated as F1 = 2 × (Precision × Recall)/(Precision + Recall), it considers both precision and recall. All three metrics produce a score in the range [0, 1], reaching their worst value at 0 and best value at 1. The macro-average variation is then the unweighted mean of the per-class scores. Similarly, the weighted-average is the mean of the per-class scores weighted by the number of samples in each class.
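These metrics and their averaged variants are available directly in scikit-learn; a toy example with the four scene classes (the label sequences are invented for illustration):

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = ["memorial", "memorial", "ground", "vegetation", "infrastructure", "ground"]
y_pred = ["memorial", "ground",   "ground", "vegetation", "infrastructure", "ground"]

# Per-class scores averaged two ways: 'macro' weights every class equally,
# 'weighted' weights each class by its number of true samples.
for avg in ("macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```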

Processing Platform
Experiments were run on a Unix machine with 2.7 GHz Intel Core i5, 16GB RAM and SSD.The combined process of segmentation and classification had an average runtime of 25 minutes for a point cloud of roughly 7 million points.
Several factors present during the scanning process can contribute to inconsistent observational colour (R, G, B) values. Furthermore, while some processes of point cloud acquisition, such as photogrammetry, inherently provide data as R, G, B, specialised equipment and processing are needed to register the colour dimensions to the point clouds generated by LiDAR sensors. As a result, many fine spatial resolution point cloud datasets do not include this information. In the proposed methodology we restrict the points to contain only the x, y, z coordinate information.

Comparison of Classification Algorithms within the Proposed Methodology
Baselines are used generally to determine how well an algorithm performs. Thus, it can be a problem when a baseline for the specific domain does not exist. Applying the intuition behind transfer learning, we conducted an initial experiment to gauge the effectiveness of our approach. We chose to assess a variety of supervised classification algorithms from the Scikit-learn Python package (Pedregosa et al. 2011) and test each against the ModelNet10 benchmark. A 'best case' for the baseline can be provided with the test data matching the data used to train the CNN feature extractor. To better reflect the objects recovered from the segmentation process, we chose to vary the number of points sampled, per model, to between 32 and 2048 points. The ModelNet10 point clouds were then given a global descriptor set using the adapted ConvPoint network. This allowed us to explore how different classifiers interact with the data and determine the most appropriate approach for classification.
We interpret from the results in Table 2 that the multilayer perceptron (MLP) network implementations are the most promising among the tested methods, although the linear support vector machine (SVM) achieves similar scores, placing it behind the MLP by as little as 1% in the majority of metric categories. Within this experiment we investigated and compared the behaviours of different MLP activation functions. In particular, the weighted-average F1-score for the MLP with sigmoid activation performed particularly well, with an increase of at least 1-2% over the other MLP formulations. Consequently, this translates to a 7% and 5% F1-score increase over the next most accurate methods behind the SVM; the random forest and k-nearest neighbours classifiers, respectively. The Gaussian Naive Bayes and decision tree classifiers were the least accurate, with the MLP (sigmoid) showing an increase of 21% and 19%, respectively.

Handcrafted global descriptors versus the transfer learning approach
The comparison of the proposed transfer learning global descriptor and commonly applied global features available in the open source Point Cloud Library (Rusu and Cousins 2011) is shown in Figure 6; although demonstrated here on benchmark data, this comparison should apply generally to any real-world dataset.
The classification results obtained for real-world data from a pre-segmented cultural heritage scene are presented in Table 4. Analysis of these scores indicates that the MLP classifier outperformed the other tested methods; thus, the MLP demonstrated an ability to handle the real-world data. This is in support of the earlier assessments of the MLP classifiers. It is interesting to note that, given real-world data, the random forest model performed closely with the linear SVM, and in fact achieved the largest macro-average precision score; this is in contrast to the earlier experiment. Based on these results, we concluded that an MLP with sigmoid activation function is the most suitable, of the tested classifiers, for use within the proposed methodology.

Evaluation of Methodology on Cultural Heritage Scenes
We applied the methodology to two separate cultural heritage scenes, the results of which are shown in Tables 5 and 6. The same training data were used in both scenes to train the MLP classifier. For both scenes, the proposed approach achieved a weighted average of at least 91% across all metrics. In general, classification of memorial objects was highly accurate, with an F1-score of 92% and 95% for scenes 1 and 2, respectively. The results from scene 2 illustrate how the proposed methodology generalises to a different dataset without retraining, even across two spatially distant regions. Classification accuracies for memorial, vegetation and infrastructure objects were similar to the scores in scene 1. However, there is a decrease in correctly determining segments that contain ground points. This is likely explained by the difference in terrain (e.g., slopes, flats, hills) between the scenes; without the addition of these landscape characteristics, the classification model can struggle to accommodate such changes.
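The distinction between the two averaging conventions used throughout the tables can be made concrete with scikit-learn. The labels below are toy data (only the class names mirror those used in the scenes); the example shows how a poorly classified rare class drags the macro average down more than the weighted average.

```python
# Macro-average treats every class equally; weighted-average weights each
# class by its support.  Toy labels for illustration only.
from sklearn.metrics import f1_score

y_true = (["memorial"] * 50 + ["ground"] * 30 +
          ["vegetation"] * 15 + ["infrastructure"] * 5)
# Predictions: the small infrastructure class is classified poorly.
y_pred = (["memorial"] * 48 + ["infrastructure"] * 2 +   # 2 errors
          ["ground"] * 30 +
          ["vegetation"] * 15 +
          ["infrastructure"] * 2 + ["memorial"] * 3)     # 3 errors

macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")
print(f"macro-average F1:    {macro:.3f}")
print(f"weighted-average F1: {weighted:.3f}")
```

Because the misclassified class has only 5 of 100 samples, the weighted average stays high while the macro average is pulled down, which is why the tables report the two values on either side of the slash.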

DISCUSSION
Misclassification of the cultural heritage data lies mainly in the infrastructure objects class. This is, in part, to be expected, as the definition of a memorial object is often subjective. Cultural heritage sites normally contain various items of street furniture, and these might themselves be a type of monument; e.g., a bench may belong to either the memorial or the infrastructure class depending on semantics alone, with no visually distinct reasoning. The same can be true for trees and shrubs. The experiment's classification index adheres strictly to memorial and non-memorial objects based on the manually labelled scene, which does not take this into account. This raises the question of how to impose semantic meaning on objects with little or no visually discerning attributes.

Pre-segmentation can also influence the classification results. It is possible for buildings and walls to be partitioned into smaller parts that share characteristics with headstone monuments or even vegetation, potentially resulting in misclassification. In this sense, classification results are contingent on the quality of the segmentation process. An alternative to object-wise segmentation would be a region-growing or point-wise algorithm. However, point density and noise have a direct impact on the time complexity of such methods; as a result, they have a limited ability to segment large-scale point clouds (Landrieu and Simonovsky 2017).

Experiments showed that the proposed methodology is capable of running on a personal computer, although we note that RAM capacity was a limiting factor. Considerably large point clouds may need to be divided manually into smaller regions beforehand, or else downsampled, provided that there is no great loss in visual representation: the objects in question should still be easily identifiable by an operator.
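The downsampling mentioned above is commonly done with a voxel grid, keeping one representative point per occupied cell. The following is a minimal NumPy sketch of that idea (production libraries such as the Point Cloud Library provide equivalents); the function name and voxel size are illustrative.

```python
# Minimal voxel-grid downsampling: keep one representative point (the
# centroid) per occupied voxel.  A simple way to shrink a cloud before
# segmentation when memory is the bottleneck.
import numpy as np

def voxel_downsample(points, voxel_size):
    """points: (N, 3) array; returns the centroid of each occupied voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel index and average each group.
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True,
                                   return_counts=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]

rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 10.0, size=(100_000, 3))
small = voxel_downsample(cloud, voxel_size=0.5)
# With 0.5 m voxels on a 10 m cube, at most (10 / 0.5)**3 = 8000 points remain.
```

The voxel size trades memory against fidelity: it should be small enough that, as noted above, the objects of interest remain easily identifiable by an operator after downsampling.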
In the future, we are interested in exploring how a more fine-grained classification could be achieved within the object classes. For example, many different burial ground monument types are found in a single cemetery. Additionally, grave markers from different geographical areas, time periods and cultures likely appear distinct from one another. With the general public becoming more interested in family ancestry and genealogy, there is a real need for this information to be available and provided at scale. Similarly, we are interested in ways to incorporate new object variations, unknown objects and additional classes into the methodology. The grounds, building infrastructure, serviceable equipment, etc., are all objects of importance for the maintenance of cultural heritage sites. It is, therefore, of value if the classification model does not have to be completely retrained each time a new variation is seen. Furthermore, based on the results of the transfer learning approach within this research, we are motivated to explore the use of various point cloud specific neural networks as feature extractors and to evaluate their relative performances.

CONCLUSIONS
We presented a new methodology for the automatic identification and extraction of objects from 3D point cloud representations of cultural heritage sites. This methodology addressed how point cloud data can be used directly to map and extract objects in archaeological and cultural heritage contexts, without the need to rasterise or transform the data into another representation beforehand. Benchmarking exercises established that, compared to several classification methods, the proposed methodology achieves a statistically higher accuracy for both artificial and real-world datasets. We applied the methodology to the task of locating, extracting and labelling grave marker objects from two cultural heritage sites. The results demonstrated that the proposed approach can leverage transfer learning to separate objects from the scene and distinguish between multiple classes. We believe that this is the first time that such a methodology has been developed for the automatic and direct extraction and labelling of memorial objects from cultural heritage sites using 3D point cloud data.

Figure 1
Figure 1 Concept of the proposed segmentation methodology. Solid arrows represent the flow of processes; dashed arrows represent conditional processes. A complex 3D point cloud scene is taken as input and divided into multiple segments based on the set of features extracted in relation to the points' local neighbourhood. These segments are subsets of the original scene, themselves being point clouds, and are assumed to represent the objects contained in the scene.

Figure 2
Figure 2 Geometric features shown for an example image containing grave markers: (a) linearity, (b) planarity, (c) scattering and (d) verticality. Point cloud data provided by Atlantic Geomatics (UK) Limited.
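The four features in Figure 2 are commonly defined from the eigenvalues of the local covariance matrix of a point's neighbourhood. The sketch below uses those standard eigenvalue-based definitions (the paper's exact formulation may differ slightly); the function and test geometry are illustrative.

```python
# Common eigenvalue-based definitions of the Figure 2 features, computed
# from a point's local neighbourhood.  The paper's exact formulation may
# differ slightly.
import numpy as np

def geometric_features(neighbourhood):
    """neighbourhood: (k, 3) array of a point's neighbours."""
    centred = neighbourhood - neighbourhood.mean(axis=0)
    cov = centred.T @ centred / len(neighbourhood)
    eigval, eigvec = np.linalg.eigh(cov)   # ascending eigenvalues
    l3, l2, l1 = eigval                    # so l1 >= l2 >= l3
    normal = eigvec[:, 0]                  # axis of least variance
    return {
        "linearity":  (l1 - l2) / l1,
        "planarity":  (l2 - l3) / l1,
        "scattering": l3 / l1,
        # High when the surface normal is horizontal, i.e. the surface
        # itself is vertical (e.g. a headstone face).
        "verticality": 1.0 - abs(normal[2]),
    }

rng = np.random.default_rng(0)
# A noisy vertical plane in x-z, like a headstone face: expect high
# planarity and high verticality.
plane = np.column_stack([rng.uniform(0, 1, 500),
                         rng.normal(0, 0.01, 500),
                         rng.uniform(0, 1, 500)])
features = geometric_features(plane)
```

By construction, linearity, planarity and scattering sum to one, so the three values describe how the local neighbourhood is distributed between line-like, surface-like and volumetric behaviour.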

Figure 3
Figure 3 An example of a partitioned scene. The segments are assigned a colour randomly for demonstration purposes. Point cloud data provided by Atlantic Geomatics (UK) Limited.

Figure 4 Figure 5
Figure 4 Illustration of the proposed classification methodology. The segments produced by the segmentation methodology are used as input; each is a 3D point cloud. Solid arrows represent the flow of processes. The dotted arrow indicates that the descriptors in the training set are created using the same pre-trained CNN model. The output of this process is a set of labels that relate the input point cloud segments to their predicted class.

Figure 6
Figure 6 Examples of a classified region from different classifier methods. Ground points are represented in yellow; vegetation is in blue; infrastructure is in orange; and memorial objects are marked in green and red, showing two sub-class identifications (headstone and cross). Point cloud data was provided by Atlantic Geomatics (UK) Limited.

Table 1
Classification index for the ModelNet10 dataset, including the number of training and test samples.

Table 2
Results of the experiment to determine a baseline for classification. The values before the slash represent the macro-average score and those after the slash the weighted-average score. The largest value for each metric is shown in bold.

Table 3
Classification results comparing global descriptors from the Point Cloud Library to the proposed transferred global descriptor on the ModelNet10 dataset. The values before the slash represent the macro-average score and those after the slash the weighted-average score. The largest value for each metric is shown in bold.

Table 5
Precision, recall and F1-score of MLP classification applied to Scene 1. Scores are an average result after 100 runs.

Table 6
Precision, recall and F1-score of MLP classification applied to Scene 2. Scores are an average result after 100 runs.