LinkGeoML performs applied research on developing novel machine learning methods for interlinking spatio-textual data. Within this scope, the project identifies four indicative, real-world use cases where we can provide solutions. All use cases address data integration and annotation problems; in some of them, interlinking is the direct objective, while in others it is an auxiliary task that facilitates the solution of the problem.
Use Case 1: Interlinking of Toponyms and Addresses
Commercial application: Geomarketing, Geocoding, Cadastration
The task of the use case is the automatic identification of the same spatio-textual entities across two data sources. The problem is also known as entity matching, entity resolution, record linkage or de-duplication. In our setting, the task applies both to toponyms found in proprietary cadastral and geomarketing databases and in open databases (e.g. Geonames), and to Points of Interest (POIs) in proprietary geomarketing and open databases (OpenStreetMap). One of the challenges in interlinking spatio-textual entities is that their coordinates are often either unreliable or insufficient to decide definitively whether two records refer to the same real-world entity. Thus, the name, together with any additional textual metadata that is available, plays a crucial role in interlinking spatio-textual entities. In this line of work, we extend traditional string-similarity-based training features with additional domain-specific features and combine them with machine learning algorithms, so as to classify candidate pairs of POIs or toponyms as matching or non-matching.
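As a rough illustration of what string-similarity features for a candidate pair might look like, the following minimal sketch computes a few of them using only the Python standard library. The record layout (`name`, `lat`, `lon`), the function names and the particular choice of features are assumptions made for this example and are not taken from the project's code.

```python
import math
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split())


def token_jaccard(a, b):
    """Jaccard overlap of the whitespace-separated tokens of two names."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    radius = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(min(1.0, h)))


def pair_features(rec_a, rec_b):
    """Feature dictionary for one candidate pair (hypothetical record layout)."""
    name_a, name_b = normalize(rec_a["name"]), normalize(rec_b["name"])
    sorted_a = " ".join(sorted(name_a.split()))
    sorted_b = " ".join(sorted(name_b.split()))
    return {
        # Character-level similarity of the normalized names.
        "char_ratio": SequenceMatcher(None, name_a, name_b).ratio(),
        # Order-insensitive similarity, useful when words are swapped.
        "sorted_token_ratio": SequenceMatcher(None, sorted_a, sorted_b).ratio(),
        "token_jaccard": token_jaccard(name_a, name_b),
        # Kept as a feature rather than a hard filter, since coordinates may be unreliable.
        "distance_m": haversine_m(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"]),
    }
```

A feature dictionary of this kind, computed for labeled matching and non-matching pairs, could then serve as the input of any standard binary classifier.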
Check our GitHub Toponym Interlinking repository for the first results of our work!
Use Case 2: Annotation of Points of Interest
Commercial application: Geomarketing, Geocoding
The task of the use case is the automatic annotation of Points of Interest (POIs) with categories from a predefined category hierarchy. The representation of the POIs at hand consists solely of the name and the pair of spatial coordinates (point coordinates) of the POI. Having only these two basic attributes makes the problem particularly challenging. To overcome this, we interlink the POIs at hand with neighboring POI information, as well as with POIs from OpenStreetMap, which increases the “contextual” information of our POIs and allows the extraction of more meaningful features to be fed into machine learning models.
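One simple way to picture such contextual features is the category distribution of already labeled POIs within a fixed radius of the POI to be annotated. The sketch below is only an illustration under that assumption; the record layout, the function names and the 200 m default radius are hypothetical and do not describe the project's actual feature set.

```python
import math
from collections import Counter


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    radius = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(min(1.0, h)))


def neighbor_category_features(poi, labeled_pois, radius_m=200.0):
    """Share of each category among labeled POIs within radius_m of poi.

    Returns an empty dict when no labeled POI lies within the radius.
    """
    counts = Counter()
    for other in labeled_pois:
        dist = haversine_m(poi["lat"], poi["lon"], other["lat"], other["lon"])
        if dist <= radius_m:
            counts[other["category"]] += 1
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()} if total else {}
```

The resulting shares can be appended to name-based features, so that a POI surrounded mostly by, say, cafes carries that signal into the classifier.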
Check our GitHub POI Classification repository for the first results of our work!
Use Case 3: Geocoding of Addresses
Commercial application: Geocoding, Geomarketing
The task of the use case is the accurate geocoding of addresses, that is, the identification of the exact (X,Y) coordinates of a POI, given as input only the POI’s address (street name and number, city, zip code). To this end, we examine how best to combine the results of several geocoders (ArcGIS, Nominatim, the proprietary geocoder of Eratosthenis S.A.), by training machine learning models that either select the most fitting coordinates or optimally combine them.
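To make the selection step concrete, the sketch below shows a simple, non-learned baseline: among the candidate coordinates returned for one address, it picks the medoid, i.e. the candidate with the smallest total distance to all others. This is a stand-in chosen for illustration, not the learned model described above, and the geocoder keys and function names are hypothetical.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    radius = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(min(1.0, h)))


def select_medoid(candidates):
    """Pick the candidate closest, in total, to all other candidates.

    candidates maps a geocoder name to its (lat, lon) result for one address.
    Returns (geocoder_name, (lat, lon)).
    """
    if not candidates:
        raise ValueError("at least one candidate is required")

    def total_distance(name):
        lat, lon = candidates[name]
        return sum(
            haversine_m(lat, lon, other_lat, other_lon)
            for other, (other_lat, other_lon) in candidates.items()
            if other != name
        )

    best = min(candidates, key=total_distance)
    return best, candidates[best]
```

A learned model could replace the `total_distance` criterion with a score predicted from features such as the address components and the pairwise distances between candidates.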
Use Case 4: Integration of Land Parcels and Roads
Commercial application: Cadastration
The task of the use case is the integration of information from multiple land parcels, so as to annotate the category of the final parcel that is derived from them. Specifically, land parcels describing the same property are available from several sources. These sources were created in widely different time periods, focus on different aspects of the property, and differ in how timely and accurate they are in representing the geometry of the current property. In this scenario, we focus on learning models that compare the individual initial parcels with the final one and derive a categorization of the final parcel, which depends on which initial parcel has contributed most to deriving the final parcel.
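The notion of an initial parcel "contributing" to the final one can be pictured as the share of the final parcel's area that each initial parcel covers. The sketch below is a deliberately simplified illustration: it represents parcels as axis-aligned rectangles `(min_x, min_y, max_x, max_y)` in a projected coordinate system, whereas real parcels are arbitrary polygons and would need a geometry library. The function names and source keys are hypothetical, and the overlap share is only one possible input feature, not the project's model.

```python
def rect_area(rect):
    """Area of an axis-aligned rectangle (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = rect
    return max(0.0, max_x - min_x) * max(0.0, max_y - min_y)


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return rect_area(inter)


def contribution_shares(final_parcel, initial_parcels):
    """Fraction of the final parcel's area covered by each initial parcel."""
    final_area = rect_area(final_parcel)
    if final_area == 0.0:
        raise ValueError("final parcel has zero area")
    return {
        source: overlap_area(final_parcel, parcel) / final_area
        for source, parcel in initial_parcels.items()
    }


def dominant_source(final_parcel, initial_parcels):
    """Source whose parcel covers the largest share of the final parcel."""
    shares = contribution_shares(final_parcel, initial_parcels)
    return max(shares, key=shares.get)
```

For example, with a final parcel `(0, 0, 10, 10)` and initial parcels `(0, 0, 6, 10)` and `(5, 0, 10, 4)`, the shares are 0.6 and 0.2, so the first source would be considered dominant under this simplified criterion.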
An additional task of the use case is the interlinking/integration of road segments from two different sources. In this scenario, the first source is considered the most reliable regarding the road segments it contains, while the second source contains less accurate information but might include additional roads that are missing from the first source. The goal is to learn models that can identify whether a road segment of the second source corresponds to a segment of the first source or is a new segment.
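As an illustration of the decision involved, the following sketch applies a simple distance-threshold rule instead of a learned model: a segment from the second source is treated as a match if the mean distance of its vertices to the nearest reference polyline is below a threshold, and as new otherwise. It assumes planar (projected) coordinates in meters and polylines with at least two vertices; the function names, the road identifiers and the 5 m threshold are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_polyline_distance(p, line):
    """Distance from point p to the closest part of a polyline (>= 2 vertices)."""
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))


def mean_vertex_distance(candidate, reference):
    """Mean distance of the candidate's vertices to the reference polyline."""
    return sum(point_polyline_distance(p, reference) for p in candidate) / len(candidate)


def classify_segment(candidate, reference_roads, max_mean_dist=5.0):
    """Return ("match", road_id) or ("new", None) for one candidate segment."""
    best_id, best_dist = None, math.inf
    for road_id, line in reference_roads.items():
        dist = mean_vertex_distance(candidate, line)
        if dist < best_dist:
            best_id, best_dist = road_id, dist
    if best_dist <= max_mean_dist:
        return ("match", best_id)
    return ("new", None)
```

Because the distance is measured only from the candidate to the reference, a short candidate lying along a long reference road also counts as a match. A learned model could use this distance as one feature among others, such as differences in orientation or length.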