Commit 4ce84df

hannahbast and patrickbr authored and committed
Separate prefix for untagged nodes with --iri-prefix-for-untagged-nodes (#120)
As of this writing, the complete OSM data has 11 B objects, around 90% of which are untagged nodes, that is, nodes with zero tags (key-value pairs). Most queries involve a join of a set of tagged objects with their geometry, a minimal example being `?object osmkey:amenity "post_box" ; geo:hasGeometry ?geometry .` This is a join of a relatively small table (400 k rows in this case) with a very large table (11 B rows). In most queries, this will be a merge join, which is efficient in principle. However, the matching rows are typically "dense" in the large table, that is, spread over its whole range, because IRIs for tagged and untagged nodes both start with `https://www.openstreetmap.org/node/` followed by a number that has nothing to do with whether the node is tagged or untagged. The merge join then has to scan through most of the very large table, which is expensive due to its sheer size.

With `--iri-prefix-for-untagged-nodes`, a distinct prefix can be given to untagged nodes, for example `http://www.openstreetmap.org/node/` (which has the nice side property of still working in a browser). The matching rows in the merge join from the example above are then all contained in a range of only around 1 B rows. An engine such as QLever, which is smart enough to figure out that it can skip the remaining 90% of the rows, then runs the merge join around 10 times faster.

When `--iri-prefix-for-untagged-nodes` is used with a prefix other than `https://www.openstreetmap.org/node/`, the objects of `geo:hasGeometry` and the subjects of `geo:asWKT` are also given names such that the prefix for tagged and untagged nodes is different, for example, `osm2rdfgeom:osmnode_tagged_1` and `osm2rdfgeom:osmnode_untagged_137`. That way, the same speedups can be achieved for joins with the `geo:asWKT` predicate. This is important because most OSM queries involve a join with `geo:hasGeometry/geo:asWKT`, that is, with both of these predicates.
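The effect of the distinct prefix on sort order can be illustrated with a minimal sketch. The two prefixes are the ones named in the message above; the helpers `nodeIri` and `untaggedSortsFirst` are hypothetical and are not part of `osm2rdf`. The sketch assumes that the engine orders IRIs lexicographically, which may differ in detail from what a particular engine does.

```cpp
#include <cstdint>
#include <string>

// Prefixes as given in the commit message; the helpers below are
// hypothetical and only illustrate the resulting sort order.
const std::string kTaggedPrefix = "https://www.openstreetmap.org/node/";
const std::string kUntaggedPrefix = "http://www.openstreetmap.org/node/";

// Build the IRI of a node, choosing the prefix by whether it has tags.
std::string nodeIri(uint64_t id, bool tagged) {
  return (tagged ? kTaggedPrefix : kUntaggedPrefix) + std::to_string(id);
}

// Compare an untagged and a tagged node IRI lexicographically. The two
// prefixes first differ at index 4 (':' versus 's'), and ':' has the
// smaller character code, so the result does not depend on the IDs.
bool untaggedSortsFirst(uint64_t untaggedId, uint64_t taggedId) {
  return nodeIri(untaggedId, false) < nodeIri(taggedId, true);
}
```

Under this assumption, all tagged node IRIs form one contiguous range in sorted order, which is what allows a merge join to skip the untagged range as a whole.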
The modified node IRIs appear in three kinds of triples in the output of `osm2rdf`. First, as subject and object of `geo:hasGeometry`. Second, as subject of `geo:asWKT`. Third, as object of `osmway:member_id`. Using the modified IRI in the first two places is trivial because when we write those triples, we are seeing the node for the first time and we know whether it is tagged or untagged. The third place is trickier because it happens later in the output, so we need to remember whether a node is tagged or untagged. So far, only the coordinates of each node were remembered. We remember the additional bit by using the most significant bit of the `y` coordinate. That way, no more space is used than before, which is important because of the sheer number of nodes.

NOTE 1: To implement this, we need a modified version of the headers `Location.h` and `NodeLocationsForWays.h` from `libosmium`. According to osmcode/libosmium#395, this change cannot be made in `libosmium` itself, at least for now. We therefore use modified copies. These may have to be updated when pulling a newer version of `libosmium`.

NOTE 2: There is still a bug in that the precomputed geometric relations (`ogc:sfContains`, etc.) don't use the modified IRIs when tagged and untagged nodes have different prefixes. This is easy to fix and will be done in a subsequent PR.

Co-authored-by: Patrick Brosi <[email protected]>
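The idea of storing the tagged/untagged bit inside the `y` coordinate can be sketched as follows. This is a minimal illustration and not the layout used in the modified `Location.h`: it assumes that `y` is a latitude stored as a 32-bit integer scaled by 10^7, and it biases the value so that the top bit becomes free. All names (`packY`, `unpackY`, `isTagged`, `kLatBias`, `kTaggedBit`) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: latitude is stored as degrees * 1e7, so valid values lie in
// [-900'000'000, 900'000'000]. This is a sketch, not the libosmium layout.
constexpr int32_t kLatBias = 900'000'000;
constexpr uint32_t kTaggedBit = 1u << 31;  // most significant bit

// Shift y into [0, 1'800'000'000], which is below 2^31, so the most
// significant bit is unused and can carry the "tagged" flag.
inline uint32_t packY(int32_t y, bool tagged) {
  assert(y >= -kLatBias && y <= kLatBias);
  const uint32_t biased = static_cast<uint32_t>(y + kLatBias);
  return tagged ? (biased | kTaggedBit) : biased;
}

// Read the flag back from the most significant bit.
inline bool isTagged(uint32_t packed) { return (packed & kTaggedBit) != 0; }

// Clear the flag and undo the bias to recover the original coordinate.
inline int32_t unpackY(uint32_t packed) {
  return static_cast<int32_t>(packed & ~kTaggedBit) - kLatBias;
}
```

The packed value still occupies four bytes per node, so the per-node storage does not grow, which matches the motivation given above.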
1 parent 83acf9b commit 4ce84df

26 files changed

Lines changed: 1105 additions & 135 deletions

CMakeLists.txt

Lines changed: 1 addition & 3 deletions
@@ -57,9 +57,6 @@ add_compile_options(-march=native)
 
 set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -fsanitize=address,undefined")
 
-# Enable fast-math
-add_compile_options(-ffast-math)
-
 # ----------------------------------------------------------------------------
 # Configure dependencies
 # ----------------------------------------------------------------------------
@@ -95,6 +92,7 @@ find_package(POPL REQUIRED)
 include_directories(SYSTEM ${POPL_INCLUDE_DIR})
 find_package(Protozero REQUIRED)
 include_directories(SYSTEM ${PROTOZERO_INCLUDE_DIR})
+set(OSMIUM_INCLUDE_DIR "${CMAKE_SOURCE_DIR}/vendor/osmcode/libosmium/include")
 find_package(Osmium REQUIRED COMPONENTS pbf xml)
 include_directories(SYSTEM ${OSMIUM_INCLUDE_DIRS})
 

include/osm2rdf/config/Config.h

Lines changed: 9 additions & 9 deletions
@@ -22,31 +22,26 @@
 
 #include <filesystem>
 #include <string>
+#include <thread>
 #include <unordered_set>
 #include <vector>
-#include <thread>
 
 #include "osm2rdf/config/Constants.h"
+#include "osm2rdf/ttl/Constants.h"
 #include "osm2rdf/ttl/Format.h"
 #include "osm2rdf/util/OutputMergeMode.h"
 
 namespace osm2rdf::config {
 
-enum GeoTriplesMode {
-  none = 0,
-  full = 1
-};
+enum GeoTriplesMode { none = 0, full = 1 };
 
 enum CompressFormat {
   NONE = 0,
   BZ2 = 1,
   GZ = 2,
 };
 
-enum SourceDataset {
-  OSM = 0,
-  OHM = 1
-};
+enum SourceDataset { OSM = 0, OHM = 1 };
 
 struct Config {
   // Select what to do
@@ -87,6 +82,11 @@ struct Config {
 
   bool addSpatialRelsForUntaggedNodes = true;
 
+  bool separate = true;
+
+  std::string iriPrefixForUntaggedNodes =
+      osm2rdf::ttl::constants::IRI_PREFIX__OSM_NODE_UNTAGGED;
+
   int numThreads = std::thread::hardware_concurrency();
 
   // Default settings for data

include/osm2rdf/config/Constants.h

Lines changed: 8 additions & 0 deletions
@@ -209,6 +209,14 @@ const static inline std::string NO_UNTAGGED_NODES_OPTION_LONG =
 const static inline std::string NO_UNTAGGED_NODES_OPTION_HELP =
     "Do not output untagged nodes";
 
+const static inline std::string IRI_PREFIX_FOR_UNTAGGED_NODES_INFO =
+    "IRI prefix for untagged nodes: ";
+const static inline std::string IRI_PREFIX_FOR_UNTAGGED_NODES_OPTION_SHORT = "";
+const static inline std::string IRI_PREFIX_FOR_UNTAGGED_NODES_OPTION_LONG =
+    "iri-prefix-for-untagged-nodes";
+const static inline std::string IRI_PREFIX_FOR_UNTAGGED_NODES_OPTION_HELP =
+    "IRI prefix for untagged nodes";
+
 const static inline std::string NO_UNTAGGED_WAYS_INFO =
     "Do not output untagged ways";
 const static inline std::string NO_UNTAGGED_WAYS_OPTION_SHORT = "";

include/osm2rdf/osm/FactHandler.h

Lines changed: 4 additions & 0 deletions
@@ -29,6 +29,8 @@
 
 namespace osm2rdf::osm {
 
+class LocationHandler;
+
 enum DateTimeType {
   invalid = 0,
   date_yyyy = 1,
@@ -42,6 +44,7 @@ class FactHandler {
  public:
   FactHandler(const osm2rdf::config::Config& config,
               osm2rdf::ttl::Writer<W>* writer);
+  void setLocationHandler(osm2rdf::osm::LocationHandler* locationHandler);
   // Add data
   void area(const osm2rdf::osm::Area& area);
   void node(const osm2rdf::osm::Node& node);
@@ -106,6 +109,7 @@ class FactHandler {
 
   const osm2rdf::config::Config _config;
   osm2rdf::ttl::Writer<W>* _writer;
+  osm2rdf::osm::LocationHandler* _locationHandler;
 };
 
 }  // namespace osm2rdf::osm
