Commit 4ce84df
Separate prefix for untagged nodes with `--iri-prefix-for-untagged-nodes` (#120)
As of this writing, the complete OSM data has 11 B objects, around 90% of which are untagged nodes, that is, nodes with zero tags (key-value pairs). Most queries involve a join of a set of tagged objects with their geometry, a minimal example being `?object osmkey:amenity "post_box" ; geo:hasGeometry ?geometry .` This is a join of a relatively small table (400 k rows in this case) with a very large table (11 B rows). In most queries, this will be a merge join, which is efficient in principle. However, the matching rows are typically "dense" in the large table, that is, scattered across its whole range, because IRIs for tagged and untagged nodes both start with `https://www.openstreetmap.org/node/` followed by a number that has nothing to do with whether the node is tagged or untagged. The merge join then has to scan through most of the very large table, which is expensive due to its sheer size.
With `--iri-prefix-for-untagged-nodes`, a distinct prefix can be given to untagged nodes, for example `http://www.openstreetmap.org/node/` (note `http` instead of `https`, which has the nice side effect of still working in a browser). The matching rows in the merge join from the example above are then all contained in a range of only around 1 B rows. An engine such as QLever, which is smart enough to figure out that it can skip the remaining 90% of the rows, then runs the merge join around 10 times faster.
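The effect of the separate prefix on the join can be illustrated with a toy merge join over sorted IRI strings. This is only a sketch under simplifying assumptions, not QLever's actual implementation: the names `JoinResult` and `mergeJoinWithSkipping` are made up for this example, and a real engine joins on integer IDs rather than strings. The point is that the part of the large table that has to be scanned is bounded by the smallest and largest key of the small table.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical result type: the matching keys plus the number of rows of the
// large input that lie in the range the join had to look at.
struct JoinResult {
  std::vector<std::string> matches;
  std::size_t rowsScanned = 0;
};

// Toy merge join of two sorted key lists. Before merging, it jumps to the
// first row of `large` that can match and stops after the last such row.
inline JoinResult mergeJoinWithSkipping(const std::vector<std::string>& small,
                                        const std::vector<std::string>& large) {
  JoinResult result;
  if (small.empty()) return result;
  // Binary search for the boundaries of the relevant range in `large`.
  auto it = std::lower_bound(large.begin(), large.end(), small.front());
  auto end = std::upper_bound(it, large.end(), small.back());
  result.rowsScanned = static_cast<std::size_t>(std::distance(it, end));
  // Standard merge step, restricted to the relevant range.
  auto s = small.begin();
  while (it != end && s != small.end()) {
    if (*it < *s) {
      ++it;
    } else if (*s < *it) {
      ++s;
    } else {
      result.matches.push_back(*it);
      ++it;
      ++s;
    }
  }
  return result;
}
```

With a shared prefix, the tagged nodes are spread over the whole sorted table, so the relevant range covers almost all rows. With a separate prefix for untagged nodes, all tagged nodes sort next to each other and the relevant range shrinks to just the tagged rows.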
When `--iri-prefix-for-untagged-nodes` is used with a prefix other than `https://www.openstreetmap.org/node/`, the objects of `geo:hasGeometry` and the subjects of `geo:asWKT` are also given names such that the prefix for tagged and untagged nodes is different, for example, `osm2rdfgeom:osmnode_tagged_1` and `osm2rdfgeom:osmnode_untagged_137`. That way, the same speedups can be achieved for joins with the `geo:asWKT` predicate. This is important because most OSM queries involve a join with `geo:hasGeometry/geo:asWKT`, that is, with both of these predicates.
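The naming scheme described above can be sketched as follows. The struct `NodeIriPrefixes` and the functions `nodeIri` and `nodeGeometryName` are hypothetical names chosen for this illustration and are not taken from the `osm2rdf` sources. The sketch only covers the case where the two prefixes differ; the geometry name used when both prefixes are the same is not stated in this commit message and is therefore left out.

```cpp
#include <cstdint>
#include <string>

// Hypothetical container for the two node prefixes. The untagged prefix
// corresponds to the value passed via --iri-prefix-for-untagged-nodes.
struct NodeIriPrefixes {
  std::string tagged;
  std::string untagged;
};

// IRI of a node: the prefix depends on whether the node has tags.
inline std::string nodeIri(const NodeIriPrefixes& prefixes, std::uint64_t id,
                           bool isTagged) {
  return (isTagged ? prefixes.tagged : prefixes.untagged) + std::to_string(id);
}

// Name of the geometry object of a node, following the two examples from the
// commit message (osm2rdfgeom:osmnode_tagged_1, osm2rdfgeom:osmnode_untagged_137).
inline std::string nodeGeometryName(std::uint64_t id, bool isTagged) {
  return std::string("osm2rdfgeom:osmnode_") +
         (isTagged ? "tagged_" : "untagged_") + std::to_string(id);
}
```

Because both the node IRIs and the geometry names then sort into a tagged block and an untagged block, joins on either `geo:hasGeometry` or `geo:asWKT` can be restricted to the tagged block.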
The modified node IRIs appear in three kinds of triples in the output of `osm2rdf`. First, as subject and object of `geo:hasGeometry`. Second, as subject of `geo:asWKT`. Third, as object of `osmway:member_id`. Using the modified IRI in the first two places is trivial because when we write those triples, we are seeing the node for the first time and we know whether it is tagged or untagged. The third place is trickier because these triples are written later in the output, so we need to remember whether a node is tagged or untagged. So far, only the coordinates of each node were remembered. We remember the additional bit by using the most significant bit of the `y` coordinate. That way, no more space is used than before, which is important because of the sheer number of nodes.
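The bit trick can be sketched as follows. This is an illustration under stated assumptions and not the code from the modified `Location.h`: the names `packY`, `isTagged`, and `unpackY` are hypothetical. The sketch assumes that `y` is the latitude stored as a 32-bit fixed-point integer with seven decimal places, so that its absolute value is at most 900,000,000, which is less than 2^30. For such values, bit 31 of the two's complement representation always equals bit 30, so bit 31 can hold the flag and be restored later from bit 30.

```cpp
#include <cstdint>

// Mask for the most significant bit, which stores the "tagged" flag.
constexpr std::uint32_t kTaggedMask = 0x80000000u;
// Mask for bit 30, which carries the sign for values with |y| < 2^30.
constexpr std::uint32_t kSignCopyMask = 0x40000000u;

// Store the flag in bit 31 of y. Requires -2^30 <= y < 2^30.
constexpr std::uint32_t packY(std::int32_t y, bool tagged) {
  std::uint32_t bits = static_cast<std::uint32_t>(y) & ~kTaggedMask;
  return tagged ? (bits | kTaggedMask) : bits;
}

// Read the flag back.
constexpr bool isTagged(std::uint32_t packed) {
  return (packed & kTaggedMask) != 0;
}

// Recover the original coordinate by copying bit 30 back into bit 31.
constexpr std::int32_t unpackY(std::uint32_t packed) {
  std::uint32_t bits = packed & ~kTaggedMask;
  bits |= (bits & kSignCopyMask) << 1;
  // Conversion to a signed type is modular since C++20; common compilers
  // behaved the same way before that.
  return static_cast<std::int32_t>(bits);
}

// Round trips for the extreme latitudes.
static_assert(unpackY(packY(-900000000, true)) == -900000000, "negative y");
static_assert(unpackY(packY(900000000, false)) == 900000000, "positive y");
static_assert(isTagged(packY(-900000000, true)), "flag is set");
static_assert(!isTagged(packY(900000000, false)), "flag is clear");
```

The stored value still occupies 32 bits per node, which matches the goal of not using more space than before.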
NOTE 1: To implement this, we need a modified version of the headers `Location.h` and `NodeLocationsForWays.h` from `libosmium`. According to osmcode/libosmium#395, this change cannot be made in `libosmium` itself, at least for now. We therefore use modified copies. These may have to be updated when pulling a newer version of `libosmium`.
NOTE 2: There is still a bug in that the precomputed geometric relations (`ogc:sfContains`, etc.) don't use the modified IRIs when tagged and untagged nodes have different prefixes. This is easy to fix and will be done in a subsequent PR.
Co-authored-by: Patrick Brosi <[email protected]>
1 parent 83acf9b commit 4ce84df
26 files changed
Lines changed: 1105 additions & 135 deletions
File tree
- include/osm2rdf
  - config
  - osm
  - ttl
- src
  - config
  - osm
  - ttl
- tests
  - issues
  - osm
  - ttl
- vendor
  - osmcode