spark

Type Members

class InputStreamLengthLimit extends FilterInputStream with InputStreamSentinel

Keep in mind that this is not a thread-safe implementation.

class OsmPbfFormat extends FileFormat with DataSourceRegister

FileFormat implementation to read OSM Pbf files.

Basically, it is implemented in three steps:

Take every split file
Search first block of data per split. 2. Extract all entities that are present in the split, starting from the first block found and ending at the last block whose header is in the split.

Example:

Its usage is like other Spark connectors:

scala> spark.read.format("osm.pbf").load("").select("id","latitude","longitude","tags").filter("type == 0 and size(tags) > 0").show(false)
+------+------------------+------------------+-------------------------------------------------------+
|id    |latitude          |longitude         |tags                                                   |
+------+------------------+------------------+-------------------------------------------------------+
|272629|39.01344329999999 |-77.03783400000007|{ref -> 31, highway -> motorway_junction}              |
|278454|38.99543909999999 |-76.87885730000008|{noref -> yes, highway -> motorway_junction}           |
|278495|39.09727710000001 |-76.8004246000001 |{noref -> yes, highway -> motorway_junction}           |
|278499|39.10949280000001 |-76.7842974000001 |{noref -> yes, highway -> motorway_junction}           |
|278665|39.13675140000001 |-76.75700080000009|{highway -> motorway_junction, noref -> yes}           |
|278679|39.16433720000001 |-76.73629450000011|{noref -> yes, highway -> motorway_junction}           |
|278702|39.20996720000001 |-76.68302190000011|{noref -> yes, highway -> motorway_junction}           |
|281260|39.047928000000006|-77.15067590000012|{noref -> yes, highway -> motorway_junction}           |
|281323|39.1811582        |-77.2515329000001 |{ref -> 15A, highway -> motorway_junction}             |
|281359|39.152438         |-77.29614050000009|{highway -> traffic_signals}                           |
|287905|38.843457699999995|-77.1106591000001 |{highway -> traffic_signals, traffic_signals -> signal}|
|287913|38.85178319999999 |-77.13165830000011|{highway -> traffic_signals}                           |
|287943|38.876419799999994|-77.05647270000011|{curve_geometry -> yes}                                |
|390841|38.41534949999998 |-77.42648520000014|{ref -> 140, highway -> motorway_junction}             |
|390920|38.333087899999974|-77.49828340000015|{ref -> 133, highway -> motorway_junction}             |
|390955|38.29694369999998 |-77.50489810000018|{ref -> 130B, highway -> motorway_junction}            |
|391002|38.24086209999997 |-77.50026980000017|{ref -> 126B, highway -> motorway_junction}            |
|396346|37.97631839999997 |-77.49251700000009|{noref -> yes, highway -> motorway_junction}           |
|396542|37.93273709999998 |-77.46779110000014|{ref -> 104, highway -> motorway_junction}             |
|396693|37.84187889999999 |-77.45102540000016|{ref -> 98, highway -> motorway_junction}              |
+------+------------------+------------------+-------------------------------------------------------+
only showing top 20 rows

Note

Dataframe schema used is:

root
 |-- id: long (nullable = true)
 |-- type: byte (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- nodes: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- relations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- relationType: byte (nullable = true)
 |    |    |-- role: string (nullable = true)
 |-- tags: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- info: struct (nullable = true)
 |    |-- version: integer (nullable = true)
 |    |-- timestamp: timestamp (nullable = true)
 |    |-- changeset: long (nullable = true)
 |    |-- userId: integer (nullable = true)
 |    |-- userName: string (nullable = true)
 |    |-- visible: boolean (nullable = true)

class OsmPbfRowIterator extends Iterator[InternalRow]

Value Members

object OSMDataFinder

java.io.InputStream enricher class that adds the firstBlockIndex functionality.
java.io.InputStream enricher class that adds the firstBlockIndex functionality.
The idea is that every BlobHeaders starts with the same pattern. So to be able to find the first BlobHeader in a chunk file, we are looking for that pattern.
Just before every BlobHeader, there is a set of 4 bytes that contains the size of the next BlobHeader, but of course it is not a fixed value like the pattern. It is necessary to keep it in mind to ignore these 4 bytes.
So this is the structure of every block that contains data:
1. 4 bytes with header size.
2. 9 bytes with the constant "0x0A, 0x07, OSMData"
3. Rest of the header.
4. Blob containing data.
See also
OSM PBF Format documentation
object OsmPbfFormat
object OsmPbfRowIterator
object OsmSqlEntity

package spark

Type Members

class InputStreamLengthLimit extends FilterInputStream with InputStreamSentinel

class OsmPbfFormat extends FileFormat with DataSourceRegister

class OsmPbfRowIterator extends Iterator[InternalRow]

Value Members

object OSMDataFinder

object OsmPbfFormat

object OsmPbfRowIterator

object OsmSqlEntity

Ungrouped