Sparse Geospatial Data
In this tutorial we will walk through a real-world example of ingesting sparse geospatial data from a CSV file hosted on S3 into a TileDB array that is also hosted on S3. It is recommended to first read the tutorials on sparse arrays and attributes, as well as the tutorial on working with S3. The code example in this tutorial is written in C++, but by now you should be able to port it to other languages using the TileDB APIs.
Project setup
For this tutorial we will ingest sparse geospatial data into a TileDB array and perform some basic slices on it. In particular, we will ingest a comma-separated values (CSV) file containing US ZIP codes, each with an associated latitude/longitude coordinate pair. The raw data used for this tutorial can be found in this GitHub Gist (credit @erichurst).
The tutorial will also assume you’ve already installed a TileDB release with S3 support on your system (see the Installation page for instructions on how to do that).
To get started, it will be easiest to use the example CMake project as a template that we will fill in. Clone the TileDB repository and copy the examples/cmake_project directory to where you want to store this project:
$ git clone https://github.com/TileDB-Inc/TileDB.git
$ mkdir ~/tiledb_projects
$ cd TileDB
$ cp -R examples/cmake_project ~/tiledb_projects/zip_code_example
$ cd ~/tiledb_projects/zip_code_example
You should now be able to build the example program:
$ mkdir build
$ cd build
$ cmake .. && make
In the remainder of the tutorial, we will be building up src/main.cc, which will eventually contain our full code example.
Reading and writing CSV data on S3
For simplicity, in this tutorial we will implement a simple CSV parser for the specific format instead of linking with a full-fledged parsing library. Note however that the parsing methods we implement here are quite slow; for large datasets it would be necessary to use a dedicated CSV parsing library. Much of the runtime in ingesting the sample US ZIP code data is spent in these parsing functions.
In order to support reading CSV data hosted on S3, we will take advantage of TileDB’s virtual filesystem (VFS) interface, which exposes a simple read/write interface over files on any supported backend (which includes S3). Using TileDB’s VFS interface allows us to directly read from a CSV file hosted on S3 without having to use an S3 SDK to correctly form and issue requests to an API.
The input CSV data in this case is a very simple format, and looks like this:
ZIP,LAT,LNG
00601,18.180555, -66.749961
00602,18.361945, -67.175597
00603,18.455183, -67.119887
00606,18.158345, -66.932911
When visualized as a scatter plot, each ZIP code corresponds to a single unique point in 2D space. Longitude degrees west are represented in this dataset as negative values.
Add the following two functions at the top of src/main.cc. The first, split_csv_row(), will split a comma-delimited string of the above format into a vector of strings, one element per column. The second, parse_csv(), uses split_csv_row() to read each line of the input CSV file and make a callback on each row with that row's column values.
The input argument input_csv can be a full URI pointing to any backend that TileDB supports. That means to parse a CSV file from S3, all you need to do is pass a URI such as s3://bucket-name/path/to/file.csv and TileDB's VFS layer will take care of the rest. To parse a CSV file from your local disk, use a URI such as file:///path/to/file.csv or an unqualified file.csv. No code changes are required, illustrating the convenience of TileDB's VFS interface.
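A minimal sketch of these two functions is shown below. The function names split_csv_row() and parse_csv() come from the description above, but the exact signatures, the callback type, and the use of tiledb::VFS::filebuf to read the file through TileDB's VFS are our own reconstruction, not the original project's code:

#include <tiledb/tiledb>

#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split a single comma-delimited row into one string per column.
std::vector<std::string> split_csv_row(const std::string& row) {
  std::vector<std::string> columns;
  std::stringstream ss(row);
  std::string value;
  while (std::getline(ss, value, ','))
    columns.push_back(value);
  return columns;
}

// Read the CSV file at 'input_csv' (any URI TileDB supports) through the
// VFS interface and invoke 'callback' once per data row with its columns.
void parse_csv(
    const tiledb::Context& ctx,
    const std::string& input_csv,
    const std::function<void(const std::vector<std::string>&)>& callback) {
  tiledb::VFS vfs(ctx);
  tiledb::VFS::filebuf filebuf(vfs);
  filebuf.open(input_csv, std::ios::in);
  std::istream is(&filebuf);

  std::string line;
  std::getline(is, line);  // Skip the header row (ZIP,LAT,LNG).
  while (std::getline(is, line)) {
    if (!line.empty())
      callback(split_csv_row(line));
  }
}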
The array schema
Before ingesting data, we need to design an array schema to hold the data. In this case, the geospatial data is two-dimensional and sparse, so we will ingest the data into a 2D sparse array.
When reading from the array, we want to be able to make queries on spatial “slices” of the array, for example to return all ZIP codes within a given latitude/longitude bounding region. Therefore, the dimensions of our array will be latitude and longitude. Each cell in the array will be identified by a unique latitude/longitude coordinate pair.
Each latitude/longitude pair is associated with a single value, a ZIP code.
So, we will add a single zip attribute to the array to store the ZIP code for each point. The zip attribute will be uint32_t in order to store ZIP code values larger than 2^16 - 1 (e.g. 90210, which cannot fit in a uint16_t).
Note
Currently real- and negative-valued dimensions are supported in TileDB, but still considered an "experimental" feature. Therefore, for this tutorial we will perform a conversion step on the fixed-precision coordinate values to convert them losslessly to and from uint32_t values.
Once we have decided on a schema for the array to hold our data, we can write the function to define the array:
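The following is a sketch of what that function might look like. The function name create_array and the dimension names "lat" and "lng" are our own choices; the domain bounds and tile extents follow the discussion below:

#include <limits>

// Create a 2D sparse array with uint32_t "lat"/"lng" dimensions and a
// single uint32_t "zip" attribute.
void create_array(const tiledb::Context& ctx, const std::string& array_uri) {
  // Use (almost) the full uint32_t range for each dimension, leaving a
  // little headroom at the top for TileDB's internal domain expansion.
  const uint32_t domain_max = std::numeric_limits<uint32_t>::max() - 10000;
  tiledb::Domain domain(ctx);
  domain
      .add_dimension(tiledb::Dimension::create<uint32_t>(
          ctx, "lat", {{0, domain_max}}, 10000))
      .add_dimension(tiledb::Dimension::create<uint32_t>(
          ctx, "lng", {{0, domain_max}}, 10000));

  tiledb::ArraySchema schema(ctx, TILEDB_SPARSE);
  schema.set_domain(domain);
  schema.add_attribute(tiledb::Attribute::create<uint32_t>(ctx, "zip"));

  tiledb::Array::create(array_uri, schema);
}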
Note that the dimension type is uint32_t instead of float, as mentioned in the previous section.
We set the domain of each dimension to the entire range of uint32_t values (minus a small amount of space for the "domain expansion" process discussed in previous tutorials) because we do not know ahead of time the range or min/max of coordinate values we will be ingesting from the input file. It is only possible to use such a large domain because this is a sparse array, and "empty space" in the domain (cells that do not contain data) is not allocated or materialized.

We have chosen a space tile extent of 10000x10000. Choosing the space tile size impacts query times for reads, as it changes the minimum bounding rectangles used internally by TileDB to satisfy spatial read queries. See the other tutorials on tiling for more information about this.
Ingesting CSV data
We will write the two simple functions that will losslessly convert the fixed-precision latitude/longitude values in the CSV file to and from uint32_t values suitable for coordinates in the array:
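One possible implementation is shown below. The offset and scale constants are our own assumptions: the CSV stores coordinates with six decimal digits in roughly the range [-180, 180], so shifting by 360 and scaling by 10^6 yields non-negative integers that fit comfortably in a uint32_t and can be converted back without loss:

#include <cmath>
#include <cstdint>

// Shift-and-scale constants chosen for six-decimal-digit coordinates.
const double COORD_OFFSET = 360.0;
const double COORD_SCALE = 1000000.0;

// Convert a fixed-precision latitude/longitude value to a uint32_t coordinate.
uint32_t coord_to_uint32(double coord) {
  return static_cast<uint32_t>(std::llround((coord + COORD_OFFSET) * COORD_SCALE));
}

// Convert a uint32_t coordinate back to the original latitude/longitude value.
double uint32_to_coord(uint32_t value) {
  return (static_cast<double>(value) / COORD_SCALE) - COORD_OFFSET;
}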
All that is left for ingestion is to write the function that uses the parse_csv() function from earlier to parse the column values from a CSV file on S3, transform the coordinates to uint32_t, and issue a write query to TileDB:
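A sketch of such an ingestion function follows; the name ingest_csv is our own, and the column order (ZIP, LAT, LNG) follows the sample data shown earlier:

// Parse 'input_csv', convert each row, and write all cells to the array.
void ingest_csv(
    const tiledb::Context& ctx,
    const std::string& input_csv,
    const std::string& array_uri) {
  std::vector<uint32_t> lats, lngs, zips;

  // Convert each CSV row into uint32_t coordinates and a ZIP attribute value.
  parse_csv(ctx, input_csv, [&](const std::vector<std::string>& cols) {
    zips.push_back(static_cast<uint32_t>(std::stoul(cols[0])));
    lats.push_back(coord_to_uint32(std::stod(cols[1])));
    lngs.push_back(coord_to_uint32(std::stod(cols[2])));
  });

  // Issue a single unordered write with explicit coordinates for every cell.
  tiledb::Array array(ctx, array_uri, TILEDB_WRITE);
  tiledb::Query query(ctx, array, TILEDB_WRITE);
  query.set_layout(TILEDB_UNORDERED)
      .set_data_buffer("lat", lats)
      .set_data_buffer("lng", lngs)
      .set_data_buffer("zip", zips);
  query.submit();
  array.close();
}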
As seen in previous tutorials, we use an "unordered" write, which specifies that we are providing explicit coordinates for every cell being written, in no particular order.
Now modify the main() function to call these functions with command-line arguments that specify the path of the input .csv file and the output TileDB array, and we have a complete ingestion program:
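For example, main() might look like the following; the usage message and the vfs.is_dir() existence check are our own additions:

int main(int argc, char** argv) {
  if (argc < 3) {
    std::cerr << "Usage: " << argv[0] << " <input.csv> <array-uri>" << std::endl;
    return 1;
  }
  std::string input_csv(argv[1]), array_uri(argv[2]);

  tiledb::Context ctx;
  // Create the array only if it does not already exist.
  tiledb::VFS vfs(ctx);
  if (!vfs.is_dir(array_uri))
    create_array(ctx, array_uri);

  ingest_csv(ctx, input_csv, array_uri);
  return 0;
}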
Build and run the program to ingest a .csv file:
$ make
$ export AWS_ACCESS_KEY_ID=...
$ export AWS_SECRET_ACCESS_KEY=...
$ ./ExampleExe s3://my-bucket-name/zip-codes.csv s3://my-bucket-name/zip-codes-array
This will read the file zip-codes.csv from an S3 bucket named my-bucket-name, create a new array named zip-codes-array in the same bucket, and write the ingested data into it.
You can freely mix and match URI schemes, e.g. to ingest a local CSV file zip-codes.csv from the current directory into an S3 bucket:
$ ./ExampleExe zip-codes.csv s3://my-bucket-name/zip-codes-array
Or the reverse (ingesting from S3 to a local array):
$ ./ExampleExe s3://my-bucket-name/zip-codes.csv zip-codes-array
Or entirely locally:
$ ./ExampleExe zip-codes.csv zip-codes-array
Note that there is no code change or rebuild required to mix and match URI schemes.
Slicing data from the array
To complete the tutorial, we will write a simple function that reads a "slice" (rectangular region) of ZIP code data from the TileDB array created by the ingestor, and then writes the resulting data to a new .csv file. We will slice the New England region of the US, between 40–45° N and 70–80° W.
The function slice_and_write() below performs the read query to select the points of interest. Once we have read the coordinate values and corresponding ZIP code data in memory, this function calls a simple helper function write_csv() that writes the resulting data in CSV format to a new file. We also use TileDB's VFS interface here to support writing the results to S3:
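Below is a sketch of both functions. The bounding-box values mirror the region described above, while the fixed buffer sizes are an arbitrary simplification of our own; a robust implementation would size the buffers from result-size estimates and resubmit if the query status reports incomplete results:

// Write (ZIP, LAT, LNG) rows to 'output_csv' through TileDB's VFS, so the
// output file can live on S3 as well as on local disk.
void write_csv(
    const tiledb::Context& ctx,
    const std::string& output_csv,
    const std::vector<uint32_t>& zips,
    const std::vector<uint32_t>& lats,
    const std::vector<uint32_t>& lngs) {
  tiledb::VFS vfs(ctx);
  tiledb::VFS::filebuf filebuf(vfs);
  filebuf.open(output_csv, std::ios::out);
  std::ostream os(&filebuf);
  os << "ZIP,LAT,LNG\n";
  for (size_t i = 0; i < zips.size(); i++) {
    os << zips[i] << "," << uint32_to_coord(lats[i]) << ","
       << uint32_to_coord(lngs[i]) << "\n";
  }
}

// Read all cells in the 40-45 N, 70-80 W bounding region and write them out.
void slice_and_write(
    const tiledb::Context& ctx,
    const std::string& array_uri,
    const std::string& output_csv) {
  tiledb::Array array(ctx, array_uri, TILEDB_READ);

  // Express the latitude/longitude bounding box in the array's uint32_t domain.
  tiledb::Subarray subarray(ctx, array);
  subarray.add_range(0, coord_to_uint32(40.0), coord_to_uint32(45.0))
      .add_range(1, coord_to_uint32(-80.0), coord_to_uint32(-70.0));

  // Fixed-size result buffers keep the example short.
  std::vector<uint32_t> zips(500000), lats(500000), lngs(500000);

  tiledb::Query query(ctx, array, TILEDB_READ);
  query.set_subarray(subarray)
      .set_layout(TILEDB_ROW_MAJOR)
      .set_data_buffer("zip", zips)
      .set_data_buffer("lat", lats)
      .set_data_buffer("lng", lngs);
  query.submit();

  // Shrink the buffers to the number of results actually returned.
  auto num_results = query.result_buffer_elements()["zip"].second;
  zips.resize(num_results);
  lats.resize(num_results);
  lngs.resize(num_results);
  array.close();

  write_csv(ctx, output_csv, zips, lats, lngs);
}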
Modify the main() function to take a third argument for the name of the output CSV file to create, and invoke the slice_and_write() function:
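For instance, again with our own minimal argument handling:

int main(int argc, char** argv) {
  if (argc < 4) {
    std::cerr << "Usage: " << argv[0]
              << " <input.csv> <array-uri> <output.csv>" << std::endl;
    return 1;
  }
  std::string input_csv(argv[1]), array_uri(argv[2]), output_csv(argv[3]);

  tiledb::Context ctx;
  tiledb::VFS vfs(ctx);
  if (!vfs.is_dir(array_uri))
    create_array(ctx, array_uri);

  ingest_csv(ctx, input_csv, array_uri);
  slice_and_write(ctx, array_uri, output_csv);
  return 0;
}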
Now build and run the example, removing the ingested array from previous steps (if it exists). Here we ingest a local CSV file to a local array, and then write the slice results to a new file on S3:
$ make
$ rm -r my_array_name
$ ./ExampleExe zip-codes.csv my_array_name s3://my-bucket-name/output.csv
Just as before, the URI schemes can be freely mixed and matched.
Full code
If you would like to try this on your own, here is the source code for the project.