Variable-length Attributes
In this tutorial we will learn how to use variable-length attributes. It is recommended to read the tutorial on dense arrays first.
Program: variable_length
Basic concepts and definitions
Creating the array
This is similar to what we covered in the simple dense array example. The only
difference is that our two attributes are variable-length. Specifically, a1
will store strings, whereas a2 will store a variable-length list of integers.
For Python, variable-length attributes are created by passing var=True to the
Attr constructor.
Writing to the array
Since each cell may now store a variable number of values, when writing, you must somehow tell TileDB which values belong to which cell. In TileDB, you do this by providing two input buffers (instead of one as in the fixed-length attributes); one buffer holds the actual cell data, whereas another stores the starting offsets (in bytes) of each variable-length cell.
Suppose we wish to populate the array with cell values on a1 and a2 as shown
in the figure below.
Let’s look at the example program and focus on a1 first. We create a buffer
a1_data which stores the cell values in row-major order, e.g., cell (1,1)
stores a, (1,2) stores bb, etc. If a cell stores more than one value, we simply
place its values contiguously in the buffer, such as bb for cell (1,2). The
problem is that, by looking at a1_data alone, TileDB has no way of discerning
where one cell value ends and the next begins. Therefore, we construct an extra
buffer a1_off which stores the offset at which the first value of each cell
starts in a1_data, e.g., a1_off[0] = 0 means that a starts at offset 0 in
a1_data (in bytes), a1_off[1] = 1 means bb starts at offset 1, a1_off[2] = 3
means ccc starts at offset 3, etc.
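The data/offsets layout described above can be sketched in plain Python, without the TileDB API. This is only an illustration of how the two buffers relate: the helper name pack_var_cells is hypothetical, the a2 values are borrowed from the read result later in this tutorial, and the 4-byte little-endian integer encoding is an assumption inferred from the byte offsets quoted there.

```python
import struct

def pack_var_cells(cells, encode):
    """Concatenate the encoded cells and record each cell's starting byte offset.

    Hypothetical helper for illustration; it is not part of the TileDB API.
    """
    data = bytearray()
    offsets = []
    for cell in cells:
        offsets.append(len(data))  # this cell starts where the data currently ends
        data += encode(cell)
    return bytes(data), offsets

# a1: the first three string cells named in the text, in row-major order.
a1_data, a1_off = pack_var_cells(["a", "bb", "ccc"], lambda s: s.encode("ascii"))

# a2: lists of integers, assumed here to be 4-byte little-endian values.
a2_data, a2_off = pack_var_cells(
    [[2, 2], [3], [4]],
    lambda values: struct.pack(f"<{len(values)}i", *values),
)

print(a1_data, a1_off)       # b'abbccc' [0, 1, 3]
print(len(a2_data), a2_off)  # 16 [0, 8, 12]
```

Note that the offsets are always expressed in bytes, so for a2 they advance by 4 per integer rather than by 1 per element.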
Reading from the array
We focus on subarray [1,2], [2,4]. Recall that, in order to read from a TileDB
array with C++, we must allocate space for the buffers that will hold the
result; the Python API allocates space automatically.
For the variable-length case, this is a challenging task, since we do not know how many values each cell may be storing. Fortunately, TileDB has an auxiliary function that gives you an upper bound on how many elements your buffers need to store the results (note that this is an approximation). You can prepare the buffers as follows. Once again, we need two buffers for each attribute, one for the data and one for the offsets.
Next, we perform the query as usual, but now we set both the data and offset
buffers. After completion, a1_data and a2_data will hold the result cell
values, whereas a1_off and a2_off will store the starting offsets (in bytes)
of the cell values in a1_data and a2_data, respectively. More specifically,
a1_data will contain bbcccddfghhh, a1_off will contain 0, 2, 5, 7, 8, 9,
a2_data will contain 2, 2, 3, 4, 6, 6, 7, 7, 8, 8, 8 and a2_off will contain
0, 8, 12, 16, 24, 32 (see figure above).
Warning
For variable-length attributes, you should always use the auxiliary
max_buffer_elements function to calculate the buffer sizes that will hold the
result, even if you know the result size a priori. This is because TileDB may
overestimate the buffer sizes needed; if your buffers are smaller than that
estimate, it may process only part of the query upon query.submit(), yielding
an incomplete status (checked with query.query_status()). For more information
about incomplete queries, see Incomplete queries. Allocating buffers using the
sizes output by max_buffer_elements guarantees that the query will be completed
and the whole result will be returned.
Perhaps the most cumbersome task is parsing the cell values given the data and
offset buffers. Here is what we do for the strings of a1. We first calculate
the string sizes using the offsets buffer. Then, we create a vector of strings
(one per result cell), which makes the result easy to print later.
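As a rough illustration of this step, the following plain-Python sketch splits the a1 result buffers quoted above into one string per cell. The function name split_strings is hypothetical, and len(data) stands in for the number of bytes the query actually returned, which a real program would have to obtain from the query result.

```python
# Result buffers for a1, with the contents listed in this tutorial.
a1_data = b"bbcccddfghhh"
a1_off = [0, 2, 5, 7, 8, 9]

def split_strings(data, offsets):
    """Cut a data buffer into one string per cell using starting byte offsets."""
    # Each cell ends where the next one starts; the last cell ends with the data.
    ends = offsets[1:] + [len(data)]
    return [data[start:end].decode("ascii") for start, end in zip(offsets, ends)]

a1_cells = split_strings(a1_data, a1_off)
print(a1_cells)  # ['bb', 'ccc', 'dd', 'f', 'g', 'hhh']
```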
For the integers of a2, we first calculate the element offsets from the byte
offsets in a2_off, and then we calculate the number of elements per result
cell. Once again, this will simplify printing the result.
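The same idea for a2 can be sketched as follows, again in plain Python rather than with the TileDB API. The 4-byte little-endian integer size is an assumption that matches the byte offsets quoted above, and split_ints is a hypothetical helper name.

```python
import struct

INT_SIZE = 4  # assumption: 4-byte integers, consistent with offsets 0, 8, 12, ...

# Result buffers for a2, with the contents listed in this tutorial.
a2_values = [2, 2, 3, 4, 6, 6, 7, 7, 8, 8, 8]
a2_data = struct.pack(f"<{len(a2_values)}i", *a2_values)
a2_off = [0, 8, 12, 16, 24, 32]

def split_ints(data, byte_offsets, item_size=INT_SIZE):
    """Group a flat integer buffer into one list per cell."""
    # Step 1: convert byte offsets into element offsets.
    elem_off = [off // item_size for off in byte_offsets]
    total = len(data) // item_size
    # Step 2: each cell's element count is the gap to the next cell's offset.
    counts = [end - start for start, end in zip(elem_off, elem_off[1:] + [total])]
    values = struct.unpack(f"<{total}i", data)
    return [list(values[start:start + n]) for start, n in zip(elem_off, counts)]

a2_cells = split_ints(a2_data, a2_off)
print(a2_cells)  # [[2, 2], [3], [4], [6, 6], [7, 7], [8, 8, 8]]
```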
Finally, we print the result as follows.
If you compile and run the example of this tutorial as shown below, you should see the following output:
$ g++ -std=c++11 variable_length.cc -o variable_length -ltiledb
$ ./variable_length
a1: bb, a2: 2 2
a1: ccc, a2: 3
a1: dd, a2: 4
a1: f, a2: 6 6
a1: g, a2: 7 7
a1: hhh, a2: 8 8 8
On-disk structure
Let us look at the contents of the array of this example on disk.
$ ls -l variable_length_array/
total 8
drwx------ 7 stavros staff 224 Jun 25 15:38 __1561491531226_1561491531226_3e56db7d25a447708a73d3e578622ab4
-rwx------ 1 stavros staff 155 Jun 25 15:38 __array_schema.tdb
-rwx------ 1 stavros staff 0 Jun 25 15:38 __lock.tdb
drwx------ 2 stavros staff 64 Jun 25 15:38 __meta
$ ls -l variable_length_array/__1561491531226_1561491531226_3e56db7d25a447708a73d3e578622ab4/
total 40
-rwx------ 1 stavros staff 945 Jun 25 15:38 __fragment_metadata.tdb
-rwx------ 1 stavros staff 100 Jun 25 15:38 a1.tdb
-rwx------ 1 stavros staff 48 Jun 25 15:38 a1_var.tdb
-rwx------ 1 stavros staff 100 Jun 25 15:38 a2.tdb
-rwx------ 1 stavros staff 124 Jun 25 15:38 a2_var.tdb
Observe that, contrary to the case of fixed-length attributes, TileDB stores two
files for each variable-length attribute. Specifically, a1_var.tdb and
a2_var.tdb store the actual cell values (which are of variable length), whereas
a1.tdb and a2.tdb store the corresponding starting offsets (in bytes). In other
words, TileDB adopts a “columnar” format by splitting the values from the
offsets. The reason behind this choice is better compressibility (later
tutorials explain this in more detail).