Querying a SOMA experiment

Overview

In this notebook, we’ll take a quick look at how to query a SOMAExperiment using the SOMAExperimentAxisQuery class. This allows for easy selection of data from a SOMAMeasurement by filtering on annotations stored in each axis data frame (i.e., obs and var).

library(tiledbsoma)
#> TileDB Core version '2.26' used by TileDB-R package, but TileDB-SOMA uses '2.27'

Example data

Load the bundled SOMAExperiment containing a subsetted version of the 10X genomics PBMC dataset provided by SeuratObject. This will return a SOMAExperiment object.

experiment <- load_dataset("soma-exp-pbmc-small")
experiment
#> <SOMAExperiment>
#>   uri: /tmp/RtmpyOFOsz/soma-exp-pbmc-small

Querying basics

To perform a query we’ll need to initialize a new SOMAExperimentAxisQuery object, specifying the SOMAExperiment and the SOMAMeasurement within the experiment we want to query.

We can see that our current experiment contains only a single measurement: "RNA".

experiment$ms
#> <SOMACollection>
#>   uri: file:///tmp/RtmpyOFOsz/soma-exp-pbmc-small/ms

To use larger (or smaller) buffer sizes:

ctx <- SOMATileDBContext$new(c(soma.init_buffer_bytes = as.character(2 * 1024**3)))
experiment <- SOMAExperimentOpen(experiment$uri, tiledbsoma_ctx = ctx)

Alternatively, you can have in your environment export TILEDB_SOMA_INIT_BUFFER_BYTES=2147483648 before loading the data.

Now we can construct our query object.

query <- SOMAExperimentAxisQuery$new(
  experiment = experiment,
  measurement_name = "RNA"
)

Once it’s created, we can use the query object to inspect, select, and extract filtered data from the experiment.

For example, we can use n_obs and n_vars to determine the number of observations and variables that passed our filtering criteria. Since we didn’t specify any filtering criteria, these numbers will match the full size of the experiment.

Number of observations:

query$n_obs
#> [1] 80

Number of variables:

query$n_vars
#> [1] 230

We can also extract any data component from the experiment. Here we’ll read in the obs data frame from the query using obs() which returns an iterator of arrow::Table. The iterator is useful when the data is too large to load in memory allowing to stream the data in chunks. This applies to var() as well.

To load the data in memory you can concatenate all chunks of the iterator as shown below.

iterator <- query$obs()
obs <- iterator$concat()
obs
#> Table
#> 80 rows x 9 columns
#> $soma_joinid <int64 not null>
#> $orig.ident <dictionary<values=string, indices=int8>>
#> $nCount_RNA <double>
#> $nFeature_RNA <int32>
#> $RNA_snn_res.0.8 <dictionary<values=string, indices=int8>>
#> $letter.idents <dictionary<values=string, indices=int8>>
#> $groups <large_string>
#> $RNA_snn_res.1 <dictionary<values=string, indices=int8>>
#> $obs_id <large_string>

As a reminder arrow:Table can be easily cast into a tibble

obs$to_data_frame()
#> # A tibble: 80 × 9
#>    soma_joinid orig.ident  nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents
#>          <int> <fct>            <dbl>        <int> <fct>           <fct>        
#>  1           0 SeuratProj…         70           47 0               A            
#>  2           1 SeuratProj…         85           52 0               A            
#>  3           2 SeuratProj…         87           50 1               B            
#>  4           3 SeuratProj…        127           56 0               A            
#>  5           4 SeuratProj…        173           53 0               A            
#>  6           5 SeuratProj…         70           48 0               A            
#>  7           6 SeuratProj…         64           36 0               A            
#>  8           7 SeuratProj…         72           45 0               A            
#>  9           8 SeuratProj…         52           36 0               A            
#> 10           9 SeuratProj…        100           41 0               A            
#> # ℹ 70 more rows
#> # ℹ 3 more variables: groups <chr>, RNA_snn_res.1 <fct>, obs_id <chr>

Alternatively, you can use the iterator, which retrieves data in chunks that are smaller than the soma.init_buffer_bytes context field. You can use the iterator’s method $read_next() to load a chunk in memory.

iterator <- query$obs()
iterator$read_next()
#> Table
#> 80 rows x 9 columns
#> $soma_joinid <int64 not null>
#> $orig.ident <dictionary<values=string, indices=int8>>
#> $nCount_RNA <double>
#> $nFeature_RNA <int32>
#> $RNA_snn_res.0.8 <dictionary<values=string, indices=int8>>
#> $letter.idents <dictionary<values=string, indices=int8>>
#> $groups <large_string>
#> $RNA_snn_res.1 <dictionary<values=string, indices=int8>>
#> $obs_id <large_string>

In this example the full obs table is relatively small and fits all in one chunk.

For a bigger SOMADataFrame you can check if the iteration has finished by checking the logical $read_complete().

Here we demonstrate by creating a new iterator.

iterator <- experiment$obs$read()
iterator$read_complete()
#> [1] FALSE
iterator$read_next()
#> Table
#> 80 rows x 9 columns
#> $soma_joinid <int64 not null>
#> $orig.ident <dictionary<values=string, indices=int8>>
#> $nCount_RNA <double>
#> $nFeature_RNA <int32>
#> $RNA_snn_res.0.8 <dictionary<values=string, indices=int8>>
#> $letter.idents <dictionary<values=string, indices=int8>>
#> $groups <large_string>
#> $RNA_snn_res.1 <dictionary<values=string, indices=int8>>
#> $obs_id <large_string>
iterator$read_complete()
#> [1] TRUE

We can also access the expression via X().

Similarly to obs() and var(), X() is intended for iteration, but in this case we have access to two different iterators, and thus X() returns a reader that gives you access to an iterator for arrow::Table and one for Matrix::sparse_matrix.

Let’s take a look at the Arrow Table iterator:

reader <- query$X(layer_name = "counts")
table_irerator <- reader$tables()
table_irerator$read_next()
#> Table
#> 4456 rows x 3 columns
#> $soma_dim_0 <int64 not null>
#> $soma_dim_1 <int64 not null>
#> $soma_data <double not null>

As in the obs example the data is small enough to fit in one chunk. For bigger data you can user iterator$read_complete() to check the status of iteration and iterator$concat() to concatenate the rest of the chunks.

The iterator for Matrix::sparse_matrix works in the same way. Keep in mind that the matrix format is dgTMatrix as it is the most memory-efficient and the only format type that can be easily iterated. And most importantly, the resulting object is a “view” of the full matrix with the original shape and indexes but only with data corresponding to the query coordinates or filters (see section below).

reader <- query$X(layer_name = "counts")
iterator <- reader$sparse_matrix()
str(iterator$read_next())
#> Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
#>   ..@ i       : int [1:4456] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..@ j       : int [1:4456] 1 5 8 11 22 30 33 34 36 38 ...
#>   ..@ Dim     : int [1:2] 80 230
#>   ..@ Dimnames:List of 2
#>   .. ..$ : NULL
#>   .. ..$ : NULL
#>   ..@ x       : num [1:4456] 1 1 3 1 1 4 1 5 1 1 ...
#>   ..@ factors : list()

Adding filters

Adding filters requires creating a SOMAAxisQuery object that allows you to define coordinates, value filters, or both for an axis.

Here we’ll create a query for obs that slices the first 40 rows, and then filters that subset based on the nCount_RNA column.

obs_query <- SOMAAxisQuery$new(
  coords = list(soma_joinid = 0:39),
  value_filter = "nCount_RNA > 100"
)

To apply this filter we’ll pass it to a new SOMAExperimentAxisQuery object.

query <- SOMAExperimentAxisQuery$new(
  experiment = experiment,
  measurement_name = "RNA",
  obs_query = obs_query
)

Let’s see how many observations this query identified.

query$n_obs
#> [1] 26

As before, we can load the obs data frame into memory but now it only includes the filtered observations.

obs <- query$obs(column_names = c("obs_id", "nCount_RNA"))$concat()
obs$to_data_frame()
#> # A tibble: 26 × 2
#>    obs_id         nCount_RNA
#>    <chr>               <dbl>
#>  1 TGACTGGATTCTCA        127
#>  2 AGTCAGACTGCACA        173
#>  3 AGAGATGATCTCGC        191
#>  4 GGGTAACTCTAGTG        101
#>  5 CTAAACCTGTGCAT        168
#>  6 TTGGTACTGAATCC        135
#>  7 TACATCACGCTAAC        109
#>  8 TTACCATGAATCGC        298
#>  9 ATAGGAGAAACAGA        406
#> 10 GCGCACGACTTTAC        213
#> # ℹ 16 more rows

As well as the X matrix in two different formats:

arrow::Table

query$X("counts")$tables()$concat()
#> Table
#> 1485 rows x 3 columns
#> $soma_dim_0 <int64 not null>
#> $soma_dim_1 <int64 not null>
#> $soma_data <double not null>

Matrix::sparse_matrix in dgTMatrix format.

str(query$X("counts")$sparse_matrix()$concat())
#> Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
#>   ..@ i       : int [1:1485] 3 3 3 3 3 3 3 3 3 3 ...
#>   ..@ j       : int [1:1485] 8 11 22 30 31 32 33 34 37 38 ...
#>   ..@ Dim     : int [1:2] 80 230
#>   ..@ Dimnames:List of 2
#>   .. ..$ : NULL
#>   .. ..$ : NULL
#>   ..@ x       : num [1:1485] 13 1 6 5 2 1 2 2 1 1 ...
#>   ..@ factors : list()

For a re-indexed and re-shaped sparse matrix see the section below.

Export to an annotated re-indexed sparse matrix

Any component of the queried SOMAExperiment can be exported to a [sparse matrix][Matrix::sparseMatrix-class] using the to_sparse_matrix() method.

For example, let’s create a sparse matrix of the filtered expression data. We’ll create a new query that returns a smaller subset of the data to make the output easier to read.

query <- SOMAExperimentAxisQuery$new(
  experiment = experiment,
  measurement_name = "RNA",
  obs_query = SOMAAxisQuery$new(coords = 0:9),
  var_query = SOMAAxisQuery$new(coords = 0:9)
)

Then we indicate that we want to access the "counts" layer of the "X" collection.

query$to_sparse_matrix(
  collection = "X",
  layer = "counts"
)
#> 10 x 10 sparse Matrix of class "dgTMatrix"
#>   [[ suppressing 10 column names '0', '1', '2' ... ]]
#>                       
#> 0 . 1 . . . 1 . .  3 .
#> 1 . . . 1 . . . .  7 .
#> 2 . . . . . . . . 11 .
#> 3 . . . . . . . . 13 .
#> 4 . . . 1 . . . .  3 .
#> 5 . . . 1 . . . .  4 .
#> 6 . . . . . . . .  6 .
#> 7 . . . 1 . . . .  4 .
#> 8 . . . . . . . .  2 .
#> 9 . 1 . . . . . . 21 .

By default, the dimensions are named using soma_joinid values which are unique to each observation and variable. However, dimension names can come from any column in the obs and var arrays that uniquely identifies each record. For an expression matrix it makes sense to name the dimensions using cell barcodes and gene names, which are stored in the obs_id and var_id columns, respectively.

query$to_sparse_matrix(
  collection = "X",
  layer = "counts",
  obs_index = "obs_id",
  var_index = "var_id"
)
#> 10 x 10 sparse Matrix of class "dgTMatrix"
#>   [[ suppressing 10 column names 'MS4A1', 'CD79B', 'CD79A' ... ]]
#>                                    
#> ATGCCAGAACGACT . 1 . . . 1 . .  3 .
#> CATGGCCTGTGCAT . . . 1 . . . .  7 .
#> GAACCTGATGAACC . . . . . . . . 11 .
#> TGACTGGATTCTCA . . . . . . . . 13 .
#> AGTCAGACTGCACA . . . 1 . . . .  3 .
#> TCTGATACACGTGT . . . 1 . . . .  4 .
#> TGGTATCTAAACAG . . . . . . . .  6 .
#> GCAGCTCTGTTTCT . . . 1 . . . .  4 .
#> GATATAACACGCAT . . . . . . . .  2 .
#> AATGTTGACAGTCA . 1 . . . . . . 21 .

We can use this method for any of the SOMAExperiment’s collections. Let’s access the t-SNE coordinates stored in the obsm collection’s X_tsne layer, populating the row names with cell barcodes.

query$to_sparse_matrix(
  collection = "obsm",
  layer = "X_tsne",
  obs_index = "obs_id"
)
#> 10 x 2 sparse Matrix of class "dgTMatrix"
#>                          0           1
#> ATGCCAGAACGACT   0.8675977  -8.1007483
#> CATGGCCTGTGCAT  -7.3925306  -8.7717451
#> GAACCTGATGAACC -28.2064258   0.2410102
#> TGACTGGATTCTCA  16.3480689 -11.1633255
#> AGTCAGACTGCACA   1.9113998 -11.1929311
#> TCTGATACACGTGT   3.1475998  -9.9369312
#> TGGTATCTAAACAG  17.8526863  -9.8978901
#> GCAGCTCTGTTTCT  -6.4912187  -8.3874434
#> GATATAACACGCAT   1.3305297  -9.6807936
#> AATGTTGACAGTCA  16.9642732  -9.4295980

Export to Seurat

The query object also contains methods for loading in results as a Seurat object (or any of Seurat’s component classes). As with the to_sparse_matrix() method, we can specify the obs_index and var_index to use for naming the dimensions of the resulting object.

query <- SOMAExperimentAxisQuery$new(
  experiment = experiment,
  measurement_name = "RNA"
)

query$to_seurat(
  X_layers = c(counts = "counts", data = "data"),
  obs_index = "obs_id",
  var_index = "var_id"
)
#> Warning: Adding a command log without an assay associated with it
#> An object of class Seurat 
#> 230 features across 80 samples within 1 assay 
#> Active assay: RNA (230 features, 0 variable features)
#>  2 layers present: counts, data
#>  2 dimensional reductions calculated: pca, tsne