The S4Vectors package provides a framework for representing vector-like and list-like objects as S4 objects. It defines two central virtual classes, Vector and List, and a set of generic functions that extend the semantic of ordinary vectors and lists in R. Package developers can easily implement vector-like or list-like objects as Vector and/or List derivatives. A few low-level Vector and List derivatives are implemented in the S4Vectors package itself e.g. Hits, Rle, and DataFrame). Many more are implemented in the IRanges and GenomicRanges infrastructure packages, and in many other Bioconductor packages.
In this vignette, we will rely on simple, illustrative example datasets, rather than large, real-world data, so that each data structure and algorithm can be explained in an intuitive, graphical manner. We expect that packages that apply S4Vectors to a particular problem domain will provide vignettes with relevant, realistic examples.
The S4Vectors
package is available at bioconductor.org and can be downloaded via
BiocManager::install
:
In the context of the S4Vectors
package, a vector-like object is an ordered finite collection of
elements. All vector-like objects have three main properties: (1) a
notion of length or number of elements, (2) the ability to extract
elements to create new vector-like objects, and (3) the ability to be
concatenated with one or more vector-like objects to form larger
vector-like objects. The main functions for these three operations are
length
, [
, and c
. Supporting
these operations provide a great deal of power and many vector-like
object manipulations can be constructed using them.
Some vector-like objects can also have a list-like semantic, which
means that individual elements can be extracted with
[[
.
In S4Vectors and many other Bioconductor packages, vector-like and list-like objects derive from the Vector and List virtual classes, respectively. Note that List is a subclass of Vector. The following subsections describe each in turn.
As a first example of vector-like objects, we’ll look at Rle objects. In R, atomic sequences are typically stored in atomic vectors. But there are times when these object become too large to manage in memory. When there are lots of consecutive repeats in the sequence, the data can be compressed and managed in memory through a run-length encoding where a data value is paired with a run length. For example, the sequence {1, 1, 1, 2, 3, 3} can be represented as values = {1, 2, 3}, run lengths = {3, 1, 2}.
The Rle class defined in the S4Vectors package is used to represent a run-length encoded (compressed) sequence of logical, integer, numeric, complex, character, raw, or factor values. Note that the Rle class extends the Vector virtual class:
## Class "Rle" [package "S4Vectors"]
##
## Slots:
##
## Name: values lengths elementMetadata metadata
## Class: vector_OR_factor integer_OR_LLint DataFrame_OR_NULL list
##
## Extends:
## Class "Vector", directly
## Class "Annotated", by class "Vector", distance 2
## Class "vector_OR_Vector", by class "Vector", distance 2
One way to construct Rle objects is through the Rle constructor function:
set.seed(0)
lambda <- c(rep(0.001, 4500), seq(0.001, 10, length=500),
seq(10, 0.001, length=500))
xVector <- rpois(1e7, lambda)
yVector <- rpois(1e7, lambda[c(251:length(lambda), 1:250)])
xRle <- Rle(xVector)
yRle <- Rle(yVector)
Rle objects are vector-like objects:
## [1] 10000000
## integer-Rle of length 1 with 1 run
## Lengths: 1
## Values : 0
As with ordinary R atomic vectors, it is often necessary to
subset one sequence from another. When this subsetting does not
duplicate or reorder the elements being extracted, the result is called
a subsequence. In general, the [
function can be
used to construct a new sequence or extract a subsequence, but its
interface is often inconvenient and not amenable to optimization. To
compensate for this, the S4Vectors
package supports seven additional functions for sequence extraction:
1.window
- Extracts a subsequence over a specified
region.
2.subset
- Extracts the subsequence specified by a
logical vector.
3.head
- Extracts a consecutive subsequence containing
the first n elements.
4.tail
- Extracts a consecutive subsequence containing
the last n elements.
5.rev
- Creates a new sequence with the elements in the
reverse order.
6.rep
- Creates a new sequence by repeating sequence
elements.
The following code illustrates how these functions are used on an Rle vector:
## integer-Rle of length 10 with 9 runs
## Lengths: 1 1 1 1 1 1 1 1 2
## Values : 4 6 5 4 6 2 6 7 5
## integer-Rle of length 6 with 6 runs
## Lengths: 1 1 1 1 1 1
## Values : 4 6 5 4 6 2
## integer-Rle of length 6 with 5 runs
## Lengths: 1 1 1 1 2
## Values : 6 2 6 7 5
## integer-Rle of length 10 with 9 runs
## Lengths: 2 1 1 1 1 1 1 1 1
## Values : 5 7 6 2 6 4 5 6 4
## integer-Rle of length 20 with 18 runs
## Lengths: 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2
## Values : 4 6 5 4 6 2 6 7 5 4 6 5 4 6 2 6 7 5
## integer-Rle of length 7 with 5 runs
## Lengths: 1 1 2 1 2
## Values : 6 5 6 7 5
The S4Vectors
package uses two generic functions, c
and
append
, for concatenating two Vector derivatives.
The methods for Vector objects follow the definition that these
two functions are given the base package.
## integer-Rle of length 20 with 17 runs
## Lengths: 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1
## Values : 4 6 5 4 6 2 6 7 5 7 6 2 6 4 5 6 4
## integer-Rle of length 20 with 18 runs
## Lengths: 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2
## Values : 4 6 5 4 6 5 4 6 2 6 7 5 4 6 2 6 7 5
In R, for
looping can be an expensive
operation. To compensate for this, the S4Vectors
package provides aggregate
and shiftApply
methods (shiftApply
is a new generic function defined in
S4Vectors
to perform calculations over subsequences of vector-like objects.
The aggregate
function combines sequence extraction
functionality of the window
function with looping
capabilities of the sapply
function. For example, here is
some code to compute medians across a moving window of width 3 using the
function aggregate
:
## integer-Rle of length 10 with 9 runs
## Lengths: 1 1 1 1 1 1 1 1 2
## Values : 4 6 5 4 6 2 6 7 5
## [1] 5 5 5 4 6 6 6 5
The shiftApply
function is a looping operation involving
two vector-like objects whose elements are lined up via a positional
shift operation. For example, the elements of xRle
and
yRle
were simulated from Poisson distributions with the
mean of element i from yRle
being equivalent to the mean of
element i + 250 from xRle
. If we did not know the size of
the shift, we could estimate it by finding the shift that maximizes the
correlation between xRle
and yRle
.
## [1] 0.5739224
The result is shown in Fig.@ref(fig:figshiftcorrs)
When there are lots of consecutive repeats, the memory savings
through an RLE can be quite dramatic. For example, the xRle
object occupies less than one third of the space of the original
xVector
object, while storing the same information:
## [1] 0.3020726
## [1] TRUE
The functions runValue
and runLength
extract the run values and run lengths from an Rle object
respectively:
## [1] 0 1 0 1 0 1
## [1] 780 1 208 1 1599 1
The Rle class supports many of the basic methods associated with R atomic vectors including the Ops, Math, Math2, Summary, and Complex group generics. Here is a example of manipulating Rle objects using methods from the Ops group:
## logical-Rle of length 10000000 with 197127 runs
## Lengths: 780 1 208 1 1599 ... 5 1 91 1 927
## Values : FALSE TRUE FALSE TRUE FALSE ... FALSE TRUE FALSE TRUE FALSE
## integer-Rle of length 10000000 with 1957707 runs
## Lengths: 780 1 208 1 13 1 413 1 ... 1 5 1 91 1 507 1 419
## Values : 0 1 0 1 0 1 0 1 ... 2 0 1 0 1 0 1 0
## logical-Rle of length 10000000 with 210711 runs
## Lengths: 780 1 208 1 13 ... 91 1 507 1 419
## Values : FALSE TRUE FALSE TRUE FALSE ... FALSE TRUE FALSE TRUE FALSE
Here are some from the Summary group:
## [1] 0 26
## [1] 2105185
And here is one from the Math group:
## numeric-Rle of length 10000000 with 1510219 runs
## Lengths: 780 1 208 1 ... 91 1 927
## Values : 0.000000 0.693147 0.000000 0.693147 ... 0.000000 0.693147 0.000000
As with atomic vectors, the cor
and
shiftApply
functions operate on Rle objects:
## [1] 0.5739224
## [1] 0.8519138 0.8517324 0.8517725
For more information on the methods supported by the Rle
class, consult the Rle
man page.
Just as with ordinary R List objects,
List-derived objects support [[
for element
extraction, c
for concatenating, and
lapply
/sapply
for looping. lapply
and sapply
are familiar to many R users since they
are the standard functions for looping over the elements of an
R list object.
In addition, the S4Vectors
package introduces the endoapply
function to perform an
endomorphism equivalent to lapply
, i.e. it returns a
List derivative of the same class as the input rather than a
list object.
An example of List derivative is the DataFrame class:
## Virtual Class "DataFrame" [package "S4Vectors"]
##
## Slots:
##
## Name: elementType elementMetadata metadata
## Class: character DataFrame_OR_NULL list
##
## Extends:
## Class "RectangularData", directly
## Class "List", directly
## Class "DataFrame_OR_NULL", directly
## Class "Vector", by class "List", distance 2
## Class "list_OR_List", by class "List", distance 2
## Class "Annotated", by class "List", distance 3
## Class "vector_OR_Vector", by class "List", distance 3
##
## Known Subclasses: "DFrame"
One way to construct DataFrame objects is through the DataFrame constructor function:
## x y
## "Rle" "Rle"
## x y
## Min. 0.0000000 0.0000000
## 1st Qu. 0.0000000 0.0000000
## Median 0.0000000 0.0000000
## Mean 0.9090338 0.9096009
## 3rd Qu. 0.0000000 0.0000000
## Max. 26.0000000 27.0000000
## x y
## Min. 0.0000000 0.0000000
## 1st Qu. 0.0000000 0.0000000
## Median 0.0000000 0.0000000
## Mean 0.9090338 0.9096009
## 3rd Qu. 0.0000000 0.0000000
## Max. 26.0000000 27.0000000
## DataFrame with 10000000 rows and 2 columns
## x y
## <Rle> <Rle>
## 1 0.5 0.5
## 2 0.5 0.5
## 3 0.5 0.5
## 4 0.5 0.5
## 5 0.5 0.5
## ... ... ...
## 9999996 0.5 0.5
## 9999997 0.5 0.5
## 9999998 0.5 0.5
## 9999999 0.5 0.5
## 10000000 0.5 0.5
For more information on DataFrame objects, consult the DataFrame man page.
See the “An Overview of the IRanges package’’ vignette in the IRanges package for many more examples of List derivatives.
Often when one has a collection of objects, there is a need to attach metadata that describes the collection in some way. Two kinds of metadata can be attached to a Vector object:
metadata
accessor and is represented as an ordinary
list;mcols
accessor (mcols
stands for metadata columns) and is represented as a
DataFrame object.This DataFrame object can be thought
of as the result of binding together one or several vector-like objects
(the metadata columns) of the same length as the Vector object.
Each row of the DataFrame object annotates the corresponding
element of the Vector object.Here is the output of sessionInfo()
on the system on
which this document was compiled:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] S4Vectors_0.45.0 BiocGenerics_0.53.0 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0
## [4] xfun_0.48 maketools_1.3.1 cachem_1.1.0
## [7] knitr_1.48 htmltools_0.5.8.1 rmarkdown_2.28
## [10] buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.3
## [13] IRanges_2.39.2 sass_0.4.9 jquerylib_0.1.4
## [16] compiler_4.4.1 highr_0.11 sys_3.4.3
## [19] tools_4.4.1 evaluate_1.0.1 bslib_0.8.0
## [22] yaml_2.3.10 BiocManager_1.30.25 jsonlite_1.8.9
## [25] rlang_1.1.4