The OutSeekR package implements a statistical approach to detecting outliers in RNA-seq and related data.
An analysis using OutSeekR starts with a matrix or data frame of normalized RNA-seq data, e.g., fragments per kilobase of transcript per million fragments mapped (FPKM), or similar data. In the case of RNA-seq data, we require normalized values rather than the counts themselves because the statistics we calculate assume continuous data. For simplicity, we refer to the rows of the input data set as ‘transcripts’ and the columns as ‘samples’. Transcript identifiers should be stored as the row names of the matrix or data frame, and sample identifiers as the column names.
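For illustration only, a suitable input object might be constructed as follows; the object name fpkm, the dimensions, the values, and the identifiers are all placeholders, not requirements of the package:

# Build an illustrative matrix of normalized values with transcript identifiers
# as row names and sample identifiers as column names.
fpkm <- matrix(rexp(20 * 10, rate = 0.5), nrow = 20, ncol = 10);
rownames(fpkm) <- sprintf('T%03d', seq_len(20));
colnames(fpkm) <- sprintf('S%02d', seq_len(10));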
The statistical approach of OutSeekR centers on five statistics calculated for each transcript in the observed data: a z-range statistic based on the mean, a z-range statistic based on the median, a z-range statistic based on the trimmed mean, a k-means-based fraction, and a cosine similarity (these appear in the results as zrange.mean, zrange.median, zrange.trimmean, fraction.kmeans, and cosine.similarity, respectively).
OutSeekR compares the distributions of these five statistics in the observed data with counterparts calculated on simulated null data; observed data yielding statistics more extreme than those of the null data suggest the presence of outliers.
OutSeekR first identifies which transcripts/features contain outlier values. For a given transcript, the null hypothesis is that there are no outlier values (i.e., all values come from a single, unknown population distribution), while the alternative hypothesis is that the transcript contains one or more outlier values drawn from another distribution. A small p-value or FDR-adjusted p-value provides evidence for the presence of one or more outliers.
Note: input data with low variation (e.g., where 50% or more of the values are identical) may cause issues when calculating the MAD or performing k-means clustering. We recommend excluding such transcripts from the analysis to ensure reliable results; a minimal pre-filtering sketch is shown below.
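This sketch is illustrative and not part of OutSeekR; the 50% cutoff mirrors the note above, and fpkm refers to the illustrative matrix constructed earlier:

# Flag transcripts in which 50% or more of the values are identical and drop
# them before running the analysis.
low.variation <- apply(
    X = fpkm,
    MARGIN = 1,
    FUN = function(x) max(table(x)) / length(x) >= 0.5
    );
fpkm.filtered <- fpkm[!low.variation, , drop = FALSE];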
After determining which transcripts contain outliers, an iterative procedure is used to identify the exact number of outliers in each transcript.

## Overview
As an overview, the OutSeekR algorithm does the following:
We hypothesize that the underlying data can be represented by one of
four theoretical distributions: the normal, log-normal, exponential, or
gamma distributions. For each transcript, we fit a generalized additive
model for location, scale, and shape (GAMLSS) to the observed data based
on each distribution using gamlss()
from the gamlss package; the
model with the smallest Bayesian Information Criterion corresponds to
the optimal theoretical distribution for the transcript. To account for
additional variability, we also calculate ‘residuals’ for each
transcript as the differences between the empirical quantiles of the
observed data and the quantiles of the optimal theoretical distribution.
Using the same GAMLSS-based method, we identify the optimal distribution
of these residuals from among the normal, log-normal, exponential, and
gamma distributions. (Note that the optimal distributions of a transcript
and its residuals need not be the same.)
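To make the idea concrete, here is a rough sketch of selecting a distribution for a single transcript by BIC. It is illustrative only and not OutSeekR's internal code; the simulated vector y stands in for one transcript's normalized values, and we rely on the sbc component of a gamlss fit as the reported BIC.

# Fit an intercept-only GAMLSS model under each candidate family and keep the
# family with the smallest BIC (the 'sbc' component of a gamlss fit).
library(gamlss);
set.seed(51);
y <- rgamma(50, shape = 2, rate = 0.5);
candidate.families <- list(
    normal = NO(),
    log.normal = LOGNO(),
    exponential = EXP(),
    gamma = GA()
    );
fits <- lapply(candidate.families, function(fam) gamlss(y ~ 1, family = fam));
bic <- sapply(fits, function(fit) fit$sbc);
names(which.min(bic));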
With the optimal distribution of each observed transcript and its residuals in hand, we can generate the simulated null data. We iterate a user-specified number of times (1,000 by default). At each iteration, we randomly select a transcript from the observed data, generate as many values as there are samples in the observed data according to its optimal distribution, and then add noise drawn from the optimal distribution of its residuals. The final null transcript is the sum of the simulated values and the simulated residuals, and the final data set of simulated transcripts contains as many rows as transcripts selected and the same number of columns as the observed data. While 1,000 iterations are sufficient for exploratory analysis, we recommend at least 10,000 iterations for publication-level results; 100,000 or even one million iterations provide more robust estimates.
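As a conceptual illustration of this step (not OutSeekR's internal code), suppose the randomly selected transcript's optimal distribution is log-normal and its residuals are best described by a normal distribution; the parameter values below are placeholders rather than fitted estimates:

# Simulate one null transcript: draw values from the transcript's optimal
# distribution, then add noise drawn from the optimal distribution of its
# residuals.
n.samples <- 50;
simulated.values <- rlnorm(n.samples, meanlog = 1.2, sdlog = 0.4);
simulated.residuals <- rnorm(n.samples, mean = 0, sd = 0.3);
null.transcript <- simulated.values + simulated.residuals;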
The null data do not contain outliers by construction but are still in some sense representative of the observed data. By calculating the five outlier statistics on these transcripts, we can get a sense of their distributions in data similar to that observed but without outliers; this logic forms the basis for the p-value of the outlier test.
The statistical approach to outlier detection implemented by
OutSeekR is time-consuming. To reduce runtime, the code
in OutSeekR has been written to allow parallelization.
Parallelization is supported through the use of the package future.apply.
future.apply is built on top of the future framework
implemented by the future
package, which provides a uniform way to parallelize code independent of
operating system or computing environment. In particular,
OutSeekR uses the future_*apply()
functions defined in future.apply rather than their
base R equivalents in order to take advantage of a parallelization
strategy set with future::plan()
by the user.
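For example, a user might request four multisession workers before calling detect.outliers() and return to sequential processing afterwards (the worker count here is illustrative):

future::plan(future::multisession, workers = 4);
# ... run detect.outliers() here ...
future::plan(future::sequential);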
Note: depending on the size of the input data set and the number of null transcripts requested, the objects created by OutSeekR may exceed a limit defined in future to prevent exporting very large objects; this will trigger an error with a message referring to future.globals.maxSize. The error can be avoided by setting future.globals.maxSize to a large value, e.g., options(future.globals.maxSize = 8000 * 1024^2) for approximately 8 GB. See the documentation of future.options for further details.
The output of the main function in OutSeekR, detect.outliers(), is a named list with the following components:

- p.values: a matrix of unadjusted p-values for the outlier test run on each transcript in the observed data.
- fdr: a matrix of FDR-adjusted p-values for the outlier test run on each transcript in the observed data.
- num.outliers: a vector giving the number of outliers detected for each transcript based on the threshold.
- outlier.test.results.list: a list of length max(num.outliers) + 1 containing entries roundN, where N is between 1 and max(num.outliers) + 1. roundN is the data frame of results for the outlier test after excluding the N − 1 most outlying samples, with round1 being for the original data set (i.e., before excluding any outlier samples).
- distributions: a numeric vector indicating the optimal distribution for each transcript. Possible values are 1 (normal), 2 (log-normal), 3 (exponential), and 4 (gamma).
- initial.screen.method: the statistical criterion used for initial feature selection. Valid options are ‘p-value’ and ‘FDR’.

OutSeekR includes a sample data set called outliers that we shall use as an example. It is a data frame of 500 transcripts (rows) by 50 samples (columns). Note that transcript identifiers and sample identifiers are stored as rownames(outliers) and colnames(outliers), respectively.
str(outliers, list.len = 5);
#> 'data.frame': 500 obs. of 50 variables:
#> $ S01: num 4.6646 35.139 0.0812 0.9259 16.4565 ...
#> $ S02: num 1.47 63.417 0.183 0.544 9.184 ...
#> $ S03: num 3.462 20.87 0.134 0.279 14.733 ...
#> $ S04: num 7.449 46.638 0.19 0.321 29.328 ...
#> $ S05: num 4.291 38.908 0.222 0.353 10.323 ...
#> [list output truncated]
outliers[1:6, 1:6];
#> S01 S02 S03 S04 S05 S06
#> T001 4.66463723 1.4702647 3.4622036 7.4494185 4.2913230 8.54235165
#> T002 35.13897201 63.4171763 20.8701361 46.6381322 38.9079440 50.53700051
#> T003 0.08116747 0.1826319 0.1336325 0.1898557 0.2218006 0.09160284
#> T004 0.92586980 0.5443416 0.2793449 0.3205108 0.3527784 0.65188688
#> T005 16.45653195 9.1838392 14.7325212 29.3281331 10.3232860 10.45796777
#> T006 0.27666833 0.1454688 0.2383844 0.1947163 0.2205053 0.12940149
outliers[495:500, 45:50];
#> S45 S46 S47 S48 S49 S50
#> T495 1.667967 3.7989164 0.1898684 17.0515778 3.7819927 3.9090922
#> T496 2.764833 1.8242303 4.0182156 2.0022561 1.9854549 1.4610770
#> T497 12.514745 10.6874267 8.0599890 11.1270759 20.6848658 77.9629874
#> T498 3.120374 2.7675979 4.5075638 1.9106114 4.7501422 4.1928879
#> T499 1.381350 0.8598758 1.3305967 1.0744033 0.9516043 0.7567473
#> T500 1.019789 0.9378017 0.8989976 0.3950489 0.6994834 3.2713429
An analysis in OutSeekR is run using the function detect.outliers(). With default settings, it can be run by simply passing the input data set via the data parameter. Other parameters include the following:

- num.null: the number of transcripts to generate when simulating from null distributions; default is 1000.
- initial.screen.method: the statistical criterion for initial transcript selection.
- p.value.threshold: the p-value threshold for the outlier test; default is 0.05.
- fdr.threshold: the false discovery rate (FDR)-adjusted p-value threshold for determining the final count of outliers; default is 0.01.
- kmeans.nstart: the number of random starts when computing the k-means fraction statistic; default is 1. See ?stats::kmeans for further details.

In our first example, we pass the outliers data frame and use the defaults for most other parameters. To keep runtime manageable, we limit the number of null transcripts generated to 1,000. Prior to running detect.outliers(), we set up parallel processing; see the discussion of parallelization in OutSeekR above for further details.
# Set random seed for reproducibility.
set.seed(371892);
# Set up parallel processing.
future::plan(future::multisession);
outlier.test.run.1 <- detect.outliers(
    data = outliers,
    num.null = 1e3
    );
str(outlier.test.run.1, max.level = 2);
#> List of 6
#> $ p.values : num [1:500, 1:3] 0.000999 0.214785 0.227772 0.000999 0.411588 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ fdr : num [1:500, 1:3] 0.00393 0.35211 0.36856 0.00393 0.5562 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ num.outliers : Named int [1:500] 1 0 0 1 0 0 0 0 2 1 ...
#> ..- attr(*, "names")= chr [1:500] "T001" "T002" "T003" "T004" ...
#> $ outlier.test.results.list:List of 3
#> ..$ round1:'data.frame': 500 obs. of 9 variables:
#> ..$ round2:'data.frame': 500 obs. of 9 variables:
#> ..$ round3:'data.frame': 500 obs. of 9 variables:
#> $ distributions : Named int [1:500] 2 2 4 2 2 2 1 1 3 2 ...
#> ..- attr(*, "names")= chr [1:500] "T001" "T002" "T003" "T004" ...
#> $ initial.screen.method : chr "fdr"
# Restore sequential processing.
future::plan(future::sequential);
Unadjusted p-values for the outlier test on each transcript are contained in the p.values entry of the object returned by detect.outliers(); FDR-adjusted p-values are contained in the fdr entry.
head(outlier.test.run.1$p.values);
#> round1 round2 round3
#> T001 0.000999001 0.337662338 0.543456543
#> T002 0.214785215 0.381618382 0.392607393
#> T003 0.227772228 0.435564436 0.980019980
#> T004 0.000999001 0.004995005 0.003996004
#> T005 0.411588412 0.771228771 0.848151848
#> T006 0.016983017 0.453546454 0.438561439
head(outlier.test.run.1$fdr);
#> round1 round2 round3
#> T001 0.003933075 0.85735160 0.99699899
#> T002 0.352106909 0.90004335 0.99699899
#> T003 0.368563475 0.93496247 0.99699899
#> T004 0.003933075 0.02744508 0.07684623
#> T005 0.556200556 0.98849748 0.99699899
#> T006 0.035529324 0.93860388 0.99699899
As explained in the Overview, an iterative procedure is used to determine the exact number of outliers per transcript. The iterations stop once the p-value or FDR-adjusted p-value for a round exceeds the specified threshold.
The num.outliers entry gives transcript-specific counts of the number of outliers detected, based on the threshold associated with the specified initial.screen.method.
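As a rough consistency check (this mirrors, but is not taken from, OutSeekR's internal logic), a transcript's count should equal the number of leading rounds whose FDR-adjusted p-value stays below the threshold; for example, under the first run's default FDR threshold of 0.01:

# Count the leading rounds for transcript T004 with FDR-adjusted p-value below 0.01.
sum(cumprod(outlier.test.run.1$fdr['T004', ] < 0.01));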
We can get a quick view of the distribution of the number of outliers across transcripts using table():
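# Tabulate the number of outliers detected per transcript (output not shown here).
table(outlier.test.run.1$num.outliers);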
More granular results can be found in the outlier.test.results.list entry of the object returned by detect.outliers(). It is a named list of data frames, each reflecting the results of a different ‘round’ of outlier testing; by ‘round’, we mean the process of excluding the most outlying sample (for all rounds except the first), calculating the five outlier statistics, ranking them alongside the statistics of the null data, computing the rank product, and computing the unadjusted p-value. Each data frame in outlier.test.results.list contains the transcript and sample identifiers, the five statistics calculated on the observed transcript, the rank product, and the unadjusted and FDR-adjusted p-values for the outlier test.
str(outlier.test.run.1$outlier.test.results.list);
#> List of 3
#> $ round1:'data.frame': 500 obs. of 9 variables:
#> ..$ sample : chr [1:500] "S11" "S46" "S43" "S10" ...
#> ..$ zrange.mean : num [1:500] 7.1 4.61 5.03 6.31 4.8 ...
#> ..$ zrange.median : num [1:500] 27.5 5.99 5.87 31.95 5.21 ...
#> ..$ zrange.trimmean : num [1:500] 25.15 5.94 6.74 17.87 6.1 ...
#> ..$ fraction.kmeans : num [1:500] 0.02 0.12 0.36 0.06 0.38 0.2 0.46 0.46 0.04 0.04 ...
#> ..$ cosine.similarity: num [1:500] 0.894 0.998 0.994 0.904 0.999 ...
#> ..$ rank.product : num [1:500] 2 237.99 250.1 3.46 420.32 ...
#> ..$ p.value : num [1:500] 0.000999 0.214785 0.227772 0.000999 0.411588 ...
#> ..$ fdr : num [1:500] 0.00393 0.35211 0.36856 0.00393 0.5562 ...
#> $ round2:'data.frame': 500 obs. of 9 variables:
#> ..$ sample : chr [1:500] "S31" "S33" "S18" "S11" ...
#> ..$ zrange.mean : num [1:500] 4.58 4.65 4.95 5.88 4.41 ...
#> ..$ zrange.median : num [1:500] 5.7 5.67 4.83 15.72 4.78 ...
#> ..$ zrange.trimmean : num [1:500] 5.66 6.13 6.16 22.33 5.3 ...
#> ..$ fraction.kmeans : num [1:500] 0.1429 0.449 0.4082 0.0408 0.4082 ...
#> ..$ cosine.similarity: num [1:500] 1 0.997 0.999 0.868 1 ...
#> ..$ rank.product : num [1:500] 348.04 389.84 440.63 5.92 699.47 ...
#> ..$ p.value : num [1:500] 0.338 0.382 0.436 0.005 0.771 ...
#> ..$ fdr : num [1:500] 0.8574 0.9 0.935 0.0274 0.9885 ...
#> $ round3:'data.frame': 500 obs. of 9 variables:
#> ..$ sample : chr [1:500] "S12" "S16" "S11" "S24" ...
#> ..$ zrange.mean : num [1:500] 3.89 4.88 4.16 6.96 4.13 ...
#> ..$ zrange.median : num [1:500] 4.18 5.42 3.79 12.32 4.12 ...
#> ..$ zrange.trimmean : num [1:500] 4.57 6.29 4.93 17.89 4.82 ...
#> ..$ fraction.kmeans : num [1:500] 0.1875 0.4583 0.5 0.0208 0.4583 ...
#> ..$ cosine.similarity: num [1:500] 0.999 0.998 1 0.942 0.999 ...
#> ..$ rank.product : num [1:500] 535.18 400.45 855.99 4.93 755.54 ...
#> ..$ p.value : num [1:500] 0.543 0.393 0.98 0.004 0.848 ...
#> ..$ fdr : num [1:500] 0.997 0.997 0.997 0.0768 0.997 ...
It may be useful to combine the results across all rounds of outlier testing into a single data frame. One way to do this while retaining the round information, using lapply() and rbind(), is shown below:
outlier.test.results.combined <- lapply(
    X = seq_along(outlier.test.run.1$outlier.test.results.list),
    FUN = function(i) {
        df <- outlier.test.run.1$outlier.test.results.list[[i]];
        df$round <- i;
        df <- df[, c(
            'round',
            colnames(outlier.test.run.1$outlier.test.results.list[[i]])
            )];
        df;
        }
    );
outlier.test.results.combined <- do.call(
    what = 'rbind',
    args = outlier.test.results.combined
    );
# Combining the data frames produces duplicates in the row names, which R
# de-duplicates by appending suffixes; we simply discard the modified row
# names here, as the round is already recorded in the 'round' column.
# (To retain transcript identifiers as a column, add df$transcript <-
# rownames(df) inside the function above before reordering the columns.)
rownames(outlier.test.results.combined) <- NULL;
head(outlier.test.results.combined);
#> round sample zrange.mean zrange.median zrange.trimmean fraction.kmeans
#> 1 1 S11 7.095115 27.495396 25.151330 0.02
#> 2 1 S46 4.608063 5.987464 5.935080 0.12
#> 3 1 S43 5.026914 5.867812 6.742630 0.36
#> 4 1 S10 6.311975 31.952082 17.871927 0.06
#> 5 1 S37 4.797063 5.213030 6.098033 0.38
#> 6 1 S22 6.538349 15.108789 12.427457 0.20
#> cosine.similarity rank.product p.value fdr
#> 1 0.8943633 2.000000 0.000999001 0.003933075
#> 2 0.9976231 237.992328 0.214785215 0.352106909
#> 3 0.9940453 250.103114 0.227772228 0.368563475
#> 4 0.9042747 3.457366 0.000999001 0.003933075
#> 5 0.9987598 420.317069 0.411588412 0.556200556
#> 6 0.9687042 20.566080 0.016983017 0.035529324
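One possible use of the combined data frame (the 0.05 cutoff below is purely illustrative) is to pull out the rows flagged at a given FDR:

# Keep only rows whose FDR-adjusted p-value falls below 0.05 (output not shown).
significant.results <- outlier.test.results.combined[
    outlier.test.results.combined$fdr < 0.05,
    ];
head(significant.results);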
We can inspect the optimal theoretical distribution selected for each transcript, as well as the frequency of each distribution across all transcripts.
head(outlier.test.run.1$distributions);
#> T001 T002 T003 T004 T005 T006
#> 2 2 4 2 2 2
table(outlier.test.run.1$distributions);
#>
#> 1 2 3 4
#> 54 251 11 184
The results are numeric codes, with 1 corresponding to the normal distribution, 2 to the log-normal distribution, 3 to the exponential distribution, and 4 to the gamma distribution.
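To report these as names rather than codes, one can map them with a small lookup vector (the label vector below is ours for illustration, not part of OutSeekR's output):

# Translate the numeric codes into distribution names and tabulate (output not shown).
distribution.labels <- c('normal', 'log-normal', 'exponential', 'gamma');
table(distribution.labels[outlier.test.run.1$distributions]);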
Lastly, we show a second run of the algorithm in which we explicitly set the initial screening method and adjust the p-value and FDR thresholds:
# Set up parallel processing.
future::plan(future::multisession);
outlier.test.run.2 <- detect.outliers(
    data = outliers,
    num.null = 1e3,
    initial.screen.method = 'fdr',
    p.value.threshold = 0.25,
    fdr.threshold = 0.05
    );
# Restore sequential processing.
future::plan(future::sequential);
str(outlier.test.run.2, max.level = 2);
#> List of 6
#> $ p.values : num [1:500, 1:4] 0.000999 0.208791 0.22977 0.000999 0.405594 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ fdr : num [1:500, 1:4] 0.00252 0.34116 0.36941 0.00252 0.54758 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ num.outliers : Named int [1:500] 1 0 0 3 0 1 0 0 2 2 ...
#> ..- attr(*, "names")= chr [1:500] "T001" "T002" "T003" "T004" ...
#> $ outlier.test.results.list:List of 4
#> ..$ round1:'data.frame': 500 obs. of 9 variables:
#> ..$ round2:'data.frame': 500 obs. of 9 variables:
#> ..$ round3:'data.frame': 500 obs. of 9 variables:
#> ..$ round4:'data.frame': 500 obs. of 9 variables:
#> $ distributions : Named int [1:500] 2 2 4 2 2 2 1 1 3 2 ...
#> ..- attr(*, "names")= chr [1:500] "T001" "T002" "T003" "T004" ...
#> $ initial.screen.method : chr "fdr"
# Examine p-value and FDR matrices.
head(outlier.test.run.2$p.values);
#> round1 round2 round3 round4
#> T001 0.000999001 0.310689311 0.751248751 0.8211788
#> T002 0.208791209 0.264735265 0.290709291 0.6823177
#> T003 0.229770230 0.444555445 0.966033966 0.8791209
#> T004 0.000999001 0.001998002 0.000999001 0.6163836
#> T005 0.405594406 0.744255744 0.800199800 0.8951049
#> T006 0.007992008 0.428571429 0.427572428 0.4995005
head(outlier.test.run.2$fdr);
#> round1 round2 round3 round4
#> T001 0.00252273 0.796233470 0.99089551 0.9849588
#> T002 0.34116211 0.731312886 0.99089551 0.9849588
#> T003 0.36940551 0.900674129 0.99089551 0.9849588
#> T004 0.00252273 0.008394966 0.01560939 0.9849588
#> T005 0.54758076 0.965355888 0.99089551 0.9849588
#> T006 0.01658093 0.900674129 0.99089551 0.9849588
# Check the distribution of number of outliers detected.
table(outlier.test.run.2$num.outliers);
#>
#> 0 1 2 3
#> 243 116 105 36