Package 'OutSeekR'

Title: Statistical Approach to Outlier Detection in RNA-Seq and Related Data
Description: An approach to outlier detection in RNA-seq and related data based on five statistics. 'OutSeekR' implements an outlier test by comparing the distributions of these statistics in observed data with those of simulated null data.
Authors: Jee Yun Han [aut], John Sahrmann [aut], Jaron Arbet [ctb], Paul Boutros [aut, cre, cph]
Maintainer: Paul Boutros <[email protected]>
License: GPL-2
Version: 1.0.0
Built: 2024-11-20 06:26:43 UTC
Source: https://github.com/cran/OutSeekR

Help Index


Calculate p-values

Description

Calculate p-values for each sample of a single transcript.

Usage

calculate.p.values(
  x,
  x.distribution,
  x.zrange.mean,
  x.zrange.median,
  x.zrange.trimmean,
  x.fraction.kmeans,
  x.cosine.similarity,
  null.zrange.mean,
  null.zrange.median,
  null.zrange.trimmean,
  null.fraction.kmeans,
  null.cosine.similarity,
  kmeans.nstart = 1
)

Arguments

x

A numeric vector of values for an observed transcript.

x.distribution

A numeric code corresponding to the optimal distribution of x as returned by identify.bic.optimal.data.distribution().

x.zrange.mean

A number, the range of the z-scores calculated using the mean and standard deviation of x.

x.zrange.median

A number, the range of the z-scores calculated using the median and median absolute deviation of x.

x.zrange.trimmean

A number, the range of the z-scores calculated using the trimmed mean and trimmed standard deviation of x.

x.fraction.kmeans

A number, the k-means fraction of x.

x.cosine.similarity

A number, the cosine similarity of x.

null.zrange.mean

A numeric vector, the ranges of the z-scores calculated using the mean and standard deviation of each transcript in the null data.

null.zrange.median

A numeric vector, the ranges of the z-scores calculated using the median and median absolute deviation of each transcript in the null data.

null.zrange.trimmean

A numeric vector, the ranges of the z-scores calculated using the trimmed mean and trimmed standard deviation of each transcript in the null data.

null.fraction.kmeans

A numeric vector, the k-means fraction of each transcript in the null data.

null.cosine.similarity

A numeric vector, the cosine similarity of each transcript in the null data.

kmeans.nstart

The number of random starts when computing k-means fraction; default is 1. See ?stats::kmeans for further details.

Value

A list consisting of the following entries:

  • p.values: a vector of p-values for the outlier test run on each sample (up until the p-value exceeds p.value.threshold); and

  • outlier.statistics.list: a list of vectors containing the values of the outlier statistics calculated from the remaining samples. The list has length one plus the total number of outliers (i.e., the number of samples with an outlier test p-value less than p.value.threshold) and contains entries outlier.statistics.N, where N is between zero and the total number of outliers. outlier.statistics.N is the vector of outlier statistics after excluding the Nth outlier sample, with outlier.statistics.0 being for the complete transcript.

Examples

data(example.data.for.calculate.p.values);
i <- 1; # row index of transcript to test
calculate.p.values(
   x = example.data.for.calculate.p.values$data[i,],
   x.distribution = example.data.for.calculate.p.values$x.distribution[i],
   x.zrange.mean = example.data.for.calculate.p.values$x.zrange.mean[i],
   x.zrange.median = example.data.for.calculate.p.values$x.zrange.median[i],
   x.zrange.trimmean = example.data.for.calculate.p.values$x.zrange.trimmean[i],
   x.fraction.kmeans = example.data.for.calculate.p.values$x.fraction.kmeans[i],
   x.cosine.similarity = example.data.for.calculate.p.values$x.cosine.similarity[i],
   null.zrange.mean = example.data.for.calculate.p.values$null.zrange.mean,
   null.zrange.median = example.data.for.calculate.p.values$null.zrange.median,
   null.zrange.trimmean = example.data.for.calculate.p.values$null.zrange.trimmean,
   null.fraction.kmeans = example.data.for.calculate.p.values$null.fraction.kmeans,
   null.cosine.similarity = example.data.for.calculate.p.values$null.cosine.similarity,
   kmeans.nstart = example.data.for.calculate.p.values$kmeans.nstart
   );
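
One generic way to compare an observed statistic with simulated null values is an empirical p-value. The sketch below is only an illustration of that idea in base R: the helper name is hypothetical, it handles a single statistic, and it assumes that larger values are more extreme. It is not necessarily how calculate.p.values() combines the five statistics.

```r
# Hypothetical helper for illustration; not part of OutSeekR.
# Assumes larger values of the statistic are more extreme.
empirical.p.value <- function(observed, null.values) {
    # Add one to numerator and denominator so the p-value is never zero.
    (sum(null.values >= observed) + 1) / (length(null.values) + 1);
    }

# Made-up null statistics, one per simulated null transcript.
null.statistics <- c(1.2, 2.5, 3.1, 2.0, 4.8, 2.2, 3.3, 1.9, 2.7);
p.sketch <- empirical.p.value(
    observed = 4.5,
    null.values = null.statistics
    );
p.sketch;  # (1 + 1) / (9 + 1) = 0.2
```

With only nine null values the smallest attainable p-value is 0.1, which is one reason a larger null sample gives finer resolution.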

Calculate residuals

Description

Calculate residuals between quantiles of the input and quantiles of one of four distributions: normal, log-normal, exponential, or gamma.

Usage

calculate.residuals(x, distribution)

Arguments

x

A numeric vector.

distribution

A number corresponding to the optimal distribution of x as returned by, e.g., identify.bic.optimal.data.distribution(). One of

  • 1 = normal,

  • 2 = log-normal,

  • 3 = exponential, and

  • 4 = gamma.

Value

A numeric vector of the same length as x. Names are not retained.

Examples

# Generate fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
    );
names(x) <- paste(
    'Sample',
    seq_along(x),
    sep = '.'
    );
calculate.residuals(
    x = x,
    distribution = 4
    );
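
The idea of quantile residuals can be sketched in base R. The helper below is hypothetical and is not the package's code: it assumes a normal reference distribution fitted with the sample mean and standard deviation, and it takes residuals as observed minus theoretical quantiles. calculate.residuals() may fit parameters and order its result differently.

```r
# Hypothetical helper for illustration; not part of OutSeekR.
sketch.quantile.residuals <- function(x) {
    # Sorted observations act as the observed quantiles; names are dropped.
    observed <- sort(unname(x));
    probabilities <- stats::ppoints(length(observed));
    # Assumption: normal reference distribution with sample mean and sd.
    theoretical <- stats::qnorm(
        p = probabilities,
        mean = mean(observed),
        sd = stats::sd(observed)
        );
    observed - theoretical;
    }

set.seed(1234);
x <- rgamma(n = 20, shape = 2, scale = 2);
r.sketch <- sketch.quantile.residuals(x);
length(r.sketch);
```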

Detect outliers

Description

Detect outliers in normalized RNA-seq data.

Usage

detect.outliers(
  data,
  num.null = 1000,
  initial.screen.method = c("fdr", "p.value"),
  p.value.threshold = 0.05,
  fdr.threshold = 0.01,
  kmeans.nstart = 1
)

Arguments

data

A matrix or data frame of normalized RNA-seq data, organized with transcripts on rows and samples on columns. Transcript identifiers should be stored as rownames(data).

num.null

The number of transcripts to generate when simulating from null distributions; default is 1000. We recommend using at least 10,000 iterations for publication-level results, with 100,000 or even one million iterations providing more robust estimates.

initial.screen.method

The statistical criterion for initial gene selection; valid options are 'fdr' and 'p.value', as listed in Usage.

p.value.threshold

The p-value threshold for the outlier test; default is 0.05. Once the p-value for a sample exceeds p.value.threshold, testing for that transcript ceases, and all remaining samples will have p-values equal to NA.

fdr.threshold

The false discovery rate (FDR)-adjusted p-value threshold for determining the final count of outliers; default is 0.01.

kmeans.nstart

The number of random starts when computing k-means fraction; default is 1. See ?stats::kmeans for further details.

Value

A list consisting of the following entries:

  • p.values: a matrix of unadjusted p-values for the outlier test run on each transcript in data.

  • fdr: a matrix of FDR-adjusted p-values for the outlier test run on each transcript in data.

  • num.outliers: a vector giving the number of outliers detected for each transcript based on the threshold.

  • outlier.test.results.list: a list of length max(num.outliers) + 1 containing entries roundN, where N is between one and max(num.outliers) + 1. roundN is the data frame of results for the outlier test after excluding the (N-1)th outlier sample, with round1 being for the original data set (i.e., before excluding any outlier samples).

  • distributions: a numeric vector indicating the optimal distribution for each transcript. Possible values are 1 (normal), 2 (log-normal), 3 (exponential), and 4 (gamma).

  • initial.screen.method: the statistical criterion used for initial feature selection, either 'fdr' or 'p.value' (matching the initial.screen.method argument).

Examples

data(outliers);
outliers.subset <- outliers[1:10,];
results <- detect.outliers(
   data = outliers.subset,
   num.null = 10
   );
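
The relationship between unadjusted p-values, FDR-adjusted p-values, and a count of outliers can be illustrated with stats::p.adjust(). This is a sketch with made-up p-values; it assumes a Benjamini-Hochberg adjustment over one vector and does not show over which set of tests detect.outliers() actually applies its adjustment.

```r
# Made-up unadjusted p-values for illustration only.
p.values.sketch <- c(0.0005, 0.002, 0.03, 0.2, 0.6);

# Benjamini-Hochberg adjustment ('fdr' is an alias for 'BH').
fdr.sketch <- stats::p.adjust(p.values.sketch, method = 'fdr');

# Count the tests whose adjusted p-value falls below the threshold.
fdr.threshold <- 0.01;
num.outliers.sketch <- sum(fdr.sketch < fdr.threshold);
num.outliers.sketch;  # 2
```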

example.data.for.calculate.p.values

Description

Example data (list object) for testing calculate.p.values().

Usage

example.data.for.calculate.p.values

Format

An object of class list of length 13.


Identify optimal distribution of data

Description

Identify which of four distributions—normal, log-normal, exponential, or gamma—best fits the given data according to BIC.

Usage

identify.bic.optimal.data.distribution(x)

Arguments

x

A numeric vector.

Value

A numeric code representing which distribution optimally fits x. Possible values are

  • 1 = normal,

  • 2 = log-normal,

  • 3 = exponential, and

  • 4 = gamma.

Examples

# Generate fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
    );
identify.bic.optimal.data.distribution(
    x = x
    );
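
Selecting a distribution by BIC can be sketched in base R as follows. The helper is hypothetical and is not the package's implementation: it assumes strictly positive data, uses closed-form maximum likelihood estimates where they exist, and fits the gamma distribution numerically with stats::optim(). The order of the result mirrors the codes 1 to 4 above.

```r
# Hypothetical helper for illustration; not part of OutSeekR.
# Assumes all values of x are strictly positive.
sketch.bic.by.distribution <- function(x) {
    n <- length(x);
    bic <- function(log.likelihood, num.parameters) {
        -2 * log.likelihood + num.parameters * log(n);
        }
    # Closed-form maximum likelihood estimates of the standard deviations.
    normal.sd <- sqrt(mean((x - mean(x))^2));
    lognormal.sd <- sqrt(mean((log(x) - mean(log(x)))^2));
    # Gamma has no closed form; optimize on the log scale, starting
    # from method-of-moments estimates of shape and rate.
    gamma.negative.log.likelihood <- function(log.parameters) {
        -sum(stats::dgamma(
            x,
            shape = exp(log.parameters[1]),
            rate = exp(log.parameters[2]),
            log = TRUE
            ));
        }
    gamma.fit <- stats::optim(
        par = log(c(mean(x)^2 / stats::var(x), mean(x) / stats::var(x))),
        fn = gamma.negative.log.likelihood
        );
    c(
        normal = bic(sum(stats::dnorm(x, mean(x), normal.sd, log = TRUE)), 2),
        lognormal = bic(sum(stats::dlnorm(x, mean(log(x)), lognormal.sd, log = TRUE)), 2),
        exponential = bic(sum(stats::dexp(x, rate = 1 / mean(x), log = TRUE)), 1),
        gamma = bic(-gamma.fit$value, 2)
        );
    }

set.seed(1234);
x <- rgamma(n = 20, shape = 2, scale = 2);
bic.values <- sketch.bic.by.distribution(x);
# The position of the smallest BIC plays the role of the numeric code.
which.min(bic.values);
```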

Identify optimal distribution of residuals

Description

Identify which of four distributions—normal, log-normal, exponential, or gamma—best fits the given vector of residuals according to BIC.

Usage

identify.bic.optimal.residuals.distribution(x)

Arguments

x

A numeric vector.

Value

A numeric code representing which distribution optimally fits x. Possible values are

  • 1 = normal,

  • 2 = log-normal,

  • 3 = exponential, and

  • 4 = gamma.

Examples

# Generate fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
    );
identify.bic.optimal.residuals.distribution(
    x = x
    );

k-means fraction

Description

Given a vector of cluster assignments from quantify.outliers() run with method = 'kmeans', compute the fraction of observations belonging to the smaller of the two clusters.

Usage

kmeans.fraction(x)

Arguments

x

A numeric vector.

Details

This function only considers clusters 1 and 2 even if quantify.outliers() was run with exclude.zero = TRUE. In that case, zeros are effectively excluded from the counts used to define the k-means fraction. See examples.

Value

A number.

Examples

x <- c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2);
names(x) <- letters[1:length(x)];
kmeans.fraction(x);
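
The computation described in Details can be sketched in base R. The helper below is hypothetical and reflects one reading of the text: only clusters 1 and 2 are counted, so observations coded 0 contribute to neither the numerator nor the denominator.

```r
# Hypothetical helper for illustration; not part of OutSeekR.
sketch.smaller.cluster.fraction <- function(assignments) {
    # Count only clusters 1 and 2; any other code (e.g., 0) is ignored.
    counts <- c(
        sum(assignments == 1, na.rm = TRUE),
        sum(assignments == 2, na.rm = TRUE)
        );
    min(counts) / sum(counts);
    }

# Two of ten observations are in the smaller cluster.
sketch.smaller.cluster.fraction(c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2));  # 0.2
# Zeros in their own cluster (coded 0) are left out: two of eight.
sketch.smaller.cluster.fraction(c(0, 0, 1, 1, 2, 2, 2, 2, 2, 2));  # 0.25
```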

Cosine similarity

Description

Compute cosine similarity for detection of outliers. Generate theoretical quantiles based on the optimal distribution of the data, and compute cosine similarity between a point made up of the largest observed quantile and the largest theoretical quantile and a point on the line y = x.

Usage

outlier.detection.cosine(x, distribution)

Arguments

x

A numeric vector.

distribution

A numeric code corresponding to the optimal distribution of x as returned by identify.bic.optimal.data.distribution().

Value

A number.

Examples

# Generate fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
    );
outlier.detection.cosine(
    x = x,
    distribution = 4
    );
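
The geometry in the Description can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the helper name is hypothetical, the theoretical quantiles come from a normal distribution with the sample mean and standard deviation, and (1, 1) is used as the point on the line y = x.

```r
# Hypothetical helper for illustration; not part of OutSeekR.
sketch.cosine.similarity <- function(a, b) {
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)));
    }

set.seed(1234);
x <- rgamma(n = 20, shape = 2, scale = 2);

# Assumption: normal theoretical quantiles at standard plotting positions.
theoretical <- stats::qnorm(
    p = stats::ppoints(length(x)),
    mean = mean(x),
    sd = stats::sd(x)
    );

# Point made of the largest observed and largest theoretical quantile.
largest.point <- c(max(x), max(theoretical));
similarity <- sketch.cosine.similarity(largest.point, c(1, 1));
similarity;
```

A value close to 1 means the largest observation sits near the line y = x, i.e., close to where the reference distribution would place it; the further the point drifts from that line, the smaller the similarity.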

Example data set for outlier testing

Description

Example data set for outlier testing

Usage

outliers

Format

A data frame with 500 rows and 50 columns:

S01

simulated fragments per kilobase of transcript per million fragments mapped (FPKM) values for sample 1

S02

simulated FPKM values for sample 2

S03

simulated FPKM values for sample 3

S04

simulated FPKM values for sample 4

S05

simulated FPKM values for sample 5

S06

simulated FPKM values for sample 6

S07

simulated FPKM values for sample 7

S08

simulated FPKM values for sample 8

S09

simulated FPKM values for sample 9

S10

simulated FPKM values for sample 10

S11

simulated FPKM values for sample 11

S12

simulated FPKM values for sample 12

S13

simulated FPKM values for sample 13

S14

simulated FPKM values for sample 14

S15

simulated FPKM values for sample 15

S16

simulated FPKM values for sample 16

S17

simulated FPKM values for sample 17

S18

simulated FPKM values for sample 18

S19

simulated FPKM values for sample 19

S20

simulated FPKM values for sample 20

S21

simulated FPKM values for sample 21

S22

simulated FPKM values for sample 22

S23

simulated FPKM values for sample 23

S24

simulated FPKM values for sample 24

S25

simulated FPKM values for sample 25

S26

simulated FPKM values for sample 26

S27

simulated FPKM values for sample 27

S28

simulated FPKM values for sample 28

S29

simulated FPKM values for sample 29

S30

simulated FPKM values for sample 30

S31

simulated FPKM values for sample 31

S32

simulated FPKM values for sample 32

S33

simulated FPKM values for sample 33

S34

simulated FPKM values for sample 34

S35

simulated FPKM values for sample 35

S36

simulated FPKM values for sample 36

S37

simulated FPKM values for sample 37

S38

simulated FPKM values for sample 38

S39

simulated FPKM values for sample 39

S40

simulated FPKM values for sample 40

S41

simulated FPKM values for sample 41

S42

simulated FPKM values for sample 42

S43

simulated FPKM values for sample 43

S44

simulated FPKM values for sample 44

S45

simulated FPKM values for sample 45

S46

simulated FPKM values for sample 46

S47

simulated FPKM values for sample 47

S48

simulated FPKM values for sample 48

S49

simulated FPKM values for sample 49

S50

simulated FPKM values for sample 50


Compute quantities for outlier detection

Description

Compute quantities for use in the detection of outliers. Specifically, compute z-scores based on the mean / standard deviation, the trimmed mean / trimmed standard deviation, or the median / median absolute deviation, or the cluster assignment from k-means with two clusters.

Usage

quantify.outliers(
  x,
  method = "mean",
  trim = 0,
  nstart = 1,
  exclude.zero = FALSE
)

Arguments

x

A numeric vector.

method

A string indicating the quantities to be computed. Possible values are

  • 'mean' : z-scores based on mean and standard deviation or trimmed mean and trimmed standard deviation if trim > 0,

  • 'median' : z-scores based on median and median absolute deviation, or

  • 'kmeans' : cluster assignment from k-means with two clusters.

The default is 'mean', i.e., z-scores based on the mean and standard deviation.

trim

A number, the fraction of observations to be trimmed from each end of x. Default is no trimming.

nstart

A number, for k-means clustering, the number of random initial centers for the clusters. Default is 1. See stats::kmeans() for further information.

exclude.zero

A logical, whether zeros should be excluded (TRUE) or not excluded (FALSE, the default) from computations. For method = 'mean' and method = 'median', this means zeros will not be included in computing the summary statistics; for method = 'kmeans', this means zeros will be placed in their own cluster, coded 0.

Value

A numeric vector the same size as x whose values are the requested quantities computed on the corresponding elements of x.

Examples

# Generate fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
    );
# Add missing values and zeros for demonstration.  Missing values are
# ignored, and zeros can be ignored with `exclude.zero = TRUE`.
x[1:5] <- NA;
x[6:10] <- 0;

# Compute z-scores based on mean and standard deviation.
quantify.outliers(
    x = x,
    method = 'mean',
    trim = 0
    );
# Exclude zeros from the calculation of the mean and standard
# deviation.
quantify.outliers(
    x = x,
    method = 'mean',
    trim = 0,
    exclude.zero = TRUE
    );

# Compute z-scores based on the 5% trimmed mean and 5% trimmed
# standard deviation.
quantify.outliers(
    x = x,
    method = 'mean',
    trim = 0.05
    );

# Compute z-scores based on the median and median absolute deviation.
quantify.outliers(
    x = x,
    method = 'median'
    );

# Compute cluster assignments using k-means with k = 2.
quantify.outliers(
    x = x,
    method = 'kmeans'
    );
# Try different initial cluster assignments.
quantify.outliers(
    x = x,
    method = 'kmeans',
    nstart = 10
    );
# Assign zeros to their own cluster.
quantify.outliers(
    x = x,
    method = 'kmeans',
    exclude.zero = TRUE
    );
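
For reference, the quantities named in the Description can be approximated directly in base R. The sketch below makes several assumptions that may not match quantify.outliers(): stats::mad() is used with its default scaling constant, the trimmed statistics drop round(n * trim) values from each end, and missing values and zeros are not handled.

```r
set.seed(1234);
x <- rgamma(n = 20, shape = 2, scale = 2);

# z-scores based on the mean and standard deviation.
z.mean <- (x - mean(x)) / stats::sd(x);

# z-scores based on the median and median absolute deviation.
z.median <- (x - stats::median(x)) / stats::mad(x);

# z-scores based on the 5% trimmed mean and trimmed standard deviation.
x.sorted <- sort(x);
num.trimmed <- round(length(x) * 0.05);
x.trimmed <- x.sorted[(num.trimmed + 1):(length(x) - num.trimmed)];
z.trimmean <- (x - mean(x.trimmed)) / stats::sd(x.trimmed);

# Cluster assignments from k-means with two clusters.
clusters <- stats::kmeans(x, centers = 2, nstart = 1)$cluster;
table(clusters);
```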

Simulate from a null distribution

Description

Simulate transcripts from a specified null distribution.

Usage

simulate.null(x, x.distribution, r, r.distribution)

Arguments

x

A numeric vector of values for a transcript.

x.distribution

A numeric code corresponding to the optimal distribution of x as returned by identify.bic.optimal.data.distribution(). Possible values are

  • 1 = normal,

  • 2 = log-normal,

  • 3 = exponential, and

  • 4 = gamma.

r

A numeric vector of residuals calculated for this transcript.

r.distribution

A numeric code corresponding to the optimal distribution of r as returned by identify.bic.optimal.residuals.distribution(). Possible values are the same as those for x.distribution.

Value

A numeric vector of the same length as x. Names are not retained.

Examples

# Prepare fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
    );
names(x) <- paste('Sample', seq_along(x), sep = '.');
x.dist <- identify.bic.optimal.data.distribution(
    x = x
    );
r <- calculate.residuals(
    x = x,
    distribution = x.dist
    );
r.trimmed <- trim.sample(
    x = r
    );
r.dist <- identify.bic.optimal.residuals.distribution(
    x = r.trimmed
    );
null <- simulate.null(
    x = x,
    x.distribution = x.dist,
    r = r.trimmed,
    r.distribution = r.dist
    );

Trim a vector of numbers

Description

Symmetrically trim a vector of numbers after sorting it.

Usage

trim.sample(x, trim = 0.05)

Arguments

x

A numeric vector.

trim

A number, the fraction of observations to be trimmed from each end of x.

Details

If length(x) <= 10, the function returns x[2:(length(x) - 1)].

Value

A sorted, trimmed copy of x.

Examples

trim.sample(
    x = 1:20,
    trim = 0.05
    );
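
The behaviour described above can be sketched in base R. The helper is hypothetical and rests on assumptions: it removes round(length(x) * trim) values from each end, it applies the short-vector rule from Details after sorting, and it does not handle vectors shorter than three elements.

```r
# Hypothetical helper for illustration; not part of OutSeekR.
sketch.symmetric.trim <- function(x, trim = 0.05) {
    x.sorted <- sort(x);
    n <- length(x.sorted);
    if (n <= 10) {
        # Short vectors: drop only the smallest and largest values.
        return(x.sorted[2:(n - 1)]);
        }
    # Assumption: the number trimmed per end is round(n * trim).
    num.trimmed <- round(n * trim);
    x.sorted[(num.trimmed + 1):(n - num.trimmed)];
    }

sketch.symmetric.trim(1:20, trim = 0.05);  # 2, 3, ..., 19
sketch.symmetric.trim(c(5, 3, 1, 4, 2));   # 2 3 4
```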

Range of z-scores

Description

Compute the range of a vector of z-scores.

Usage

zrange(x)

Arguments

x

A numeric vector.

Value

A number.

Examples

set.seed(1234);
x <- rnorm(
    n = 10
    );
zrange(
    x = x
    );
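
Assuming that "range" here means the difference between the largest and smallest z-score (an interpretation, not something the text above states), the same quantity can be computed in base R:

```r
# Made-up z-scores for illustration only.
z <- c(-1, 0, 2.5);

# Difference between the maximum and the minimum.
z.range.sketch <- diff(range(z));
z.range.sketch;  # 3.5
```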