Package 'OmicsQC' reference manual

Package 'OmicsQC'

Title:	Nominating Quality Control Outliers in Genomic Profiling Studies
Description:	A method that analyzes quality control metrics from multi-sample genomic sequencing studies and nominates poor quality samples for exclusion. Per sample quality control data are transformed into z-scores and aggregated. The distribution of aggregated z-scores are modelled using parametric distributions. The parameters of the optimal model, selected either by goodness-of-fit statistics or user-designation, are used for outlier nomination. Two implementations of the Cosine Similarity Outlier Detection algorithm are provided with flexible parameters for dataset customization.
Authors:	Anders Hugo Frelin [aut], Helen Zhu [aut], Paul C. Boutros [aut, cre]
Maintainer:	Paul C. Boutros <[email protected]>
License:	GPL-2
Version:	1.1.0
Built:	2025-03-29 04:01:32 UTC
Source:	https://github.com/cran/OmicsQC

Title:

Nominating Quality Control Outliers in Genomic Profiling Studies

Description:

A method that analyzes quality control metrics from multi-sample genomic sequencing studies and nominates poor quality samples for exclusion. Per sample quality control data are transformed into z-scores and aggregated. The distribution of aggregated z-scores are modelled using parametric distributions. The parameters of the optimal model, selected either by goodness-of-fit statistics or user-designation, are used for outlier nomination. Two implementations of the Cosine Similarity Outlier Detection algorithm are provided with flexible parameters for dataset customization.

Authors:

Anders Hugo Frelin [aut], Helen Zhu [aut], Paul C. Boutros [aut, cre]

Maintainer:

Paul C. Boutros <[email protected]>

License:

GPL-2

Version:

1.1.0

Built:

2025-03-29 04:01:32 UTC

Source:

https://github.com/cran/OmicsQC

Help Index

Sum across sign corrected z-scores for total sample quality score

Description

This function takes a dataframe of all the sign corrected scores, thus all negative, and aggregates to get a total sample quality score.

Usage

accumulate.zscores(zscores.corrected, filename = NULL)
accumulate.zscores(zscores.corrected, filename = NULL)

Arguments

`zscores.corrected`	A dataframe whose rows are samples and each column a QC metric
`filename`	A filename where to save data. If NULL data will not be saved to file

Value

A dataframe of aggregated z-scores for each sample

Sample: Sample IDs defined by the rownames of zscores.corrected
Sum: Sum of z-scores

Corrects the z-scores signs according to the metrics

Description

For some metrics a high z-score is good, while for others a low one is good. This function corrects for that so that a negative z-score is a poor score for every metric. It then sets all positive scores to zero.

Usage

correct.zscore.signs(
  zscores,
  signs.data,
  metric.col.name = "Metric",
  signs.col.name = "Sign",
  filename = NULL
)
correct.zscore.signs(
  zscores,
  signs.data,
  metric.col.name = "Metric",
  signs.col.name = "Sign",
  filename = NULL
)

Arguments

`zscores`	A dataframe whose rows are samples and each column a QC metric, entries are z-scores
`signs.data`	A dataframe of two columns, the metric names and the sign of the metric
`metric.col.name`	The name of the column in signs.data that stores the metric name
`signs.col.name`	The name of the column in signs.data that stores sign as 'neg' or 'pos'
`filename`	A filename where to save data. If NULL data will not be saved to file

Value

A dataframe whose rows are the QC metrics, and columns are samples with the z-scores if they are negative

Calculate an outlier cutoff using cosine similarity

Description

This function takes quality.scores, trims it and fits it to the distribution given. It then simulates as many datasets as stated by no.simulations, and computes the cosine similarity of each dataset against theoretical distribution. It uses what would correspond to a significant value to then calculate what observed value this would correspond to. The function supports the following distributions:

'weibull'
'norm'
'gamma'
'exp'
'lnorm'
'cauchy'
'logis'

Usage

cosine.similarity.cutoff(
  quality.scores,
  no.simulations,
  distribution = c("lnorm", "weibull", "norm", "gamma", "exp", "cauchy", "logis"),
  trim.factor = 0.05,
  alpha.significant = 0.05
)
cosine.similarity.cutoff(
  quality.scores,
  no.simulations,
  distribution = c("lnorm", "weibull", "norm", "gamma", "exp", "cauchy", "logis"),
  trim.factor = 0.05,
  alpha.significant = 0.05
)

Arguments

`quality.scores`	A dataframe with columns 'Sum' (of scores) and 'Sample', i.e. the output of accumulate.zscores
`no.simulations`	The number of datasets to simulate
`distribution`	A distribution to test, will default to 'lnorm'
`trim.factor`	What fraction of values of each to trim to get parameters without using extremes
`alpha.significant`	Alpha value for significance

Value

Results in the form of a named list

cutoff: Computed cutoff for aggregated z-scores used as a threshold for nominating outliers
no.outliers: Number of nominated outliers
outlier.labels: Outlier IDs, corresponding to Sample column of quality.scores

Tests the accumulated quality scores for outliers using cosine similarity

Description

This function takes quality.scores, trims it and fits it to the distribution given. It then iteratively tests the largest datapoint compared a null distribution of size no.simulations. If the largest datapoint has a significant p-value it tests the 2nd largest one and so on. The function supports the following distributions:

'weibull'
'norm'
'gamma'
'exp'
'lnorm'
'cauchy'
'logis'

Usage

cosine.similarity.iterative(
  quality.scores,
  no.simulations,
  distribution = c("lnorm", "weibull", "norm", "gamma", "exp", "cauchy", "logis"),
  trim.factor = 0.05,
  alpha.significant = 0.05
)
cosine.similarity.iterative(
  quality.scores,
  no.simulations,
  distribution = c("lnorm", "weibull", "norm", "gamma", "exp", "cauchy", "logis"),
  trim.factor = 0.05,
  alpha.significant = 0.05
)

Arguments

`quality.scores`	A dataframe with columns 'Sum' (of scores) and 'Sample', i.e. the output of accumulate.zscores
`no.simulations`	The number of datasets to simulate
`distribution`	A distribution to test, will default to 'lnorm'
`trim.factor`	What fraction of values of each to trim to get parameters without using extremes
`alpha.significant`	Alpha value for significance

Value

Results in the form of a named list

no.outliers: Number of nominated outliers
outlier.labels: Outlier IDs, corresponding to Sample column of quality.scores

QC metrics across 100 samples

Description

QC metrics across 100 samples

Usage

data(example.qc.dataframe)
data(example.qc.dataframe)

Format

A data frame containing QC data; columns represent QC metrics and rows represent samples

Examples


data(example.qc.dataframe)
zscores.from.metrics(
  qc.data = example.qc.dataframe
  );

data(example.qc.dataframe)
zscores.from.metrics(
  qc.data = example.qc.dataframe
  );

Fits the QC data to distributions and returns the KS test result and BIC score

Description

This function takes the accumulated QC scores, a vector of distributions and a trimming factor. It then returns the results for each distribution in a dataframe. This function supports the following distributions:

'weibull'
'norm'
'gamma'
'exp'
'lnorm'
'cauchy'
'logis'

Usage

fit.and.evaluate(
  quality.scores,
  distributions = c("weibull", "norm", "gamma", "exp", "lnorm", "cauchy", "logis"),
  trim.factor = 0.05
)
fit.and.evaluate(
  quality.scores,
  distributions = c("weibull", "norm", "gamma", "exp", "lnorm", "cauchy", "logis"),
  trim.factor = 0.05
)

Arguments

`quality.scores`	The accumulated QC scores, the output of accumulate.zscores
`distributions`	A vector of distributions to fit and test
`trim.factor`	The fraction of extremes on each end to trim before fitting

Value

A dataframe of the results with the following columns

distribution: Name of the fitted distribution
KS.rejected: Whether the Kolmogorov-Smirnov test rejects the fit; see fitdistrplus::gofstat - kstest
BIC.value: Bayesian Information Criterion

Generates the standard barplot of scores for each sample

Description

This function takes the quality score data generated by accumulate.zscores() and returns the barplot if no filename is specified. If filename is is specified it saves the plot as file and returns NULL. It will also draw a cut-off for which samples to exclude. get.qc.barplot offers a standard template for generating a QC barplot, but can also take any parameter that BoutrosLab.plotting.general::create.barplot takes for more customizability.

Usage

get.qc.barplot(
  quality.scores,
  filename = NULL,
  abline.h = -20,
  yaxis.cex = 0.8,
  xaxis.cex = 0,
  yaxis.tck = 1,
  xaxis.tck = 0,
  xlab.label = "",
  ylab.label = "Sum of Z (Z < 0)",
  ylab.cex = 1,
  abline.col = "darkgrey",
  axes.lwd = 1,
  ...
)
get.qc.barplot(
  quality.scores,
  filename = NULL,
  abline.h = -20,
  yaxis.cex = 0.8,
  xaxis.cex = 0,
  yaxis.tck = 1,
  xaxis.tck = 0,
  xlab.label = "",
  ylab.label = "Sum of Z (Z < 0)",
  ylab.cex = 1,
  abline.col = "darkgrey",
  axes.lwd = 1,
  ...
)

Arguments

`quality.scores`	A dataframe with columns 'Sum' (of scores) and 'Sample', i.e. the output of accumulate.zscores
`filename`	Filename for tiff output, or if NULL returns the trellis object itself
`abline.h`	Adds a horizontal line to the plot; useful for depicting the threshold for what is deemed a poor sample quality score
`yaxis.cex`	Size of y-axis tick labels, defaults to 0.8
`xaxis.cex`	Size of x-axis tick labels, defaults to 0
`yaxis.tck`	Specifies the length of the tick marks for y-axis, defaults to 1
`xaxis.tck`	Specifies the length of the tick marks for x-axis, defaults to 0
`xlab.label`	The label for the x-axis, defaults to ”
`ylab.label`	label for the y-axis, defaults to 'Sum of Z (Z < 0)'
`ylab.cex`	Size of y-axis label, defaults to 1
`abline.col`	Colour of the horizontal line on the plot, defaults to 'darkgrey'
`axes.lwd`	Specify line width of the axes; set to 0 to turn off axes
`...`	The function can also take any parameter that BoutrosLab.plotting.general::create.barplot takes

Value

The barplot or NULL depending if filename is specified

Generates the standard heatmap of scores for each sample.

Description

This function takes the the scores for each sample and each metric, after being sign-corrected, and returns the standard heatmap, if filename is NULL. If filename is not NULL it saves the heatmap to file and returns NULL. The function also takes quality.scores to make sure the samples are ordered correctly, as well as the y labels for the quality metrics. get.qc.heatmap offers a standard template for generating a QC heatmap, but can also take any parameter that BoutrosLab.plotting.general::create.barplot takes for customisability.

Usage

get.qc.heatmap(
  zscores,
  quality.scores,
  yaxis.lab = colnames(zscores),
  xaxis.lab = quality.scores[, "Sample"],
  filename = NULL,
  yaxis.cex = 0.8,
  xaxis.cex = 0,
  xlab.cex = 1,
  xlab.label = "Samples",
  clustering.method = "none",
  colour.scheme = c("red", "white"),
  colour.centering.value = 0,
  colourkey.labels.at = c(-10:0),
  colourkey.cex = 1,
  at = seq(0, -10, -2),
  same.as.matrix = TRUE,
  row.lines = seq(1, ncol(zscores), 1) + 0.5,
  grid.row = TRUE,
  row.colour = "black",
  row.lwd = 1,
  axes.lwd = 1,
  ...
)
get.qc.heatmap(
  zscores,
  quality.scores,
  yaxis.lab = colnames(zscores),
  xaxis.lab = quality.scores[, "Sample"],
  filename = NULL,
  yaxis.cex = 0.8,
  xaxis.cex = 0,
  xlab.cex = 1,
  xlab.label = "Samples",
  clustering.method = "none",
  colour.scheme = c("red", "white"),
  colour.centering.value = 0,
  colourkey.labels.at = c(-10:0),
  colourkey.cex = 1,
  at = seq(0, -10, -2),
  same.as.matrix = TRUE,
  row.lines = seq(1, ncol(zscores), 1) + 0.5,
  grid.row = TRUE,
  row.colour = "black",
  row.lwd = 1,
  axes.lwd = 1,
  ...
)

Arguments

`zscores`	A dataframe of (sign-corrected) z-scores for each sample and test metric, i.e. the output of correct.zscore.signs
`quality.scores`	A dataframe with columns 'Sum' (of scores) and 'Sample', i.e. the output of accumulate.zscores
`yaxis.lab`	A vector of metric labels for the y-axis; defaults to column names of z-scores
`xaxis.lab`	A vector of sample labels for the x-axis; defaults to ordered Sample column elements in quality.scores
`filename`	Filename for tiff output, or if NULL returns the trellis object itself
`yaxis.cex`	Size of y-axis tick labels, defaults to 0.8
`xaxis.cex`	Size of x-axis tick labels, defaults to 0
`xlab.cex`	Size of x-axis label, defaults to 1
`xlab.label`	The label for the x-axis, defaults so 'Sample'
`clustering.method`	Method used to cluster the records – “none” gives unclustered data. Accepts all agglomerative clustering methods available in hclust, plus “diana” (which is divisive).
`colour.scheme`	Heatmap colouring. Accepts old-style themes, or a vector of either two or three colours that are gradiated to create the final palette.
`colour.centering.value`	What should be the center of the colour-map
`colourkey.labels.at`	A vector specifying the tick-positions on the colourkey
`colourkey.cex`	Size of colourkey label text
`at`	A vector specifying the breakpoints along the range of x; each interval specified by these breakpoints are assigned to a colour from the palette. Defaults to seq(0, -10, -2), to give a clear discrete display of colours. If x has values outside of the range specified by “at” those values are shown with the colours corresponding to the extreme ends of the colour spectrum and a warning is given.
`same.as.matrix`	Prevents the flipping of the matrix that the function normally does
`row.lines`	Vector specifying location of lines, default is seq(1, ncol(x), 1) + 0.5. Note: Add 0.5 to customized vector
`grid.row`	Allow turning off of the interior grid-lines. Default is TRUE.
`row.colour`	Interior grid-line colour, defaults to “black”. Can be a vector
`row.lwd`	Interior grid-line width, defaults to 1. Setting to zero is equivalent to grid.row = FALSE and grid.col = FALSE. Can be a vector.
`axes.lwd`	Width of heatmap border. Note it also changes the colourkey border and ticks
`...`	The function can also take any parameter that BoutrosLab.plotting.general::create.heatmap takes

Value

The heatmap or NULL depending if filename is specified

Generates the multipanel plot of heatmap and barplot

Description

This function takes the barplot and heatmap and returns the multipanel plot of the two.

Usage

get.qc.multipanelplot(
  barplot,
  heatmap,
  filename = NULL,
  width = 10,
  height = 8,
  layout.height = 2,
  layout.width = 1,
  plot.objects.heights = c(1, 3),
  y.spacing = -1,
  ylab.axis.padding = -19,
  left.padding = 8,
  main = "QC Summary",
  main.cex = 1,
  ...
)
get.qc.multipanelplot(
  barplot,
  heatmap,
  filename = NULL,
  width = 10,
  height = 8,
  layout.height = 2,
  layout.width = 1,
  plot.objects.heights = c(1, 3),
  y.spacing = -1,
  ylab.axis.padding = -19,
  left.padding = 8,
  main = "QC Summary",
  main.cex = 1,
  ...
)

Arguments

`barplot`	A barplot of the samples aggregated score
`heatmap`	A heatmap of the sign-corrected scores for each sample
`filename`	Filename to output to
`width`	Width of resulting file
`height`	Height of resulting file
`layout.height`	how many plots per column
`layout.width`	how many plots per row.
`plot.objects.heights`	Heights of each row of the plot. Must be vector of same size as layout.height
`y.spacing`	vertical spacing between each plot. Can be single value or vector of length layout.height - 1
`ylab.axis.padding`	padding between axis and y label of plots. Can be single value or vector of length layout.width
`left.padding`	padding from the left side of the frame
`main`	main label text
`main.cex`	main label cex
`...`	The function can also take any parameter that BoutrosLab.plotting.general::create.multipanelplot takes

Value

The multipanelplot or NULL depending if filename is specified

Directionality of QC metrics

Description

Directionality of QC metrics

Usage

data(sign.correction)
data(sign.correction)

Format

A data frame containing the following columns: Metric, Sign

Metric: Quality control metrics, corresponding to the metrics in example.qc.dataframe
Sign: Directionality of each metric; positive (pos) means a higher metric is better, negative (neg) means a lower metric is better

Examples


data(sign.correction)
data(example.qc.dataframe)
correct.zscore.signs(
  zscores = example.qc.dataframe,
  signs.data = sign.correction,
  metric.col.name = 'Metric',
  signs.col.name = 'Sign',
  );

data(sign.correction)
data(example.qc.dataframe)
correct.zscore.signs(
  zscores = example.qc.dataframe,
  signs.data = sign.correction,
  metric.col.name = 'Metric',
  signs.col.name = 'Sign',
  );

Formatted QC metrics labels

Description

Formatted QC metrics labels

Usage

data(ylabels)
data(ylabels)

Format

A character vector of formatted QC metric labels

Examples


data(ylabels)
data(example.qc.dataframe)
data(sign.correction)
zscores <- zscores.from.metrics(qc.data = example.qc.dataframe);
zscores.corrected <- correct.zscore.signs(
  zscores = zscores,
  signs.data = sign.correction,
  metric.col.name = 'Metric',
  signs.col.name = 'Sign'
);
quality.scores <- accumulate.zscores(zscores.corrected = zscores.corrected);
qc.heatmap <- get.qc.heatmap(
  zscores = zscores.corrected,
  quality.scores = quality.scores,
  yaxis.lab = ylabels
);

data(ylabels)
data(example.qc.dataframe)
data(sign.correction)
zscores <- zscores.from.metrics(qc.data = example.qc.dataframe);
zscores.corrected <- correct.zscore.signs(
  zscores = zscores,
  signs.data = sign.correction,
  metric.col.name = 'Metric',
  signs.col.name = 'Sign'
);
quality.scores <- accumulate.zscores(zscores.corrected = zscores.corrected);
qc.heatmap <- get.qc.heatmap(
  zscores = zscores.corrected,
  quality.scores = quality.scores,
  yaxis.lab = ylabels
);

Calculate z-scores for each metric across each sample

Description

This function takes a dataframe of QC metrics, and calculates the the z-scores. If filename is specified, the results will be saved to file.

Usage

zscores.from.metrics(qc.data, filename = NULL)
zscores.from.metrics(qc.data, filename = NULL)

Arguments

`qc.data`	A dataframe whose rows are samples and each column a QC metric
`filename`	A filename where to save data. If NULL data will not be saved to file

Value

A dataframe of z-scores; rows correspond to samples and columns correspond to metrics from qc.data