Title: | Utilities for the Application of a Polygenic Score to a VCF |
---|---|
Description: | Simple and transparent parsing of genotype/dosage data from an input Variant Call Format (VCF) file, matching of genotype coordinates to the component Single Nucleotide Polymorphisms (SNPs) of an existing polygenic score (PGS), and application of SNP weights to dosages for the calculation of a polygenic score for each individual in accordance with the additive weighted sum of dosages model. Methods are designed in reference to best practices described by Collister, Liu, and Clifton (2022) <doi:10.3389/fgene.2022.818574>. |
Authors: | Paul Boutros [cre], Nicole Zeltser [aut]
|
Maintainer: | Paul Boutros <[email protected]> |
License: | GPL-2 |
Version: | 3.0.1 |
Built: | 2025-03-06 06:00:09 UTC |
Source: | https://github.com/cran/ApplyPolygenicScore |
Apply a polygenic score to VCF data.
apply.polygenic.score( vcf.data, pgs.weight.data, phenotype.data = NULL, phenotype.analysis.columns = NULL, correct.strand.flips = TRUE, remove.ambiguous.allele.matches = FALSE, remove.mismatched.indels = FALSE, output.dir = NULL, file.prefix = NULL, missing.genotype.method = "mean.dosage", use.external.effect.allele.frequency = FALSE, n.percentiles = NULL, analysis.source.pgs = NULL, validate.inputs.only = FALSE )
apply.polygenic.score( vcf.data, pgs.weight.data, phenotype.data = NULL, phenotype.analysis.columns = NULL, correct.strand.flips = TRUE, remove.ambiguous.allele.matches = FALSE, remove.mismatched.indels = FALSE, output.dir = NULL, file.prefix = NULL, missing.genotype.method = "mean.dosage", use.external.effect.allele.frequency = FALSE, n.percentiles = NULL, analysis.source.pgs = NULL, validate.inputs.only = FALSE )
vcf.data |
A data.frame containing VCF genotype data as formatted by |
pgs.weight.data |
A data.frame containing PGS weight data as formatted by |
phenotype.data |
A data.frame containing phenotype data. Must have an Indiv column matching vcf.data. Default is |
phenotype.analysis.columns |
A character vector of phenotype columns from phenotype.data to analyze in a regression analsyis. Default is |
correct.strand.flips |
A logical indicating whether to check PGS weight data/VCF genotype data matches for strand flips and correct them. Default is |
remove.ambiguous.allele.matches |
A logical indicating whether to remove PGS variants with ambiguous allele matches between PGS weight data and VCF genotype data. Default is |
remove.mismatched.indels |
A logical indicating whether to remove indel variants that are mismatched between PGS weight data and VCF genotype data. Default is |
output.dir |
A character string indicating the directory to write output files. Separate files are written for per-sample pgs results and optional regression results. Files are tab-separate .txt files. Default is NULL in which case no files are written. |
file.prefix |
A character string to prepend to the output file names. Default is |
missing.genotype.method |
A character string indicating the method to handle missing genotypes. Options are "mean.dosage", "normalize", or "none". Default is "mean.dosage". |
use.external.effect.allele.frequency |
A logical indicating whether to use an external effect allele frequency for calculating mean dosage when handling missing genotypes. Default is |
n.percentiles |
An integer indicating the number of percentiles to calculate for the PGS. Default is |
analysis.source.pgs |
A character string indicating the source PGS for percentile calculation and regression analyses. Options are "mean.dosage", "normalize", or "none".
When not specified, defaults to |
validate.inputs.only |
A logical indicating whether to only perform input data validation checks without running PGS application.
If no errors are triggered, a message is printed and TRUE is returned. Default is |
A list containing per-sample PGS output and per-phenotype regression output if phenotype analysis columns are provided.
Output Structure
The outputed list contains the following elements:
pgs.output: A data.frame containing the PGS per sample and optional phenotype data.
regression.output: A data.frame containing the results of the regression analysis if phenotype.analysis.columns are provided, otherwise NULL
.
pgs.output columns:
Indiv
: A character string indicating the sample ID.
PGS
: A numeric vector indicating the PGS per sample. (only if missing.genotype.method includes "none")
PGS.with.normalized.missing
: A numeric vector indicating the PGS per sample with missing genotypes normalized. (only if missing.genotype.method includes "normalize")
PGS.with.replaced.missing
: A numeric vector indicating the PGS per sample with missing genotypes replaced by mean dosage. (only if missing.genotype.method includes "mean.dosage")
percentile
: A numeric vector indicating the percentile rank of the PGS.
decile
: A numeric vector indicating the decile rank of the PGS.
quartile
: A numeric vector indicating the quartile rank of the PGS.
percentile.X:
A numeric vector indicating the user-specified percentile rank of the PGS where "X" is substituted by n.percentiles
. (only if n.percentiles
is specified)
n.missing.genotypes
: A numeric vector indicating the number of missing genotypes per sample.
percent.missing.genotypes
: A numeric vector indicating the percentage of missing genotypes per sample.
All columns in phenotype.data
if provided.
regression.output columns:
phenotype: A character vector of phenotype names.
model
: A character vector indicating the regression model used. One of "logistic.regression" or "linear.regression".
beta
: A numeric vector indicating the beta coefficient of the regression analysis.
se
: A numeric vector indicating the standard error of the beta coefficient.
p.value
: A numeric vector indicating the p-value of the beta coefficient.
r.squared
: A numeric vector indicating the r-squared value of linear regression analysis. NA for logistic regression.
AUC
: A numeric vector indicating the area under the curve of logistic regression analysis. NA for linear regression.
PGS Calculation
PGS for each individual i is calculated as the sum of the product of the dosage and beta coefficient for each variant in the PGS:
Where m is a PGS component variant out of a total M variants.
Missing Genotype Handling
VCF genotype data are matched to PGS data by chromosome and position. If a SNP cannot be matched by genomic coordinate, an attempt is made to match by rsID (if available). If a SNP from the PGS weight data is not found in the VCF data after these two matching attempts, it is considered a cohort-wide missing variant.
Missing genotypes (in individual samples) among successfully matched variants are handled by three methods:
none
: Missing genotype dosages are excluded from the PGS calculation.
This is equivalent to assuming that all missing genotypes are homozygous for the non-effect allele, resulting in a dosage of 0.
normalize
: Missing genotypes are excluded from score calculation but the final score is normalized by the number of non-missing alleles.
The calculation assumes a diploid genome:
Where P is the ploidy and has the value 2
and is the number of non-missing genotypes.
mean.dosage
: Missing genotype dosages are replaced by the mean population dosage of the variant which is calculated as the product of the effect allele frequency EAF and the ploidy of a diploid genome:
where k is a PGS component variant that is missing in between 1 and n-1 individuals in the cohort and P = ploidy = 2
This dosage calculation holds under assumptions of Hardy-Weinberg equilibrium.
By default, the effect allele frequency is calculated from the provided VCF data.
For variants that are missing in all individuals (cohort-wide), dosage is assumed to be zero (homozygous non-reference) for all individuals.
An external allele frequency can be provided in the pgs.weight.data
as a column named allelefrequency_effect
and by setting use.external.effect.allele.frequency
to TRUE
.
Multiallelic Site Handling
If a PGS weight file provides weights for multiple effect alleles, the appropriate dosage is calculated for the alleles that each individual carries. It is assumed that multiallelic variants are encoded in the same row in the VCF data. This is known as "merged" format. Split multiallelic sites are not accepted. VCF data can be formatted to merged format using external tools for VCF file manipulation.
Allele Mismatch Handling
Variants from the PGS weight data are merged with records in the VCF data by genetic coordinate.
After the merge is complete, there may be cases where the VCF reference (REF) and alternative (ALT) alleles do not match their conventional counterparts in the
PGS weight data (other allele and effect allele, respectively).
This is usually caused by a strand flip: the variant in question was called against opposite DNA reference strands in the PGS training data and the VCF data.
Strand flips can be detected and corrected by flipping the affected allele to its reverse complement.
apply.polygenic.score
uses assess.pgs.vcf.allele.match
to assess allele concordance, and is controlled through the following arguments:
correct.strand.flips
: When TRUE
, detected strand flips are corrected by flipping the affected value in the effect_allele
column prior to dosage calling.
remove.ambiguous.allele.matches
: Corresponds to the return.ambiguous.as.missing
argument in assess.pgs.vcf.allele.match
. When TRUE
, non-INDEL allele
mismatches that cannot be resolved (due to palindromic alleles or causes other than strand flips) are removed by marking the affected value in the effect_allele
column as missing
prior to dosage calling and missing genotype handling. The corresponding dosage is set to NA and the variant is handled according to the chosen missing genotype method.
remove.mismatched.indels
: Corresponds to the return.indels.as.missing
argument in assess.pgs.vcf.allele.match
. When TRUE
, INDEL allele mismatches
(which cannot be assessed for strand flips) are removed by marking the affected value in the effect_allele
column as missing prior to dosage calling and missing genotype handling.
The corresponding dosage is set to NA and the variant is handled according to the chosen missing genotype method.
Note that an allele match assessment requires the presence of both the other_allele
and effect_allele
in the PGS weight data.
The other_allele
column is not required by the PGS Catalog, and so is not always available.
# Example VCF vcf.path <- system.file( 'extdata', 'HG001_GIAB.vcf.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); vcf.import <- import.vcf(vcf.path); # Example pgs weight file pgs.weight.path <- system.file( 'extdata', 'PGS000662_hmPOS_GRCh38.txt.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); pgs.import <- import.pgs.weight.file(pgs.weight.path); pgs.data <- apply.polygenic.score( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data, missing.genotype.method = 'none' ); # Specify different methods for handling missing genotypes pgs.import$pgs.weight.data$allelefrequency_effect <- rep(0.5, nrow(pgs.import$pgs.weight.data)); pgs.data <- apply.polygenic.score( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data, missing.genotype.method = c('none', 'mean.dosage', 'normalize'), use.external.effect.allele.frequency = TRUE ); # Specify allele mismatch handling pgs.data <- apply.polygenic.score( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data, correct.strand.flips = TRUE, remove.ambiguous.allele.matches = TRUE, remove.mismatched.indels = FALSE ); # Provide phenotype data for basic correlation analysis phenotype.data <- data.frame( Indiv = unique(vcf.import$dat$Indiv), continuous.phenotype = rnorm(length(unique(vcf.import$dat$Indiv))), binary.phenotype = sample( c('a', 'b'), length(unique(vcf.import$dat$Indiv)), replace = TRUE ) ); pgs.data <- apply.polygenic.score( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data, phenotype.data = phenotype.data ); # Only run validation checks on input data and report back apply.polygenic.score( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data, validate.inputs.only = TRUE );
# Example VCF vcf.path <- system.file( 'extdata', 'HG001_GIAB.vcf.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); vcf.import <- import.vcf(vcf.path); # Example pgs weight file pgs.weight.path <- system.file( 'extdata', 'PGS000662_hmPOS_GRCh38.txt.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); pgs.import <- import.pgs.weight.file(pgs.weight.path); pgs.data <- apply.polygenic.score( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data, missing.genotype.method = 'none' ); # Specify different methods for handling missing genotypes pgs.import$pgs.weight.data$allelefrequency_effect <- rep(0.5, nrow(pgs.import$pgs.weight.data)); pgs.data <- apply.polygenic.score( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data, missing.genotype.method = c('none', 'mean.dosage', 'normalize'), use.external.effect.allele.frequency = TRUE ); # Specify allele mismatch handling pgs.data <- apply.polygenic.score( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data, correct.strand.flips = TRUE, remove.ambiguous.allele.matches = TRUE, remove.mismatched.indels = FALSE ); # Provide phenotype data for basic correlation analysis phenotype.data <- data.frame( Indiv = unique(vcf.import$dat$Indiv), continuous.phenotype = rnorm(length(unique(vcf.import$dat$Indiv))), binary.phenotype = sample( c('a', 'b'), length(unique(vcf.import$dat$Indiv)), replace = TRUE ) ); pgs.data <- apply.polygenic.score( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data, phenotype.data = phenotype.data ); # Only run validation checks on input data and report back apply.polygenic.score( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data, validate.inputs.only = TRUE );
Assess whether PGS reference and effect alleles match provided VCF reference and alternative alleles. Mismatches are checked for potential switching of effect and reference PGS alleles (cases where the effect allele is the REF VCF allele) and are evaluated for DNA strand flips (by flipping the PGS alleles). INDEL alleles are not supported for strand flip assessment.
assess.pgs.vcf.allele.match( vcf.ref.allele, vcf.alt.allele, pgs.ref.allele, pgs.effect.allele, return.indels.as.missing = FALSE, return.ambiguous.as.missing = FALSE )
assess.pgs.vcf.allele.match( vcf.ref.allele, vcf.alt.allele, pgs.ref.allele, pgs.effect.allele, return.indels.as.missing = FALSE, return.ambiguous.as.missing = FALSE )
vcf.ref.allele |
A character vector of singular VCF reference (REF) alleles. |
vcf.alt.allele |
A character vector of VCF alternative (ALT) alleles. Multiple alleles at a multiallelic site must be separated by commas. |
pgs.ref.allele |
A character vector of singular PGS reference alleles aka "non-effect" or "other" alleles. |
pgs.effect.allele |
A character vector of singular PGS effect alleles. |
return.indels.as.missing |
A logical value indicating whether to return NA for INDEL alleles with detected mismatches. Default is |
return.ambiguous.as.missing |
A logical value indicating whether to return NA for ambiguous cases where both a strand flip and effect switch are possible,
or no strand flip is detected and a mismatch cannot be resolved. Default is |
A list containing the match assessment, a new PGS effect allele, and a new PGS other allele.
Output Structure
The outputed list contains the following elements:
match.status
: A character vector indicating the match status for each pair of allele pairs. Possible values are default_match
, effect_switch
, strand_flip
, effect_switch_with_strand_flip
, ambiguous_flip
, indel_mismatch
, and unresolved_mismatch
.
new.pgs.effect.allele
: A character vector of new PGS effect alleles based on the match status. If the match status is default_match
, effect_switch
or missing_allele
, the original PGS effect allele is returned.
If the match status is strand_flip
or effect_switch_with_strand_flip
the flipped PGS effect allele is returned. If the match status is ambiguous_flip
, indel_mismatch
, or unresolved_mismatch
,
the return value is either the original allele or NA as dictated by the return.indels.as.missing
and return.ambiguous.as.missing
parameters.
new.pgs.other.allele
: A character vector of new PGS other alleles based on the match status, following the same logic as new.pgs.effect.allele
.
The match.status output indicates the following:
default_match
: The default PGS reference allele matches the VCF REF allele and the default PGS effect allele matches one of the VCF ALT alleles.
effect_switch
: The PGS effect allele matches the VCF REF allele and the PGS reference allele matches one of the VCF ALT alleles.
strand_flip
: The PGS reference and effect alleles match their respective VCF pairs when flipped.
effect_switch_with_strand_flip
: The PGS effect allele matches the VCF REF allele and the PGS reference allele matches one of the VCF ALT alleles when flipped.
ambiguous_flip
: Both an effect switch and a strand flip have been detected. This is an ambiguous case caused by palindromic SNPs.
indel_mismatch
: A mismatch was detected between pairs of alleles where at least one was an INDEL. INDEL alleles are not supported for strand flip assessment.
unresolved_mismatch
: A mismatch was detected between pairs of non-INDEL alleles that could not be resolved by an effect switch or flipping the PGS alleles.
missing_allele
: One of the four alleles is missing, making it impossible to assess the match.
# Example data demonstrating the following cases in each vector element: # 1. no strand flips # 2. effect allele switch # 3. strand flip # 4. effect allele switch AND strand flip # 5. palindromic (ambiguous) alleles # 6. unresolved mismatch vcf.ref.allele <- c('A', 'A', 'A', 'A', 'A', 'A'); vcf.alt.allele <- c('G', 'G', 'G', 'G', 'T', 'G'); pgs.ref.allele <- c('A', 'G', 'T', 'C', 'T', 'A'); pgs.effect.allele <- c('G', 'A', 'C', 'T', 'A', 'C'); assess.pgs.vcf.allele.match(vcf.ref.allele, vcf.alt.allele, pgs.ref.allele, pgs.effect.allele);
# Example data demonstrating the following cases in each vector element: # 1. no strand flips # 2. effect allele switch # 3. strand flip # 4. effect allele switch AND strand flip # 5. palindromic (ambiguous) alleles # 6. unresolved mismatch vcf.ref.allele <- c('A', 'A', 'A', 'A', 'A', 'A'); vcf.alt.allele <- c('G', 'G', 'G', 'G', 'T', 'G'); pgs.ref.allele <- c('A', 'G', 'T', 'C', 'T', 'A'); pgs.effect.allele <- c('G', 'A', 'C', 'T', 'A', 'C'); assess.pgs.vcf.allele.match(vcf.ref.allele, vcf.alt.allele, pgs.ref.allele, pgs.effect.allele);
Check that a PGS weight file contains the required columns for PGS application with apply.polygenic.score
.
check.pgs.weight.columns(pgs.weight.colnames, harmonized = TRUE)
check.pgs.weight.columns(pgs.weight.colnames, harmonized = TRUE)
pgs.weight.colnames |
A character vector of column names. |
harmonized |
A logical indicating whether the presence of harmonized columns should be checked. |
A logical indicating whether the file contains the required columns.
Merge overlapping PGS coordinates in multiple BED files.
combine.pgs.bed( pgs.bed.list, add.annotation.data = FALSE, annotation.column.index = 4, slop = 0 )
combine.pgs.bed( pgs.bed.list, add.annotation.data = FALSE, annotation.column.index = 4, slop = 0 )
pgs.bed.list |
A named list of data.frames containing PGS coordinates in BED format. |
add.annotation.data |
A logical indicating whether an additional annotation data column should be added to the annotation column. |
annotation.column.index |
An integer indicating the index of the column in the data frames in |
slop |
An integer indicating the number of base pairs to add to the BED interval on either side. |
A data.frame containing the merged PGS coordinates in BED format with an extra annotation column containing the name of the PGS and data from one additional column optionally selected by the user.
bed1 <- data.frame( chr = c(1, 2, 3), start = c(1, 2, 3), end = c(2, 3, 4), annotation = c('a', 'b', 'c') ); bed2 <- data.frame( chr = c(1, 2, 3), start = c(1, 20, 30), end = c(2, 21, 31), annotation = c('d', 'e', 'f') ); bed.input <- list(bed1 = bed1, bed2 = bed2); combine.pgs.bed(bed.input);
bed1 <- data.frame( chr = c(1, 2, 3), start = c(1, 2, 3), end = c(2, 3, 4), annotation = c('a', 'b', 'c') ); bed2 <- data.frame( chr = c(1, 2, 3), start = c(1, 20, 30), end = c(2, 21, 31), annotation = c('d', 'e', 'f') ); bed.input <- list(bed1 = bed1, bed2 = bed2); combine.pgs.bed(bed.input);
Match PGS SNPs to corresponding VCF information by genomic coordinates or rsID using a merge operation.
combine.vcf.with.pgs(vcf.data, pgs.weight.data)
combine.vcf.with.pgs(vcf.data, pgs.weight.data)
vcf.data |
A data.frame containing VCF data. Required columns: |
pgs.weight.data |
A data.frame containing PGS data. Required columns: |
A list containing a data.frame of merged VCF and PGS data and a data.frame of PGS SNPs missing from the VCF.
A primary merge is first performed on chromosome and base pair coordinates. For SNPs that could not be matched in the first mergs, a second merge is attempted by rsID if available. This action can account for short INDELs that can have coordinate mismatches between the PGS and VCF data. The merge is a left outer join: all PGS SNPs are kept as rows even if they are missing from the VCF, and all VCF SNPs that are not a component of the PGS are dropped. If no PGS SNPs are present in the VCF, the function will terminate with an error.
# Example VCF vcf.path <- system.file( 'extdata', 'HG001_GIAB.vcf.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); vcf.import <- import.vcf(vcf.path); # Example pgs weight file pgs.weight.path <- system.file( 'extdata', 'PGS000662_hmPOS_GRCh38.txt.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); pgs.import <- import.pgs.weight.file(pgs.weight.path); merge.data <- combine.vcf.with.pgs( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data );
# Example VCF vcf.path <- system.file( 'extdata', 'HG001_GIAB.vcf.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); vcf.import <- import.vcf(vcf.path); # Example pgs weight file pgs.weight.path <- system.file( 'extdata', 'PGS000662_hmPOS_GRCh38.txt.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); pgs.import <- import.pgs.weight.file(pgs.weight.path); merge.data <- combine.vcf.with.pgs( vcf.data = vcf.import$dat, pgs.weight.data = pgs.import$pgs.weight.data );
Convert a population allele frequency to a mean dosage for that allele.
convert.allele.frequency.to.dosage(allele.frequency)
convert.allele.frequency.to.dosage(allele.frequency)
allele.frequency |
A numeric vector of allele frequencies. |
A numeric vector of mean dosages for the allele frequencies.
allele.frequency <- seq(0.1, 0.9, 0.1); convert.allele.frequency.to.dosage(allele.frequency);
allele.frequency <- seq(0.1, 0.9, 0.1); convert.allele.frequency.to.dosage(allele.frequency);
Convert genotype calls in the form of witten out alleles (e.g. 'A/T') to dosages (0, 1, 2) based on provided risk alleles from a PGS.
convert.alleles.to.pgs.dosage(called.alleles, risk.alleles)
convert.alleles.to.pgs.dosage(called.alleles, risk.alleles)
called.alleles |
A vector of genotypes in allelic notation separated by a slash or pipe. |
risk.alleles |
A vector of risk alleles from a polygenic score corresponding to each genotype (by locus) in called.alleles. |
A vector of dosages corresponding to each genotype in called.alleles. Hemizygous genotypes (one allele e.g. 'A') are counted as 1.
called.alleles <- c('A/A', 'A/T', 'T/T'); risk.alleles <- c('T', 'T', 'T'); convert.alleles.to.pgs.dosage(called.alleles, risk.alleles);
called.alleles <- c('A/A', 'A/T', 'T/T'); risk.alleles <- c('T', 'T', 'T'); convert.alleles.to.pgs.dosage(called.alleles, risk.alleles);
Convert imported and formatted PGS compnent SNP coordinate data to BED format.
convert.pgs.to.bed( pgs.weight.data, chr.prefix = TRUE, numeric.sex.chr = FALSE, slop = 0 )
convert.pgs.to.bed( pgs.weight.data, chr.prefix = TRUE, numeric.sex.chr = FALSE, slop = 0 )
pgs.weight.data |
A data.frame containing SNP coordinate data with standardized CHROM and POS columns. |
chr.prefix |
A logical indicating whether the 'chr' prefix should be used when formatting chromosome name. |
numeric.sex.chr |
A logical indicating whether the sex chromosomes should be formatted numerically, as opposed to alphabetically. |
slop |
An integer indicating the number of base pairs to add to the BED interval on either side. |
A data.frame containing the PGS component SNP coordinate data in BED format and any other columns provided in pgs.weight.data.
pgs.weight.data <- data.frame( CHROM = c('chr1', 'chrX'), POS = c(10, 20) ); convert.pgs.to.bed(pgs.weight.data); # Switch between different chromosome notations convert.pgs.to.bed(pgs.weight.data, chr.prefix = FALSE, numeric.sex.chr = TRUE); # Add slop to BED intervals convert.pgs.to.bed(pgs.weight.data, slop = 10);
pgs.weight.data <- data.frame( CHROM = c('chr1', 'chrX'), POS = c(10, 20) ); convert.pgs.to.bed(pgs.weight.data); # Switch between different chromosome notations convert.pgs.to.bed(pgs.weight.data, chr.prefix = FALSE, numeric.sex.chr = TRUE); # Add slop to BED intervals convert.pgs.to.bed(pgs.weight.data, slop = 10);
Plot density curves of PGS data outputted by apply.polygenic.score
.
If phenotype columns are provided, multiple density curves are plotted for automatically detected categories for each categorical variable.
create.pgs.density.plot( pgs.data, phenotype.columns = NULL, output.dir = NULL, filename.prefix = NULL, file.extension = "png", tidy.titles = FALSE, width = 10, height = 10, xaxes.cex = 1.5, yaxes.cex = 1.5, titles.cex = 1.5, key.cex = 1, border.padding = 1 )
create.pgs.density.plot( pgs.data, phenotype.columns = NULL, output.dir = NULL, filename.prefix = NULL, file.extension = "png", tidy.titles = FALSE, width = 10, height = 10, xaxes.cex = 1.5, yaxes.cex = 1.5, titles.cex = 1.5, key.cex = 1, border.padding = 1 )
pgs.data |
data.frame PGS data as formatted by |
phenotype.columns |
character vector of phenotype columns in |
output.dir |
character directory to save output plots |
filename.prefix |
character prefix for output filenames |
file.extension |
character file extension for output plots |
tidy.titles |
logical whether to reformat PGS plot titles to remove periods |
width |
numeric width of output plot in inches |
height |
numeric height of output plot in inches |
xaxes.cex |
numeric size for all x-axis labels |
yaxes.cex |
numeric size for all y-axis labels |
titles.cex |
numeric size for all plot titles |
key.cex |
numeric size of color key legend |
border.padding |
numeric padding for plot borders |
If no output directory is provided, a multipanel lattice plot object is returned, otherwise a plot is written to the indicated path and NULL
is returned.
set.seed(100); pgs.data <- data.frame( PGS = rnorm(100, 0, 1) ); temp.dir <- tempdir(); # Basic Plot create.pgs.density.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'basic-plot', width = 6, height = 6 ); # Plot multiple PGS outputs pgs.data$PGS.with.normalized.missing <- rnorm(100, 1, 1); create.pgs.density.plot(pgs.data, output.dir = temp.dir); # Plot phenotype categories pgs.data$sex <- sample(c('male', 'female', 100, replace = TRUE)); create.pgs.density.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'multiple-pgs', phenotype.columns = 'sex' ); # Plot multiple phenotypes pgs.data$letters <- sample(letters[1:5], 100, replace = TRUE); create.pgs.density.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'multiple-phenotypes', phenotype.columns = c('sex', 'letters') );
set.seed(100); pgs.data <- data.frame( PGS = rnorm(100, 0, 1) ); temp.dir <- tempdir(); # Basic Plot create.pgs.density.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'basic-plot', width = 6, height = 6 ); # Plot multiple PGS outputs pgs.data$PGS.with.normalized.missing <- rnorm(100, 1, 1); create.pgs.density.plot(pgs.data, output.dir = temp.dir); # Plot phenotype categories pgs.data$sex <- sample(c('male', 'female', 100, replace = TRUE)); create.pgs.density.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'multiple-pgs', phenotype.columns = 'sex' ); # Plot multiple phenotypes pgs.data$letters <- sample(letters[1:5], 100, replace = TRUE); create.pgs.density.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'multiple-phenotypes', phenotype.columns = c('sex', 'letters') );
Plot PGS percentile rank of each sample outputted by apply.polygenic.score()
as a barplot, plot missing genotypes if any are present, plot corresponding decile and quartile markers as a heatmap, optionally plot phenotype covariates as color bars.
create.pgs.rank.plot( pgs.data, phenotype.columns = NULL, missing.genotype.style = "count", categorical.palette = NULL, binary.palette = NULL, output.dir = NULL, filename.prefix = NULL, file.extension = "png", width = 8, height = 8, xaxis.cex = 1.2, yaxis.cex = 1, titles.cex = 1.2, border.padding = 1 )
create.pgs.rank.plot( pgs.data, phenotype.columns = NULL, missing.genotype.style = "count", categorical.palette = NULL, binary.palette = NULL, output.dir = NULL, filename.prefix = NULL, file.extension = "png", width = 8, height = 8, xaxis.cex = 1.2, yaxis.cex = 1, titles.cex = 1.2, border.padding = 1 )
pgs.data |
data.frame PGS data as formatted by |
phenotype.columns |
character vector of column names in pgs.data containing phenotype covariates to plot as color bars. Default is |
missing.genotype.style |
character style of missing genotype barplot. Default is "count". Options are "count" or "percent". |
categorical.palette |
character vector of colors to use for categorical phenotype covariates. Default is |
binary.palette |
character vector of colors to use for binary and continuous phenotype covariates. Each color is contrasted with white to create a color ramp or binary categories.
Default is |
output.dir |
character directory path to write plot to file. Default is |
filename.prefix |
character prefix for plot filename. |
file.extension |
character file extension for plot file. Default is "png". |
width |
numeric width of plot in inches. |
height |
numeric height of plot in inches. |
xaxis.cex |
numeric size of x-axis labels. |
yaxis.cex |
numeric size of y-axis labels. |
titles.cex |
numeric size of plot titles. |
border.padding |
numeric padding around plot border. |
If no output directory is provided, a multipanel lattice plot object is returned, otherwise a plot is written to the indicated path and NULL
is returned.
For clarity, certain plot aspects change when sample size exceeds 50:
x-axis labels are no longer displayed
missing (NA) values are not labeled with text in heatmaps but are color-coded with a legend
Colors for continuous and binary phenotypes are chosen from the binary color palettes in BoutrosLab.plotting.general::default.colours()
.
Colors for categorical phenotypes are chosen by default from the qualitative color palette in BoutrosLab.plotting.general::default.colours()
.
set.seed(200); percentiles <- get.pgs.percentiles(rnorm(200, 0, 1)); pgs.data <- data.frame( Indiv = paste0('sample', 1:200), percentile = percentiles$percentile, decile = percentiles$decile, quartile = percentiles$quartile, n.missing.genotypes = sample(1:10, 200, replace = TRUE), percent.missing.genotypes = sample(1:10, 200, replace = TRUE) / 100, continuous.pheno = rnorm(200, 1, 1), categorical.pheno = sample(letters[1:5], 200, replace = TRUE), binary.pheno = sample(c(0,1), 200, replace = TRUE) ); temp.dir <- tempdir(); create.pgs.rank.plot( pgs.data, phenotype.columns = c('continuous.pheno', 'categorical.pheno', 'binary.pheno'), missing.genotype.style = 'percent', output.dir = temp.dir, filename.prefix = 'example-rank-plot' );
set.seed(200); percentiles <- get.pgs.percentiles(rnorm(200, 0, 1)); pgs.data <- data.frame( Indiv = paste0('sample', 1:200), percentile = percentiles$percentile, decile = percentiles$decile, quartile = percentiles$quartile, n.missing.genotypes = sample(1:10, 200, replace = TRUE), percent.missing.genotypes = sample(1:10, 200, replace = TRUE) / 100, continuous.pheno = rnorm(200, 1, 1), categorical.pheno = sample(letters[1:5], 200, replace = TRUE), binary.pheno = sample(c(0,1), 200, replace = TRUE) ); temp.dir <- tempdir(); create.pgs.rank.plot( pgs.data, phenotype.columns = c('continuous.pheno', 'categorical.pheno', 'binary.pheno'), missing.genotype.style = 'percent', output.dir = temp.dir, filename.prefix = 'example-rank-plot' );
Create scatterplots for PGS data outputed by apply.polygenic.score()
with continuous phenotype variables
create.pgs.with.continuous.phenotype.plot( pgs.data, phenotype.columns, hexbin.threshold = 1000, hexbin.colour.scheme = NULL, hexbin.colourkey = TRUE, hexbin.colourcut = seq(0, 1, length = 11), hexbin.mincnt = 1, hexbin.maxcnt = NULL, hexbin.xbins = 30, hexbin.aspect = 1, output.dir = NULL, filename.prefix = NULL, file.extension = "png", tidy.titles = FALSE, compute.correlation = TRUE, corr.legend.corner = c(0, 1), corr.legend.cex = 1.5, include.origin = FALSE, width = 10, height = 10, xaxes.cex = 1.5, yaxes.cex = 1.5, titles.cex = 1.5, point.cex = 0.75, border.padding = 1 )
create.pgs.with.continuous.phenotype.plot( pgs.data, phenotype.columns, hexbin.threshold = 1000, hexbin.colour.scheme = NULL, hexbin.colourkey = TRUE, hexbin.colourcut = seq(0, 1, length = 11), hexbin.mincnt = 1, hexbin.maxcnt = NULL, hexbin.xbins = 30, hexbin.aspect = 1, output.dir = NULL, filename.prefix = NULL, file.extension = "png", tidy.titles = FALSE, compute.correlation = TRUE, corr.legend.corner = c(0, 1), corr.legend.cex = 1.5, include.origin = FALSE, width = 10, height = 10, xaxes.cex = 1.5, yaxes.cex = 1.5, titles.cex = 1.5, point.cex = 0.75, border.padding = 1 )
pgs.data |
data.frame PGS data as formatted by |
phenotype.columns |
character vector of continuous phenotype column names in pgs.data to plot |
hexbin.threshold |
numeric threshold (exclusive) for cohort size at which to switch from scatterplot to hexbin plot. |
hexbin.colour.scheme |
character vector of colors for hexbin plot bins. Default is |
hexbin.colourkey |
logical whether a legend should be drawn for a hexbinplot, defaults to |
hexbin.colourcut |
numeric vector of values covering [0, 1] that determine hexagon colour class boundaries and hexagon legend size boundaries. Alternatively, an integer (<= hexbin.maxcnt) specifying the number of equispaced colourcut values in [0,1]. |
hexbin.mincnt |
integer, minimum count for a hexagon to be plotted. Default is 1. |
hexbin.maxcnt |
integer, maximum count for a hexagon to be plotted. Cells with more counts are not plotted. Default is |
hexbin.xbins |
integer, number of bins in the x direction for hexbin plot. Default is 30. |
hexbin.aspect |
numeric, aspect ratio of hexbin plot to control plot dimensions. Default is 1. |
output.dir |
character directory to save output plots |
filename.prefix |
character prefix for output filenames |
file.extension |
character file extension for output plots |
tidy.titles |
logical whether to reformat PGS plot titles to remove periods |
compute.correlation |
logical whether to compute correlation between PGS and phenotype and display in plot |
corr.legend.corner |
numeric vector indicating the corner of the correlation legend e.g. |
corr.legend.cex |
numeric cex for correlation legend |
include.origin |
logical whether to include the origin (zero) in plot axes |
width |
numeric width of output plot in inches |
height |
numeric height of output plot in inches |
xaxes.cex |
numeric size for x-axis labels |
yaxes.cex |
numeric size for y-axis labels |
titles.cex |
numeric size for plot titles |
point.cex |
numeric size for plot points |
border.padding |
numeric padding for plot borders |
If no output directory is provided, a multipanel lattice plot object is returned, otherwise a plot is written to the indicated path and NULL
is returned.
If no continuous phenotype variables are detected, a warning is issued and NULL
is returned.
set.seed(100); pgs.data <- data.frame( PGS = rnorm(100, 0, 1), continuous.phenotype = rnorm(100, 2, 1) ); temp.dir <- tempdir(); # Basic Plot create.pgs.with.continuous.phenotype.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'basic-plot', phenotype.columns = 'continuous.phenotype', width = 6, height = 6 ); # Plot multiple PGS outputs pgs.data$PGS.with.normalized.missing <- rnorm(100, 1, 1); create.pgs.with.continuous.phenotype.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'multiple-pgs', phenotype.columns = 'continuous.phenotype' ); # Plot multiple phenotypes pgs.data$continuous.phenotype2 <- rnorm(100, 10, 1); create.pgs.with.continuous.phenotype.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'multiple-phenotypes', phenotype.columns = c('continuous.phenotype', 'continuous.phenotype2') );
set.seed(100); pgs.data <- data.frame( PGS = rnorm(100, 0, 1), continuous.phenotype = rnorm(100, 2, 1) ); temp.dir <- tempdir(); # Basic Plot create.pgs.with.continuous.phenotype.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'basic-plot', phenotype.columns = 'continuous.phenotype', width = 6, height = 6 ); # Plot multiple PGS outputs pgs.data$PGS.with.normalized.missing <- rnorm(100, 1, 1); create.pgs.with.continuous.phenotype.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'multiple-pgs', phenotype.columns = 'continuous.phenotype' ); # Plot multiple phenotypes pgs.data$continuous.phenotype2 <- rnorm(100, 10, 1); create.pgs.with.continuous.phenotype.plot( pgs.data, output.dir = temp.dir, filename.prefix = 'multiple-phenotypes', phenotype.columns = c('continuous.phenotype', 'continuous.phenotype2') );
Flip single base pair DNA alleles to their reverse complement. INDEL flipping is not supported.
flip.DNA.allele(alleles, return.indels.as.missing = FALSE)
flip.DNA.allele(alleles, return.indels.as.missing = FALSE)
alleles |
A character vector of DNA alleles. |
return.indels.as.missing |
A logical value indicating whether to return NA for INDEL alleles. Default is |
A character vector of flipped DNA alleles. INDEL alleles are returned as is unless return.indels.as.missing
is TRUE
.
alleles <- c('A', 'T', 'C', 'G', 'ATG', NA); flip.DNA.allele(alleles);
alleles <- c('A', 'T', 'C', 'G', 'ATG', NA); flip.DNA.allele(alleles);
Format chromosome names according to user specifications.
## S3 method for class 'chromosome.notation' format(chromosome, chr.prefix, numeric.sex.chr)
## S3 method for class 'chromosome.notation' format(chromosome, chr.prefix, numeric.sex.chr)
chromosome |
A character vector of chromosome names. |
chr.prefix |
A logical indicating whether the 'chr' prefix should be used when formatting chromosome name. |
numeric.sex.chr |
A logical indicating whether the sex chromosomes should be formatted numerically, as opposed to alphabetically. |
A character vector of chromosome names formatted according to user specifications.
numeric.chr <- c(1,2,23,24); chr.with.prefix <- c('chr1', 'chr2', 'chrX', 'chrY'); format.chromosome.notation(numeric.chr, chr.prefix = TRUE, numeric.sex.chr = FALSE); format.chromosome.notation(chr.with.prefix, chr.prefix = FALSE, numeric.sex.chr = TRUE);
numeric.chr <- c(1,2,23,24); chr.with.prefix <- c('chr1', 'chr2', 'chrX', 'chrY'); format.chromosome.notation(numeric.chr, chr.prefix = TRUE, numeric.sex.chr = FALSE); format.chromosome.notation(chr.with.prefix, chr.prefix = FALSE, numeric.sex.chr = TRUE);
Calculate percentiles and report decile and quartile ranks for a vector of polygenic scores
get.pgs.percentiles(pgs, n.percentiles = NULL)
get.pgs.percentiles(pgs, n.percentiles = NULL)
pgs |
numeric vector of polygenic scores |
n.percentiles |
integer number of percentiles to calculate (optional) |
data frame with columns for percentile, decile, quartile, and optional n.percentiles
x <- rnorm(100); get.pgs.percentiles(x, n.percentiles = 20);
x <- rnorm(100); get.pgs.percentiles(x, n.percentiles = 20);
Import a PGS weight file formatted according to PGS catalog guidelines, and prepare for PGS application with apply.polygenic.score()
.
import.pgs.weight.file(pgs.weight.path, use.harmonized.data = TRUE)
import.pgs.weight.file(pgs.weight.path, use.harmonized.data = TRUE)
pgs.weight.path |
A character string indicating the path to the pgs weight file. |
use.harmonized.data |
A logical indicating whether the file should be formatted to indicate harmonized data columns for use in future PGS application. |
A list containing the file metadata and the weight data.
# Example pgs weight file pgs.weight.path <- system.file( 'extdata', 'PGS000662_hmPOS_GRCh38.txt.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); import.pgs.weight.file(pgs.weight.path); # Note, harmonized data is used by default. To disable set `use.harmonized.data = FALSE` import.pgs.weight.file(pgs.weight.path, use.harmonized.data = FALSE);
# Example pgs weight file pgs.weight.path <- system.file( 'extdata', 'PGS000662_hmPOS_GRCh38.txt.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); import.pgs.weight.file(pgs.weight.path); # Note, harmonized data is used by default. To disable set `use.harmonized.data = FALSE` import.pgs.weight.file(pgs.weight.path, use.harmonized.data = FALSE);
A wrapper for the VCF import function in the vcfR package that formats VCF data for PGS application with apply.polygenic.score()
.
import.vcf(vcf.path, info.fields = NULL, format.fields = NULL, verbose = FALSE)
import.vcf(vcf.path, info.fields = NULL, format.fields = NULL, verbose = FALSE)
vcf.path |
A character string indicating the path to the VCF file to be imported. |
info.fields |
A character vector indicating the INFO fields to be imported. |
format.fields |
A character vector indicating the FORMAT fields to be imported. |
verbose |
A logical indicating whether verbose output should be printed by vcfR. |
A list containing a tibble of VCF meta data and a tibble data.frame containing the parsed VCF data in long form.
# Example VCF vcf <- system.file( 'extdata', 'HG001_GIAB.vcf.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); vcf.data <- import.vcf(vcf.path = vcf);
# Example VCF vcf <- system.file( 'extdata', 'HG001_GIAB.vcf.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); vcf.data <- import.vcf(vcf.path = vcf);
Parse metadata from a PGS input file header.
parse.pgs.input.header(pgs.weight.path)
parse.pgs.input.header(pgs.weight.path)
pgs.weight.path |
A character string indicating the path to the pgs weight file. |
A data frame containing the metadata from the file header.
# Example pgs weight file pgs.weight.path <- system.file( 'extdata', 'PGS000662_hmPOS_GRCh38.txt.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); parse.pgs.input.header(pgs.weight.path);
# Example pgs weight file pgs.weight.path <- system.file( 'extdata', 'PGS000662_hmPOS_GRCh38.txt.gz', package = 'ApplyPolygenicScore', mustWork = TRUE ); parse.pgs.input.header(pgs.weight.path);
Phenotype data variables are automatically classified as continuous or binary and a simple linear regression or logistic regression, respectively, is run between the polygenic score and each phenotype.
Categorical phenotypes with more than two category are ignored.
If a binary variable is not formatted as a factor, it is converted to a factor using as.factor()
defaults. For logistic regression, the first level is classified as "failure" and the second "success" by glm()
defaults.
run.pgs.regression(pgs, phenotype.data)
run.pgs.regression(pgs, phenotype.data)
pgs |
numeric vector of polygenic scores |
phenotype.data |
data.frame of phenotypes |
data frame with columns for phenotype, model, beta, se, p.value, r.squared, and AUC
set.seed(200); pgs <- rnorm(200, 0, 1); phenotype.data <- data.frame( continuous.pheno = rnorm(200, 1, 1), binary.pheno = sample(c(0, 1), 200, replace = TRUE) ); run.pgs.regression(pgs, phenotype.data);
set.seed(200); pgs <- rnorm(200, 0, 1); phenotype.data <- data.frame( continuous.pheno = rnorm(200, 1, 1), binary.pheno = sample(c(0, 1), 200, replace = TRUE) ); run.pgs.regression(pgs, phenotype.data);
A utility function that writes the two data frames outputted by apply.polygenic.score to two tab-delimited text files.
write.apply.polygenic.score.output.to.file( apply.polygenic.score.output, output.dir, file.prefix = NULL )
write.apply.polygenic.score.output.to.file( apply.polygenic.score.output, output.dir, file.prefix = NULL )
apply.polygenic.score.output |
list of two data frames: |
output.dir |
character string of the path to write both output files |
file.prefix |
character string of the file prefix to use for both output files |