Introduction to bedr package

Author(s): Syed Haider, Daryl Waggott, Emilie Lalonde, Clement Fung, Paul C. Boutros

Dated: 2024-10-29

Introduction

The bedr package is a suite of tools for genomic interval processing. The philosophy is to wrap existing best practice bioinformatic software in order to provide a unifying analysis environment within R. bedr should be considered complimentary to native implementations of interval processing such GenomicRanges.

The advantages to this approach include:

  • Speed. Algorithms such as intersections have been heavily optimized in other software.
  • Size. The operations happen outside of R and therefore are not constrained by memory to the same extent that R is. Moreover, bedr allows for a hybrid analysis incorporating both R objects and huge on disk files.
  • Consistency. A vast number of standardized pipelines already use these tools, therefore simplifying implementation and integration.
  • Variety. Multiple tools “engines” are wrapped allowing for flexibility in choosing the ideal operation or cryptic functionality.
  • Visualization. bedr outputs are standardised and readily support various genomic visualisations such as genomic tracks and euler diagrams.
  • Abstraction. The existing utilities are only a subset of the native functionality found within each of the engines. To directly access the full extent of options one can directly call the bedr including the required methods and parameters.

Third Party Tools

The current implementation focuses on three excellent tools. For specifics on functionality please visit there online documentation, primary citation and biostar posts. To gain the functionality of these analytical engines you will need to have the programs installed and in your default PATH. In future versions the source code for these dependencies may be distributed together.

  1. bedtools docs and source
  2. bedops docs and source
  3. tabix docs and source

General Region Utilities

  • is.valid.region
  • bedr.sort.region & is.sorted.region
  • bedr.merge.region & is.merged.region
  • in.region & order.region
  • bedr.subtract.region
  • flank.region & grow.region
  • intersectRegion
  • bedr.join.region & bedr.join.multiple.region
  • get.RegionSize
  • get.fastaRegion
  • read.vcf
  • groupby
  • test.region.similarity (jaccard & reldist)

These utilities are the main components of the package and can be combined to perform powerful genome arithmetic. Description of these utilities and associated parameters can be found in R docs. Working examples of some of the key functions is covered below:

Load bedr Library

Loading of bedr package will indicate the presensece of bedtools, bedops and tabix libraries on your system. If these are not present, there is very little value in using bedr package. If you have installed these in a non-standard path/directory of your computer, please add these to PATH environment variable.

# load bedr library
library("bedr");
## 
## 
## ######################
## #### bedr v1.0.7 ####
## ######################
## 
## checking binary availability...
##   * Checking path for bedtools... FAIL
##   * Checking path for bedops... FAIL
##   * Checking path for tabix... FAIL
## tests and examples will be skipped on R CMD check if binaries are missing

N.B. Log info printed through verbose = T (default, in almost all bedr methods) should be carefully assessed. Often, status=FAIL may just highlight a potential problem, and hence code continues to execute without a graceful failure. Therefore, please do read the log messages carefully.

Validate Region

First check if the regions are valid. This involves checking for “chr” prefix, data types, end > start position and for compliance to bed formats zero based start position. All checks can be turned off as required. The “chr” check is useful due to various human reference formats (NCBI vs UCSC) having different standards. This can result in unexpected results if comapring across specifications. Similarly, bed format uses a zero based start postion, but vcf’s use a one based position. Therefore, a snp at position 100 would be chr1:100-100 in one based but chr1:99-100 in zero based format. Another common mistake relates to overlaps between adjacent intervals, for example in zero based setups chr1:10-100 does not intersect with chr1:100-110.

if (check.binary("bedtools")) {

    # get example regions
    index <- get.example.regions();
    a <- index[[1]];
    b <- index[[2]];

    # region validation
    is.a.valid  <- is.valid.region(a);
    is.b.valid  <- is.valid.region(b);
    a <- a[is.a.valid];
    b <- b[is.b.valid];
        
    # print
    cat(" REGION a: ", a, "\n");
    cat(" REGION b: ", b, "\n");
    }
##   * Checking path for bedtools... FAIL

Sort Region

Generally, it’s a good idea to confirm that you’ve sorted your inputs to avoid unexpected results and take advantage of optimized algorithms. For example merging and clustering require adjacent intervals. is.sorted.region is convenient for explicit evaluation although it’s done internally for core operations.

if (check.binary("bedtools")) {

    # check if already sorted
    is.sorted <- is.sorted.region(a);

    # sort lexographically
    a.sort <- bedr.sort.region(a);

    # sort naturally
    a.sort.natural <- bedr.sort.region(a, method = "natural");

    # sort - explicit call using primary API function bedr()
    b.sort <- bedr(
        engine = "bedtools", 
        input = list(i = b), 
        method = "sort", 
        params = ""
        );

    # print
    cat(" REGION a: ", a.sort, "\n");
    cat(" REGION b: ", a.sort.natural, "\n");
    cat(" REGION c: ", b.sort, "\n");
    }
##   * Checking path for bedtools... FAIL

Merge Regions

Similarly, merging adjacent or overlapping regions is often required to avoid redundancy. Performing an intersection/join when you have redundant regions can cause unexpected results and hence best is to merge redundant regions.

if (check.binary("bedtools")) {

    # check if already merged (non-overlapping regions)
    is.merged <- is.merged.region(a.sort);
    is.merged <- is.merged.region(b.sort);

    # merge
    a.merge <- bedr.merge.region(a.sort);

    # merge - explicit call using primary API function bedr()
    b.merge <- bedr(
        engine = "bedtools", 
        input = list(i = b.sort), 
        method = "merge", 
        params = ""
        );

    # print
    cat(" REGION a: ", a.merge, "\n");
    cat(" REGION b: ", b.merge, "\n");
    }
##   * Checking path for bedtools... FAIL

Subtract Region

The subtract utility identifies regions exclusive to one (first) set of regions. For instance, one might be interested in removing non-coding regions from gene regions.

if (check.binary("bedtools")) {

    # subtract
    a.sub1 <- bedr.subtract.region(a.merge, b.merge);

    # subtract - explicit call using primary API function bedr()
    a.sub2 <- bedr(
        input = list(a = a.merge, b = b.merge), 
        method = "subtract", 
        params = "-A"
        );

    # print
    cat(" REGION a - sub1: ", a.sub1, "\n");
    cat(" REGION a - sub2: ", a.sub2, "\n");
    }
##   * Checking path for bedtools... FAIL

in Region

Finding genomic regions which overlap partially with another set of regions is an extremely useful requirement for various sequence analysis tasks e.g finding genes/SNPs which may fall in particular segments (e.g chromosomal bands).

if (check.binary("bedtools")) {

    # check if present in a region
    is.region <- in.region(a.merge, b.merge);

    # or alternatively R-like in command
    is.region <- a.merge %in.region% b.merge

    # print
    cat(" is.region: ", is.region, "\n");
    }
##   * Checking path for bedtools... FAIL

Intersect/Join Regions

The intersect/join function identifies sub-regions of first set which overlap with the second set of regions, and returns a table with chromosome name along with coordinates of overlapping sub-region. The bedr.join.multiple.region function creates an intersect table for multiple regions displaying the overlapping regions along with a fast access truth table for all versus all regions.

if (check.binary("bedtools")) {

    # intersect / join
    a.int1 <- bedr.join.region(a.merge, b.merge);
    a.int2 <- bedr(
        input = list(a = a.sort, b = b.sort), 
        method = "intersect", 
        params = "-loj -sorted"
        );

    # multiple join
    d <- get.random.regions(15, chr="chr1", sort = TRUE);
    a.mult <- bedr.join.multiple.region(
        x = list(a.merge, b.merge, bedr.sort.region(d))
        );

    # print
    cat(" REGION a intersect: \n"); print(a.int1); cat("\n");
    cat(" REGION multi (a,b,c) intersect: \n"); print(a.mult); cat("\n");
    }
##   * Checking path for bedtools... FAIL

Statistically Quantify Regions’ Similarity (jaccard and reldist)

When comparing (pairwise) the extent of overlap between a large collection of genomic regions, it becomes necessary to report the similarity (intersect) using a single quantitative measure. Bedtools implements two such statitics; jaccard and reldist (Favorov A et al.) which can be called through bedr as shown below:

if (check.binary("bedtools")) {

    # compare a and b set of sequences
    jaccard.stats <- jaccard(a.sort, b.sort);
    reldist.stats <- reldist(a.sort, b.sort);

    # print
    cat(" JACCARD a & b: \n"); print(jaccard.stats); cat("\n");
    cat(" RELDIST a & b: \n"); print(reldist.stats); cat("\n");

    # even better way to run both jaccard and reldist, as well as estimate P value through random permutations
    jaccard.reldist.stats <- test.region.similarity(a.sort, b.sort, n = 40);

    cat(" JACCARD/RELDIST a & b: \n"); print(jaccard.reldist.stats); cat("\n");
    }
##   * Checking path for bedtools... FAIL

GroupBy

For multiple genomic features/regions mapping to the exact locus (e.g SNPs), it is often useful to collapse their additional annotations into a single record. To collapse, one may be interested in using quantitative operations such as sum, mean, median (see bedtools for all the options) or simple concatenate multiple entries.

if (check.binary("bedtools")) {

    # read example regions file
    regions.file <- system.file("extdata/example-a-region.bed", package = "bedr");
    a <- read.table(regions.file, header = FALSE, stringsAsFactors = FALSE);
    colnames(a) <- c("a.CHROM", "a.START", "a.END", "Score");

    # sort
    a <- bedr.sort.region(a);

    # group by (on first three columns) the score in column 4. Concatenate scores
    a.collapsed <- bedr(
        input = list(i = a), 
        method = "groupby", 
        params = "-g 1,2,3 -c 4 -o collapse"
        );

    # group by (on first three columns) the score in column 4. Compute mean
    a.mean <- bedr(
        input = list(i = a), 
        method = "groupby", 
        params = "-g 1,2,3 -c 4 -o mean"
        );

    # print
    cat(" REGION a groupby (collapsed): \n"); print(a.collapsed); cat("\n");
    cat(" REGION a groupby (mean): \n"); print(a.mean); cat("\n");
    }
##   * Checking path for bedtools... FAIL

Example Workflow 1: Compare Variant Callers

The example workflow below reads two VCF files from different variant callers, further limiting to structural variants only and finally identifying common calls.

if (check.binary("bedtools")) {
    VALID.SV.TYPES <- c('BND', 'CNV', 'DEL', 'DUP', 'INS', 'INV');
    POSITION.COLUMNS <- c('CHROM', 'POS', 'END');

    callerA.filename <- system.file("extdata/callerA.vcf.gz", package = "bedr");
    callerB.filename <- system.file("extdata/callerB.vcf.gz", package = "bedr");

    # read the VCF file
    callerA <- read.vcf(callerA.filename, split.info = TRUE)$vcf;
    callerB <- read.vcf(callerB.filename, split.info = TRUE)$vcf;

    # focus on SVs
    callerA <- callerA[which(callerA$SVTYPE %in% VALID.SV.TYPES), ];
    callerB <- callerB[which(callerB$SVTYPE %in% VALID.SV.TYPES), ];

    # convert to zero-based coordinates
    callerA$POS <- callerA$POS - 1;
    callerB$POS <- callerB$POS - 1;

    # find all overlapping pairs, retrieve size of overlap (bp)
    overlapping.pairs <- bedr.join.region(
        callerA[, POSITION.COLUMNS],
        callerB[, POSITION.COLUMNS],
        report.n.overlap = TRUE,
        check.chr = FALSE
        );
    colnames(overlapping.pairs) <- c(
        'a.CHROM', 'a.POS', 'a.END',
        'b.CHROM', 'b.POS', 'b.END',
        'Overlap'
        );
    overlapping.pairs$b.POS <- as.numeric(overlapping.pairs$b.POS);
    overlapping.pairs$b.END <- as.numeric(overlapping.pairs$b.END);

    # compute a distance between overlapping pairs
    min.breakpoint.distances <- cbind(
        overlapping.pairs$a.POS - overlapping.pairs$b.POS,
        overlapping.pairs$a.END - overlapping.pairs$b.END
        );

    min.breakpoint.distances <- apply(
        abs(min.breakpoint.distances),
        1,
        min
        );
    a.length <- overlapping.pairs$a.END - overlapping.pairs$a.POS;
    b.length <- overlapping.pairs$b.END - overlapping.pairs$b.POS;

    overlapping.pairs$distance  <- (min.breakpoint.distances + abs(a.length - b.length)) / 2;

    # print
    cat(" OVERLAPPING PAIRS: \n"); print(head(overlapping.pairs)); cat("\n");
    }
##   * Checking path for bedtools... FAIL

Example Workflow 2: Exome Target Processing

This example builds on the Example Workflow 1, and annotates the structural variants with RefSeq gene identifiers if the variant falls within the gene body for instance. Limiting to chromosome 1 and 2 only. Lastly compare which genes were common between the calls generated by two variant callers.

if (check.binary("bedtools")) {

    # get Human RefSeq genes (Hg19) in BED format
    refseq.file <- system.file("extdata/ucsc.hg19.RefSeq.chr1-2.txt.gz", package = "bedr");
    refseq <- read.table(refseq.file, header = FALSE, stringsAsFactors = FALSE);

    # sort Refseq and remove chr prefix
    refseq.sorted <- bedr.sort.region(refseq[, 1:4]);
    colnames(refseq.sorted) <- c(POSITION.COLUMNS, "Gene");
    refseq.sorted$CHROM <- gsub("^chr", "", refseq.sorted$CHROM);

    # reuse the SV calls from workflow 1 and add gene identifiers to callerA
    callerA.annotated <- bedr.join.region(
        callerA[, POSITION.COLUMNS],
        refseq.sorted,
        report.n.overlap = TRUE,
        check.chr = FALSE
        );
    colnames(callerA.annotated) <- c(
        'a.CHROM', 'a.POS', 'a.END',
        'Gene.CHROM', 'Gene.POS', 'Gene.END', 'Gene', 'Overlap'
        );

    # reuse the SV calls from workflow 1 and add gene identifiers to callerB
    callerB.annotated <- bedr.join.region(
        callerB[, POSITION.COLUMNS],
        refseq.sorted,
        report.n.overlap = TRUE,
        check.chr = FALSE
        );
    colnames(callerB.annotated) <- c(
        'b.CHROM', 'b.POS', 'b.END',
        'Gene.CHROM', 'Gene.POS', 'Gene.END', 'Gene', 'Overlap'
        );

    # reinstate chr prefix to chromosome names
    callerA.annotated$a.CHROM <- paste('chr', callerA.annotated$a.CHROM, sep = "");
    callerB.annotated$b.CHROM <- paste('chr', callerB.annotated$b.CHROM, sep = "");

    # print
    cat(" CALLER A GENES (chr 1,2): \n"); print(head(callerA.annotated)); cat("\n");
    cat(" CALLER B GENES (chr 1,2): \n"); print(head(callerB.annotated)); cat("\n");
    }
##   * Checking path for bedtools... FAIL

Next, you may want to collapse multiple gene entries in one row. This can be done using groupby utility.

options("warn" = -1);
if (check.binary("bedtools")) {

    # collapse column 7 (Gene) against unique composite key (column 1, 2 and 3)
    callerA.annotated.grouped <- bedr(
        input = list(i = callerA.annotated), 
        method = "groupby", 
        params = "-g 1,2,3 -c 7 -o collapse"
        );

    # collapse column 7 (Gene) against unique composite key (column 1, 2 and 3)
    callerB.annotated.grouped <- bedr(
        input = list(i = callerB.annotated), 
        method = "groupby", 
        params = "-g 1,2,3 -c 7 -o collapse"
        );

    # print
    cat(" CALLER A GENES (chr 1,2) GROUPED: \n"); print(head(callerA.annotated.grouped)); cat("\n");
    cat(" CALLER B GENES (chr 1,2) GROUPED: \n"); print(head(callerB.annotated.grouped)); cat("\n");
    }
##   * Checking path for bedtools... FAIL

You may also want to summarise the counts of overlapping and unique sub-regions. This can be done by bedr.plot.region method which displays the summary as a venn diagram.

if (check.binary("bedtools")) {

    # merge and plot overlapping regions between the two callers with genes 
    # on chromosome 1 and 2
    callerA.merged <- callerA.annotated[, c('a.CHROM', 'a.POS', 'a.END')];
    callerB.merged <- callerB.annotated[, c('b.CHROM', 'b.POS', 'b.END')];
    callerA.merged <- bedr.merge.region(callerA.merged);
    callerB.merged <- bedr.merge.region(callerB.merged);

    # plot (sub-regions exclusive to a and b, and sub-regions in common)
    bedr.plot.region(
        input = list(
            a = callerA.merged, 
            b = callerB.merged
            ),
        params = list(lty = 2, label.col = "black", main = "Genes Overlap"),
        feature = 'interval',
        verbose = FALSE
        );
    }

The venn diagram shows the number of sub-regions on chromosome 1 and 2 which are common between the two callers, and the ones which are exclusive to the callers.

Example Workflow 3: Copy Number Recurrence

In this example, we show bedr can be used to process CNA segmented data to estimate percent genome altered and identify minimal common regions. The copy-number data used is a subset of TCGA COAD study. Note: For efficiency reason, this is a very coarse example of calling recurrent copy number regions, and only to demonstrate how bedr can be used alongside other copy number tools.

if (check.binary("bedtools")) {

    # read copy number segmented (regions) data
    cna.file <- system.file("extdata/CNA.segmented.txt.gz", package = "bedr");
    cna.data <- read.table(cna.file, header = TRUE, stringsAsFactors = FALSE);
    cna.data$Chromosome <- paste("chr", cna.data$Chromosome, sep = "");

    # check if regions are valid
    valid.segments <- is.valid.region(cna.data[, c("Chromosome", "Start", "End")]);
    cat(" VALID REGIONS: ", length(which(valid.segments) ==  TRUE), "/", length(valid.segments), "\n");

    # restrict to copy number regions with |log-ratio| > 0.10
    cna.data <- cna.data[which(abs(cna.data$Segment_Mean) > 0.10), ];

    # create a list data-structure for every patient's copy number data
    # and sort them
    cna.data.gain <- list();
    cna.data.loss <- list();
    pga.gain <- list();
    pga.loss <- list();
    sample.ids <- unique(cna.data$Sample);
    hg19.size <- 3137161264;
    for (sample.id in sample.ids) {

        # extract sample specific gains
        gain.index <- which(cna.data$Sample == sample.id & cna.data$Segment_Mean > 0);
        if (length(gain.index) > 0) {

            cna.data.gain[[sample.id]] <- cna.data[
                gain.index,
                2:4
                ];

            # sort
            cna.data.gain[[sample.id]] <- bedr.sort.region(
                x = cna.data.gain[[sample.id]],
                method = "natural",
                verbose = FALSE
                );

            # estimate percent genome gained
            pga.gain[[sample.id]] <- sum(
                apply(
                    cna.data.gain[[sample.id]],
                    1,
                    FUN = function(x) { return( (as.numeric(x[3]) - as.numeric(x[2])) ); }
                    )
                );
            pga.gain[[sample.id]] <- pga.gain[[sample.id]]/hg19.size*100;
            }

        # extract sample specific losses
        loss.index <- which(cna.data$Sample == sample.id & cna.data$Segment_Mean < 0);
        if (length(loss.index) > 0) {

            cna.data.loss[[sample.id]] <- cna.data[
                loss.index,
                2:4
                ];

            # sort
            cna.data.loss[[sample.id]] <- bedr.sort.region(
                x = cna.data.loss[[sample.id]],
                method = "natural",
                verbose = FALSE
                );

            # estimate percent genome loss
            pga.loss[[sample.id]] <- sum(
                apply(
                    cna.data.loss[[sample.id]],
                    1,
                    FUN = function(x) { return( (as.numeric(x[3]) - as.numeric(x[2])) ); }
                    )
                );
            pga.loss[[sample.id]] <- pga.loss[[sample.id]]/hg19.size*100;
            }
        }
    }

You may want to see the distribution of Percent Genome Altered (Gain and Loss), as shown below:

if (check.binary("bedtools")) {

    # set graphics params
    par(mfrow = c(1, 2), las = 1);

    # plot histograms of Gain and Loss frequencies
    hist(unlist(pga.gain), xlab = "Percent Gain", main = "Percent Genome Altered (Gain)");
    hist(unlist(pga.loss), xlab = "Percent Loss", main = "Percent Genome Altered (Loss)");
    }

Lets find minimal common regions either gained and deleted across patients:

if (check.binary("bedtools")) {

    # find minimal common regions (gain)
    mcr.gain <- bedr.join.multiple.region(
        x = cna.data.gain,
        species = "human",
        build = "hg19",
        check.valid = FALSE,
        check.sort = FALSE,
        check.merge = FALSE,
        verbose = FALSE
        );

    # find minimal common regions (loss)
    mcr.loss <- bedr.join.multiple.region(
        x = cna.data.loss,
        species = "human",
        build = "hg19",
        check.valid = FALSE,
        check.sort = FALSE,
        check.merge = FALSE,
        verbose = FALSE
        );

    # reorder by frequency of recurrence
    mcr.gain <- mcr.gain[order(as.numeric(mcr.gain$n.overlaps), decreasing = TRUE), ];
    mcr.loss <- mcr.loss[order(as.numeric(mcr.loss$n.overlaps), decreasing = TRUE), ];

    # print
    cat(" RECURRENT CNA GAINS \n"); print(head(mcr.gain[, 1:5])); cat("\n");
    cat(" RECURRENT CNA LOSSES \n"); print(head(mcr.loss[, 1:5])); cat("\n");
    }
##   * Checking path for bedtools... FAIL

Summary

In a nutshell, bedr offers an R-style interface to bedtools and bedops, and therefore, the extensive use cases and workflows manipulating genomic intervals using these tools can be implemented using bedr. If you would like to contribute a use case or have a feature request, please do not hesitate to contact us.

Acknowledgements

The examples here are in whole or part based upon data generated by The Cancer Genome Atlas pilot project established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at http://cancergenome.nih.gov/.

    • Introduction to bedr package
    • Table of Contents
    • Introduction
    • Third Party Tools
    • General Region Utilities
    • Example Workflow 1: Compare Variant Callers
    • Example Workflow 2: Exome Target Processing
    • Example Workflow 3: Copy Number Recurrence
    • Summary
    • Acknowledgements