Title: | Genomic Region Processing using Tools Such as 'BEDTools', 'BEDOPS' and 'Tabix' |
---|---|
Description: | Genomic regions processing using open-source command line tools such as 'BEDTools', 'BEDOPS' and 'Tabix'. These tools offer scalable and efficient utilities to perform genome arithmetic e.g indexing, formatting and merging. bedr API enhances access to these tools as well as offers additional utilities for genomic regions processing. |
Authors: | Syed Haider [aut], Daryl Waggott [aut], Emilie Lalonde [ctb], Clement Fung [ctb], Paul C. Boutros [aut, cre, cph] |
Maintainer: | Paul C. Boutros <[email protected]> |
License: | GPL-2 |
Version: | 1.0.7 |
Built: | 2024-10-29 03:45:53 UTC |
Source: | https://github.com/cran/bedr |
A bedtools wrapper that allows using a mix of internal R objects and external R files. A number of convenience functions are provided for simplifying analysis workflows in R.
Package: | bedr |
Type: | Package |
Version: | 1.0.4 |
Date: | 2016-09-19 |
License: | GPL2 |
Daryl Waggott
Syed Haider
checks if regions in object a are found in object b
x %in.region% y
x %in.region% y
x |
first region index in the form chr:start-stop. regions in this index will be checked for intersection in the values of the second index. |
y |
second region index. |
The function can also be called using syntax similar to the %in% operator, for example "region1 %in.region% region2"
Returns a logical vector the length of x.
Daryl Waggott
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr.sort.region(a); b <- bedr.sort.region(b); d <- a %in.region% b }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr.sort.region(a); b <- bedr.sort.region(b); d <- a %in.region% b }
convert a dataframe in bed format to an index string
bed2index(x, sort = TRUE)
bed2index(x, sort = TRUE)
x |
a object region object or index |
sort |
should the index be sorted |
Returns a vector of string based genomic regions
Daryl Waggott
test.regions <- get.random.regions(10); bed2index(test.regions);
test.regions <- get.random.regions(10); bed2index(test.regions);
convert bed to vcf
bed2vcf(x, filename = NULL, zero.based = TRUE, header = NULL, fasta = NULL)
bed2vcf(x, filename = NULL, zero.based = TRUE, header = NULL, fasta = NULL)
x |
the input bed object |
filename |
a filename to write to. |
zero.based |
is the file zero based i.e. bed format. defaults to true. |
header |
a list of things to put in the header. repeated elements such as INFO, FILTER, FORMAT can be put in data.frames. |
fasta |
build of the genome in fasta format |
Daryl Waggott
## Not run: bed2vcf(x) ## End(Not run)
## Not run: bed2vcf(x) ## End(Not run)
Main bedtools wrapper function.
bedr(engine = "bedtools", params = NULL, input = list(), method = NULL, tmpDir = NULL, deleteTmpDir = TRUE, outputDir = NULL, outputFile = NULL, check.chr = TRUE, check.zero.based = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE );
bedr(engine = "bedtools", params = NULL, input = list(), method = NULL, tmpDir = NULL, deleteTmpDir = TRUE, outputDir = NULL, outputFile = NULL, check.chr = TRUE, check.zero.based = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE );
engine |
What analytical engine to use i.e. bedtools, bedops, tabix, unix |
params |
A string that includes all the extra parameters and arguments to the bedtools commmand. For example if you wanted to do a left outer join you would specificy method as intersect and use params = c("-loj -header"). If you leave input and method as defaults then this is this string represents the full command. |
input |
A list of input items to be used by bedtools. Each item should be named by its parameter name in bedtools for example input = list(a=xxx, b=yyy, i=zzz). Items can be R objects or external files. R objects need to be in bed format i.e. have chr, start, stop as the first three columns, or, have an position index as the first column or rowname i.e. chr1:100-1000. |
method |
What bedtools method. This can be intersect, sort, merge etc. See bedtools documentation for specifcis. |
tmpDir |
The directory to be used for writing files |
deleteTmpDir |
Should tmp files be deleted. helpful for diagnostics. |
outputDir |
The output directory. Only used if outputFile is specified. It defaults to the current working directory. |
outputFile |
The name of the output file. If this is specified the output will be sent to a file not an R object |
check.chr |
check for chr prefix |
check.zero.based |
check for zero based coordinates |
check.valid |
do all region integrity checks |
check.sort |
check if region is sorted |
check.merge |
check if region is merged |
verbose |
Should messages be printed to screen. |
The output of command with some parsing to keep it consistent with the input.
Daryl Waggott
iranges
if (check.binary("bedtools")) { set.seed(666) index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; ### check is.a.valid <- is.valid.region(a); is.b.valid <- is.valid.region(b); a <- a[is.a.valid]; b <- b[is.b.valid]; ### sort is.sorted <- is.sorted.region(a); a.sort1 <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b.sort1 <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); a.sort2 <- bedr(engine = "bedops", input = list(i = a), method = "sort", params = ""); a.sort3 <- bedr.sort.region(a); a.sort4 <- bedr.sort.region(a, engine = "unix", method = "natural"); a.sort5 <- bedr.sort.region(a, engine = "R", method = "natural"); ### merge is.merged <- is.merged.region(a.sort1); a.merge1 <- bedr(engine = "bedtools", input = list(i = a.sort1), method = "merge", params = ""); b.merge1 <- bedr(engine = "bedtools", input = list(i = b.sort1), method = "merge", params = ""); a.merge2 <- bedr(engine = "bedops", input = list(i = a.sort1), method = "merge", params = ""); # a.merge3 <- bedr.merge.region(a); this will throw an error b/c region is not sorted ### subtract a.sub1 <- bedr(input = list(a = a.merge1, b = b.merge1), method = "subtract", params=""); a.sub2 <- bedr.subtract.region(a.merge1, b.merge1); ### in.region is.region <- in.region(a.merge1, b.merge1); #is.region <- a.merge1 %in.region% b.merge1 ### intersect # note for bedtools its recommended to bedr.sort.regions before intersect for faster processing # also if regions are not merged this can cause unexpected behaviour a.int1 <- bedr(input = list(a = a.sort1, b = b.sort1), method = "intersect", params = "-loj"); a.int1 <- bedr(input = list(a = a.sort1, b = b.sort1), method="intersect",params ="-loj -sorted"); a.int2 <- bedr(input = list(a = a.merge1, b = b.merge1), method="intersect",params="-loj -sorted"); a.int3 <- bedr.join.region(a.merge1, b.merge1); ### multiple join d <- get.random.regions(100, chr="chr1", sort = TRUE); a.mult <- bedr.join.multiple.region(x = list(a.merge1,b.merge1,bedr.sort.region(d))); ## Not run: ### groupby # note the "g" column number is based on bed format i.e. first three columns chr, start, stop # note the use of first, first, last on the region columns i.e. the union of the regions # note currently missing values are not dealt with in bedtools. also the 5th column is # assumed to be "score" and gets a default "-1" not a "." cnv.gene <- bedr( input = list(i=cnv.gene), method = "groupby", params = paste( "-g 16 -c ", paste(1:15, collapse = ","), " -o ", "first,first,last, ", paste(rep("sum",12), collapse = ","), sep = "" ) ); ### example 1 ### workflow adding gene names to exome sequencing target file # download refseq genes from ucsc or query biomart for ensemble gene names. # format them in basic bed format. # sort, merge target # sort, merge -nms target. Overlapping genes/features get merged. # this may not be ideal if there are some really big features. # intersect -loj target, genes. # alternatively, do not merge the target and apply the merge after the intersect. # this will provide precision at the level of the exon. ## End(Not run) }
if (check.binary("bedtools")) { set.seed(666) index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; ### check is.a.valid <- is.valid.region(a); is.b.valid <- is.valid.region(b); a <- a[is.a.valid]; b <- b[is.b.valid]; ### sort is.sorted <- is.sorted.region(a); a.sort1 <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b.sort1 <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); a.sort2 <- bedr(engine = "bedops", input = list(i = a), method = "sort", params = ""); a.sort3 <- bedr.sort.region(a); a.sort4 <- bedr.sort.region(a, engine = "unix", method = "natural"); a.sort5 <- bedr.sort.region(a, engine = "R", method = "natural"); ### merge is.merged <- is.merged.region(a.sort1); a.merge1 <- bedr(engine = "bedtools", input = list(i = a.sort1), method = "merge", params = ""); b.merge1 <- bedr(engine = "bedtools", input = list(i = b.sort1), method = "merge", params = ""); a.merge2 <- bedr(engine = "bedops", input = list(i = a.sort1), method = "merge", params = ""); # a.merge3 <- bedr.merge.region(a); this will throw an error b/c region is not sorted ### subtract a.sub1 <- bedr(input = list(a = a.merge1, b = b.merge1), method = "subtract", params=""); a.sub2 <- bedr.subtract.region(a.merge1, b.merge1); ### in.region is.region <- in.region(a.merge1, b.merge1); #is.region <- a.merge1 %in.region% b.merge1 ### intersect # note for bedtools its recommended to bedr.sort.regions before intersect for faster processing # also if regions are not merged this can cause unexpected behaviour a.int1 <- bedr(input = list(a = a.sort1, b = b.sort1), method = "intersect", params = "-loj"); a.int1 <- bedr(input = list(a = a.sort1, b = b.sort1), method="intersect",params ="-loj -sorted"); a.int2 <- bedr(input = list(a = a.merge1, b = b.merge1), method="intersect",params="-loj -sorted"); a.int3 <- bedr.join.region(a.merge1, b.merge1); ### multiple join d <- get.random.regions(100, chr="chr1", sort = TRUE); a.mult <- bedr.join.multiple.region(x = list(a.merge1,b.merge1,bedr.sort.region(d))); ## Not run: ### groupby # note the "g" column number is based on bed format i.e. first three columns chr, start, stop # note the use of first, first, last on the region columns i.e. the union of the regions # note currently missing values are not dealt with in bedtools. also the 5th column is # assumed to be "score" and gets a default "-1" not a "." cnv.gene <- bedr( input = list(i=cnv.gene), method = "groupby", params = paste( "-g 16 -c ", paste(1:15, collapse = ","), " -o ", "first,first,last, ", paste(rep("sum",12), collapse = ","), sep = "" ) ); ### example 1 ### workflow adding gene names to exome sequencing target file # download refseq genes from ucsc or query biomart for ensemble gene names. # format them in basic bed format. # sort, merge target # sort, merge -nms target. Overlapping genes/features get merged. # this may not be ideal if there are some really big features. # intersect -loj target, genes. # alternatively, do not merge the target and apply the merge after the intersect. # this will provide precision at the level of the exon. ## End(Not run) }
join multiple objects
bedr.join.multiple.region( x = list(), fraction.overlap = 1/1e9, empty = FALSE, missing.values = ".", cluster = FALSE, species = "human", build = "hg19", check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
bedr.join.multiple.region( x = list(), fraction.overlap = 1/1e9, empty = FALSE, missing.values = ".", cluster = FALSE, species = "human", build = "hg19", check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
x |
list of region objects |
fraction.overlap |
proportion of bases to be considered an overlap |
empty |
print rows if no match |
missing.values |
missing value character |
cluster |
TRUE/FALSE for clustering |
species |
species i.e. human or mouse |
build |
genome build to use for empty regions |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
check if region is valid |
check.sort |
check if region is sorted |
check.merge |
check if overlapping regions are merged |
verbose |
messages and checks |
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/multiinter.html
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a.sort <- bedr.sort.region(a); b.sort <- bedr.sort.region(b); d <- get.random.regions(100, chr="chr1", sort = TRUE); a.mult <- bedr.join.multiple.region(x = list(a.sort,b.sort,bedr.sort.region(d))); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a.sort <- bedr.sort.region(a); b.sort <- bedr.sort.region(b); d <- get.random.regions(100, chr="chr1", sort = TRUE); a.mult <- bedr.join.multiple.region(x = list(a.sort,b.sort,bedr.sort.region(d))); }
join two region objects using a left outer join
bedr.join.region( x, y, fraction.overlap = 1/1e9, reciporical = FALSE, report.n.overlap = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
bedr.join.region( x, y, fraction.overlap = 1/1e9, reciporical = FALSE, report.n.overlap = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
x |
object a |
y |
object b |
fraction.overlap |
proportion of overlap to be considered a match |
report.n.overlap |
should the number of overlapping bases be reported |
reciporical |
should the fraction overlap be applied to object b as well |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
check if region is valid |
check.sort |
check if region is sorted |
check.merge |
check if overlapping regions are merged |
verbose |
messages and checks |
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a.sort <- bedr.sort.region(a); b.sort <- bedr.sort.region(b); d <- bedr.join.region(a.sort, b.sort); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a.sort <- bedr.sort.region(a); b.sort <- bedr.sort.region(b); d <- bedr.join.region(a.sort, b.sort); }
merge i.e. collapse overlpaping regions
bedr.merge.region( x, distance = 0, list.names = TRUE, number = FALSE, stratify.by = NULL, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, verbose = TRUE );
bedr.merge.region( x, distance = 0, list.names = TRUE, number = FALSE, stratify.by = NULL, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, verbose = TRUE );
x |
input |
distance |
maximum distance between regions to be merged. defaults to 0 which means overlapping or bookended features. note that you can use negative distances to enforce a minimum overlap. |
list.names |
output list of names for merged items |
number |
output number of merged items |
stratify.by |
a column name indicating the groups to stratify merging within i.e. gene name. merging will not happen between groups. |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
check.sort |
should the sort order be checked |
verbose |
should log messages and checking take place |
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/merge.html
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a.sort <- bedr.sort.region(a); a.merged <- bedr.merge.region(a.sort); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a.sort <- bedr.sort.region(a); a.merged <- bedr.merge.region(a.sort); }
Visualize regions or intervals. e.g. VennDiagrams of intersections of distinct intervals, base pairs and genes.
bedr.plot.region( input, filename = NULL, type = "venn", feature = "interval", fraction.overlap = 0.000000001, group = NULL, params = list(), verbose = TRUE )
bedr.plot.region( input, filename = NULL, type = "venn", feature = "interval", fraction.overlap = 0.000000001, group = NULL, params = list(), verbose = TRUE )
input |
A list of input regions or indices |
filename |
The name of the output image file |
type |
The type of plot. only 'venn' is supported for intersections at the moment. |
feature |
How should the regions be intersected. By unique "interval", "gene", "size" or "other" to use the features in the first item in the input list. |
fraction.overlap |
Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp). |
group |
A grouping parameter for barplots. Possible values include "input", "chr", or a categorical vector of length equal to the sum of the input. |
params |
Additional parameters for plotting or intersecting. See |
verbose |
Include text messages. |
By default a venn diagram is output. If a single input is used then the plot shows the number of unique and collapsed regions after applying a merge.
Plots!
Daryl Waggott
if (check.binary("bedtools")) { # example data a <- get.random.regions(n = 1000, chr = "chr22", size.mean = 10) b <- get.random.regions(n = 1000, chr = "chr22", size.mean = 10) d <- get.random.regions(n = 1000, chr = "chr22", size.mean = 10) e <- get.random.regions(n = 1000, chr = "chr22", size.mean = 10) f <- get.random.regions(n = 1000, chr = "chr22", size.mean = 10) pdf("bedr.plot.region.ex.pdf") # basic venn diagrams bedr.plot.region(input = list(a=a,b=b)) bedr.plot.region(input = list(a=a,b=b,d=d)) #bedr.plot.region(input = list(a=a,b=b,d=d,e=e)) #bedr.plot.region(input = list(a=a,b=b,d=d,e=e,f=f)) ### change venn parameters bedr.plot.region( input = list(a=a,b=b,d=d), params = list(lty = 2, label.col = "black", main = "Region Overlap") ) ### try with different #bedr.plot.region(input = list(a=a,b=b), feature = "gene") #bedr.plot.region(input = list(a=a,b=b), feature = "reference") #bedr.plot.region(input = list(a=a,b=b), feature = "interval") #bedr.plot.region(input = list(a=a,b=b), feature = "cluster") #bedr.plot.region(input = list(a=a,b=b), feature = "bp") dev.off() }
if (check.binary("bedtools")) { # example data a <- get.random.regions(n = 1000, chr = "chr22", size.mean = 10) b <- get.random.regions(n = 1000, chr = "chr22", size.mean = 10) d <- get.random.regions(n = 1000, chr = "chr22", size.mean = 10) e <- get.random.regions(n = 1000, chr = "chr22", size.mean = 10) f <- get.random.regions(n = 1000, chr = "chr22", size.mean = 10) pdf("bedr.plot.region.ex.pdf") # basic venn diagrams bedr.plot.region(input = list(a=a,b=b)) bedr.plot.region(input = list(a=a,b=b,d=d)) #bedr.plot.region(input = list(a=a,b=b,d=d,e=e)) #bedr.plot.region(input = list(a=a,b=b,d=d,e=e,f=f)) ### change venn parameters bedr.plot.region( input = list(a=a,b=b,d=d), params = list(lty = 2, label.col = "black", main = "Region Overlap") ) ### try with different #bedr.plot.region(input = list(a=a,b=b), feature = "gene") #bedr.plot.region(input = list(a=a,b=b), feature = "reference") #bedr.plot.region(input = list(a=a,b=b), feature = "interval") #bedr.plot.region(input = list(a=a,b=b), feature = "cluster") #bedr.plot.region(input = list(a=a,b=b), feature = "bp") dev.off() }
Initialize some config settings for bedr. This includes downloading useful datasets if requested.
bedr.setup(datasets = "all", data.dir = paste0(Sys.getenv("HOME"), "/bedr/data"))
bedr.setup(datasets = "all", data.dir = paste0(Sys.getenv("HOME"), "/bedr/data"))
datasets |
A list of datasets to download. Defaults to "all" i.e. c("refgene","hg19","b37","hugo", "cosmic","clinvar") |
data.dir |
A directory to put the data. Defaults to ~/bedr/data |
The default config file is at ~/bedr/config.yml. It's in yaml format.
Daryl Waggott
## Not run: bedr.setup() ## End(Not run)
## Not run: bedr.setup() ## End(Not run)
Sort and merge regions object in one step
bedr.snm.region( x, method = "lexicographical", distance = 0, list.names = TRUE, number = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, verbose = TRUE )
bedr.snm.region( x, method = "lexicographical", distance = 0, list.names = TRUE, number = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, verbose = TRUE )
x |
a object region object or index |
method |
natural or lexicographic |
distance |
distance between regions to be merged |
list.names |
output list of names for merged items |
number |
output number of merged items |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
verbose |
should log messages and checking take place |
Sorted and merged regions object
Daryl Waggott
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- bedr.snm.region(a); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- bedr.snm.region(a); }
sort a region file
bedr.sort.region( x, method = "lexicographical", engine = "R", chr.to.num = c("X" = 23, "Y" = 24, "M" = 25), check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.merge = TRUE, verbose = TRUE )
bedr.sort.region( x, method = "lexicographical", engine = "R", chr.to.num = c("X" = 23, "Y" = 24, "M" = 25), check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.merge = TRUE, verbose = TRUE )
x |
a region object or index |
method |
natural or lexicographic |
engine |
what analytical engine to use for sorting i.e. bedtools, bedops, gnu unix |
chr.to.num |
chromosome letter names to numbers map. Defaults to Homo sapiens i.e c("X" = 23, "Y" = 24, "M" = 25) |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checked for integerity |
check.merge |
should overlapping regions be checked |
verbose |
should log messages and checking take place |
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/sort.html
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- bedr.sort.region(a); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- bedr.sort.region(a); }
subtracts features or ranges in object b from object a
bedr.subtract.region( x, y, fraction.overlap = 1/1e9, remove.whole.feature = TRUE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
bedr.subtract.region( x, y, fraction.overlap = 1/1e9, remove.whole.feature = TRUE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
x |
item a |
y |
item b |
fraction.overlap |
what portion of A to be considered an overlap |
remove.whole.feature |
should whole feature be removed |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
check.sort |
check if region is sorted |
check.merge |
check if overlapping regions are merged |
verbose |
messages and checks |
Regions exclusive to one object of regions.
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/subtract.html
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); d <- bedr.subtract.region(a,b); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); d <- bedr.subtract.region(a,b); }
outputs text if verbose flag is set
catv(x)
catv(x)
x |
some text string |
Prints the text string
Daryl Waggott
verbose <- TRUE; catv("Hello Universe!"); verbose <- FALSE; catv("Goodbye Universe!")
verbose <- TRUE; catv("Hello Universe!"); verbose <- FALSE; catv("Goodbye Universe!")
check if a binary is in the path. Specifically used for bedtools, bedops and tabix.
check.binary(x = "bedtools", verbose = TRUE)
check.binary(x = "bedtools", verbose = TRUE)
x |
a string referring to a binary/executable i.e. bedtools, bedops, tabix |
verbose |
print log |
Used internally to determine functionality selection options.
TRUE or FALSE
Daryl Waggott
check.binary("bedtools")
check.binary("bedtools")
cluster intervals
cluster.region( x, distance = 0, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, verbose = TRUE )
cluster.region( x, distance = 0, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, verbose = TRUE )
x |
The region |
distance |
maximum distance between regions to be merged. defaults to 0 which means overlapping or bookended features. note that you can use negative distances to enforce a minimum overlap. |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
check.sort |
should regions be checked for sort order |
verbose |
should log messages and checking take place |
clusters adjacent features of a specified distance.
A data.frame in bed format
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/cluster.html
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- cluster.region(a, distance = 0); d <- cluster.region(a, distance = 100); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- cluster.region(a, distance = 0); d <- cluster.region(a, distance = 100); }
checks if an object can be converted into a bed style data.frame then does the conversion.
convert2bed( x, set.type = TRUE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
convert2bed( x, set.type = TRUE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
x |
A region index (i.e. chr1:10-100, ...) either as a vector or row.names/first column of a data.frame. Or a data.frame with the first three columns "chr", s"start", "end" |
set.type |
should the attribute input.type be set. Sometimes it is desirable to avoid setting it when applying intermediate conversion |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checked for integerity |
check.sort |
should the region be checked to see if it is sorted |
check.merge |
should the region be checked for overlapping regions |
verbose |
messages and text |
Very useful to convert data before using other bedr functions
Returns x converted to bedformat, as a data frame
Daryl Waggott
## Not run: a.bed <- convert.to.bed(a) ## End(Not run)
## Not run: a.bed <- convert.to.bed(a) ## End(Not run)
output R objects as tmpfiles
create.tmp.bed.file(x, name = "bedr", tmpDir = NULL)
create.tmp.bed.file(x, name = "bedr", tmpDir = NULL)
x |
region object |
name |
a prefix for the tmp file. |
tmpDir |
where should the temp files be put |
Daryl Waggott
# create tmp file
# create tmp file
Determine input format whether its tabular or bed
determine.input(x, check.chr = FALSE, verbose = TRUE)
determine.input(x, check.chr = FALSE, verbose = TRUE)
x |
input vector, matrix or dataframe |
check.chr |
check whether the coordinates are in chromosomal format with chr prefix |
verbose |
messages and checks |
integer value. index format (0), bed (1), index in first column (2), rownmames are index (3), unrecognized(4)
Daryl Waggott
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; bedr:::determine.input(a); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; bedr:::determine.input(a); }
Take data frame and return a list of rownames where column value is not 0 i.e. missing
df2list(x, start.col = 1)
df2list(x, start.col = 1)
x |
data frame |
start.col |
offset from first column to ignore certain columns |
returns a list following cleanup and change of data structure
Daryl Waggott
## Not run: df2list(data.frame(a = 1:10, b = 11:20)); ## End(Not run)
## Not run: df2list(data.frame(a = 1:10, b = 11:20)); ## End(Not run)
Download some useful datasets. Some functions such as plotting and fasta querying require additional data.
download.datasets(datasets = "all", data.dir = paste0(Sys.getenv("HOME"), "/bedr/data"))
download.datasets(datasets = "all", data.dir = paste0(Sys.getenv("HOME"), "/bedr/data"))
datasets |
A list of datasets to download. Defaults to "all" i.e. c("refgene","hg19","b37","hugo", "cosmic","clinvar") |
data.dir |
A directory to put the data. Defaults to ~/bedr/data |
External datasets are required for some bedr functionality. For example, plotting intersections based on genes, get.fasta, bed2vcf and validate.gene.names. If these datasets already exist you can simply place symlinks in a directory and use bedr.setup() to define the data.dir.
some datasets.
Daryl Waggott
## Not run: download.datasets("cosmic"); ## End(Not run)
## Not run: download.datasets("cosmic"); ## End(Not run)
Get adjacent flanks from regions
flank.region( x, n.add = NULL, start.add = NULL, end.add = NULL, species = "human", build = "hg19", check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
flank.region( x, n.add = NULL, start.add = NULL, end.add = NULL, species = "human", build = "hg19", check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
x |
a object region object or index |
n.add |
the number of bases to be selected from each side of a region |
start.add |
the number of based to be selected from the start of a region |
end.add |
the number of based to be selected from the end of a region |
species |
the species i.e. human or mouse |
build |
the genome build i.e. hg19 or mm10 |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
check.sort |
should regions be checked for sort order |
check.merge |
should overlapping regions be checked |
verbose |
should log messages and checking take place |
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/flank.html
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- flank.region(a, n.add = 20); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- flank.region(a, n.add = 20); }
Gets the length of each chromosome for a species/build. Choices are human (hg18, hg19, hg38), mouse(mm9, mm10)
get.chr.length(chr = NULL, species = "human", build = "hg19")
get.chr.length(chr = NULL, species = "human", build = "hg19")
chr |
a vector of chromosomes to query. defaults to all. |
species |
species |
build |
build |
A dataframe with chromosome annotations
Daryl Waggott
size <- get.chr.length();
size <- get.chr.length();
return a set of regions for the examples and unit testing
get.example.regions()
get.example.regions()
A list with three example regions
Daryl Waggott
index <- get.example.regions()
index <- get.example.regions()
Query fasta sequence using bedtools get.fasta
get.fasta( x, fasta = NULL, bed12 = FALSE, strand = FALSE, output.fasta = FALSE, use.name.field = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
get.fasta( x, fasta = NULL, bed12 = FALSE, strand = FALSE, output.fasta = FALSE, use.name.field = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
x |
region or index |
fasta |
a fasta file defaults to mini example hg19 human |
bed12 |
should bed12 format be used |
strand |
strand specific i.e. reverse complement negative |
output.fasta |
output a fasta defaults to a data.frame for easier parsing |
use.name.field |
should the name field be used for |
check.zero.based |
check for zero based region |
check.chr |
check for "chr" prefix |
check.valid |
check for valid regions i.e. start < end |
check.sort |
check if region is sorted |
check.merge |
check if region is merged |
verbose |
print progress |
Uses bedtools getFasta to query a fasta file and load into R as a data.frame for easy parsing.
Note that the hg19 reference genome fasta is large and requires on the order of 4 GB RAM to avoid a segfault happens.
A data.frame or fasta. The data.frame has is two columns corresponding to the region and the sequence.
Daryl Waggott, Syed Haider
http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html
if (check.binary("bedtools")) { ## Not run: # get the sequence for a set of regions as a data.frame index <- get.example.regions(); a <- index[[1]]; b <- get.fasta(bedr.sort.region(a)); # this time output a fasta d <- get.fasta(b, output.fasta = TRUE); writeLines(d[[1]], con = "test.fasta"); # get the region adjacent to a set of mutations in a vcf clinvar.vcf.example <- system.file( "extdata/clinvar_dbSNP138_example.vcf.gz", package = "bedr" ); clinvar <- read.vcf(clinvar.vcf.example); # note that clinvar uses ncbi fasta which does not use "chr" and codes chrM as MT clinvar.bed <- data.frame( chr = paste0("chr", gsub("MT", "M", clinvar$vcf$CHROM)), start = clinvar$vcf$POS - 2, end = clinvar$vcf$POS + 1, stringsAsFactors = FALSE ); # get trinucleotide sequences of variants on chr M only mutation.triplet <- get.fasta( clinvar.bed[which(clinvar.bed$chr == "chrM"), ], fasta = system.file("extdata/ucsc.hg19.chrM.fasta", package = "bedr"), check.chr = FALSE ); ## End(Not run) }
if (check.binary("bedtools")) { ## Not run: # get the sequence for a set of regions as a data.frame index <- get.example.regions(); a <- index[[1]]; b <- get.fasta(bedr.sort.region(a)); # this time output a fasta d <- get.fasta(b, output.fasta = TRUE); writeLines(d[[1]], con = "test.fasta"); # get the region adjacent to a set of mutations in a vcf clinvar.vcf.example <- system.file( "extdata/clinvar_dbSNP138_example.vcf.gz", package = "bedr" ); clinvar <- read.vcf(clinvar.vcf.example); # note that clinvar uses ncbi fasta which does not use "chr" and codes chrM as MT clinvar.bed <- data.frame( chr = paste0("chr", gsub("MT", "M", clinvar$vcf$CHROM)), start = clinvar$vcf$POS - 2, end = clinvar$vcf$POS + 1, stringsAsFactors = FALSE ); # get trinucleotide sequences of variants on chr M only mutation.triplet <- get.fasta( clinvar.bed[which(clinvar.bed$chr == "chrM"), ], fasta = system.file("extdata/ucsc.hg19.chrM.fasta", package = "bedr"), check.chr = FALSE ); ## End(Not run) }
generates a set of random regions for a specific species/build. Choices are human (hg18, hg19), mouse(mm9, mm10). regions are sampled from a log-normal distribution.
get.random.regions( n = 10, chr = NULL, species = "human", build = "hg19", size.mean = 10, size.sd = 0.25, mask.gaps = FALSE, mask.repeats = FALSE, sort.output = TRUE, verbose = TRUE )
get.random.regions( n = 10, chr = NULL, species = "human", build = "hg19", size.mean = 10, size.sd = 0.25, mask.gaps = FALSE, mask.repeats = FALSE, sort.output = TRUE, verbose = TRUE )
n |
number of regions |
chr |
the chr or region |
species |
species |
build |
build |
size.mean |
region mean in log space |
size.sd |
region sd in log space |
mask.gaps |
should the gaps (Ns) in the human reference be ignored as potential start points. This drammatically increases memory and run time but is more appropriate in almost all settings. By default it's off. |
mask.repeats |
should the repeats from repeatMasker be ignored as potential start points. This drammatically increases memory and run time but is more appropriate in almost all settings. By default it's off. |
sort.output |
return a sorted index |
verbose |
words |
Daryl Waggott
test.regions <- get.random.regions(100)
test.regions <- get.random.regions(100)
Get adjacent flanks from regions
grow.region( x, n.add = NULL, start.add = NULL, end.add = NULL, species = "human", build = "hg19", check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
grow.region( x, n.add = NULL, start.add = NULL, end.add = NULL, species = "human", build = "hg19", check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
x |
a object region object or index |
n.add |
the number of bases to be selected from each side of a region |
start.add |
the number of based to be selected from the start of a region |
end.add |
the number of based to be selected from the end of a region |
species |
the species i.e. human or mouse |
build |
the genome build i.e. hg19 or mm10 |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
check.sort |
should regions be checked for sort order |
check.merge |
should overlapping regions be checked |
verbose |
should log messages and checking take place |
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/slop.html
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- grow.region(a, n.add = 20); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- grow.region(a, n.add = 20); }
checks if regions in object a are found in object b
in.region( x, y, proportion.overlap = 1e-09, reciprocal.overlap = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = FALSE )
in.region( x, y, proportion.overlap = 1e-09, reciprocal.overlap = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = FALSE )
x |
first region index in the form chr:start-stop. regions in this index will be checked for intersection in the values of the second index. |
y |
second region index. |
proportion.overlap |
Defaults 1e-9 which is 1 bp. See details below for the different interpretation between 0 and 1 based overlap |
reciprocal.overlap |
Should the proportion.overlap be reciprocal |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
check if region is valid |
check.sort |
check if region is sorted |
check.merge |
check if overlapping regions are merged |
verbose |
prints some debugging information. currently it just checks if the input regions are overlapping |
The function can also be called using syntax similar to the %in% operator, for example "region1 %in.region% region2"
The default is to report TRUE if there is 1bp overlap in zero based bed format. That means that region chr1:10-20 and chr1:20-30 would not overlap. To switch to one based intuitive interpretation set proportion.overlap = 0.
Returns a logical vector the length of x.
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); d <- in.region(a,b); # alternative calling d <- a %in.region% b }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); d <- in.region(a,b); # alternative calling d <- a %in.region% b }
convert a region index into a bed file dataframe
index2bed(x, set.type = TRUE)
index2bed(x, set.type = TRUE)
x |
an index |
set.type |
should the attribute input.type be set. Sometimes it is desirable to avoid setting it when applying intermediate conversion |
Daryl Waggott
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a.bed <- index2bed(a); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a.bed <- index2bed(a); }
checks if region file is merged
is.merged.region( x, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, verbose = FALSE )
is.merged.region( x, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, verbose = FALSE )
x |
region or index |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
check if region is valid |
check.sort |
check if region is sorted |
verbose |
more words |
Daryl Waggott
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- is.merged.region(a); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- is.merged.region(a); }
checks if region file is sorted
is.sorted.region( x, method = "lex", engine = "unix", check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.merge = TRUE, verbose = FALSE )
is.sorted.region( x, method = "lex", engine = "unix", check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.merge = TRUE, verbose = FALSE )
x |
The region index, bed file, or bed formatted object |
method |
lexicgraphical or natural, lex is required for many operations but natural is better for interpretation |
engine |
what analytical engine to use for sorting i.e. bedtools, bedops, gnu unix |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
check if region is valid |
check.merge |
check if region is merged |
verbose |
more words |
Daryl Waggott
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- is.sorted.region(a); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- is.sorted.region(a); }
verifies the reference sequence in a vcf
is.valid.ref( x, fasta = NULL, strand = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
is.valid.ref( x, fasta = NULL, strand = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
x |
input bed object |
fasta |
a reference build in fasta format |
strand |
should strand be used. if reverse then the sequence will be reverse complemented |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
check.sort |
should regions be checked for sort order |
check.merge |
should overlapping regions be checked |
verbose |
should log messages and checking take place |
a logical vector the length of the input
Daryl Waggott
vcf.path <- system.file("extdata/callerA.vcf.gz", package = "bedr"); vcf.data <- read.vcf(vcf.path, split.info = TRUE); vcf.data$vcf <- vcf.data$vcf[, c("CHROM", "POS", "END", setdiff(colnames(vcf.data$vcf), c("CHROM", "POS", "END"))) ]; vcf.data$vcf$CHROM <- paste("chr", vcf.data$vcf$CHROM, sep = ""); ## Not run: # need reference sequence FASTA and index file to run this, as 'fasta' parameter is.valid.ref(vcf.data); ## End(Not run)
vcf.path <- system.file("extdata/callerA.vcf.gz", package = "bedr"); vcf.data <- read.vcf(vcf.path, split.info = TRUE); vcf.data$vcf <- vcf.data$vcf[, c("CHROM", "POS", "END", setdiff(colnames(vcf.data$vcf), c("CHROM", "POS", "END"))) ]; vcf.data$vcf$CHROM <- paste("chr", vcf.data$vcf$CHROM, sep = ""); ## Not run: # need reference sequence FASTA and index file to run this, as 'fasta' parameter is.valid.ref(vcf.data); ## End(Not run)
checks if region/index is valid
is.valid.region( x, check.zero.based = TRUE, check.chr = TRUE, throw.error = FALSE, verbose = TRUE )
is.valid.region( x, check.zero.based = TRUE, check.chr = TRUE, throw.error = FALSE, verbose = TRUE )
x |
The region index, bed file, or bed formatted object |
check.zero.based |
should basic test for zero based coordinates be checked |
check.chr |
should the algorithm check for the "chr" prefix |
throw.error |
should an error be thrown. The default is to report a logical vector of inconsistencies. |
verbose |
should diagnostic messages be printed |
Daryl Waggott
index <- get.example.regions(); a <- index[[1]]; is.valid <- is.valid.region(a); a.valid <- a[is.valid];
index <- get.example.regions(); a <- index[[1]]; is.valid <- is.valid.region(a); a.valid <- a[is.valid];
verifies that sequences are correct given coordinates and a reference
is.valid.seq( x, querySeq, fasta = NULL, strand = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
is.valid.seq( x, querySeq, fasta = NULL, strand = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
x |
input bed object |
querySeq |
a vector of sequences the same length as x |
fasta |
a reference build in fasta format |
strand |
should strand be used. if reverse then the sequence will be reverse complemented |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
check.sort |
should regions be checked for sort order |
check.merge |
should overlapping regions be checked |
verbose |
should log messages and checking take place |
a logical vector the length of the input querySeq
Daryl Waggott, Syed Haider
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- get.fasta(bedr.sort.region(a)); is.valid.seq(x = a, querySeq = a$sequence); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- get.fasta(bedr.sort.region(a)); is.valid.seq(x = a, querySeq = a$sequence); }
calculate the jaccard distance between sets of intervals
jaccard( x, y, proportion.overlap = 1e-09, reciprocal.overlap = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
jaccard( x, y, proportion.overlap = 1e-09, reciprocal.overlap = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
x |
first region to be compared |
y |
second region to be compared |
proportion.overlap |
Defaults 1e-9 which is 1 bp. See details below for the different interpretation between 0 and 1 based overlap |
reciprocal.overlap |
Should the proportion.overlap be reciprocal |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
check.sort |
should regions be checked for sort order |
check.merge |
should overlapping regions be checked |
verbose |
should log messages and checking take place |
The Jaccard metric is the ratio of intersections to unions. The process can be tweaked by changing the proportion of overlap and even growiwng the regions.
A short vector.
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
reldist
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); jaccard(a,b); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); jaccard(a,b); }
Interface to R's modifyList
modifyList2(x, val)
modifyList2(x, val)
x |
a named list |
val |
a named list with components to be updated using x |
modified version of x
Daryl Waggott
modifyList
Helps if you don't want to use sort region on a huge dataset
order.region( x, method = "lex", check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.merge = TRUE )
order.region( x, method = "lex", check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.merge = TRUE )
x |
index or bed style data.frame |
method |
natural or lexicographical (lex) |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
check if region is valid |
check.merge |
check if region is sorted and merged |
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); a.order <- order.region(a) b <- a[a.order]; }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); a.order <- order.region(a) b <- a[a.order]; }
permute a set of regions
permute.region( x, stratify.by.chr = FALSE, species = "human", build = "hg19", chr.names = paste0("chr",c(1:22,"X","Y","M")), mask.gaps = FALSE, gaps.file = NULL, mask.repeats = FALSE, repeats.file = NULL, sort.output = TRUE, is.checked = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, verbose = TRUE )
permute.region( x, stratify.by.chr = FALSE, species = "human", build = "hg19", chr.names = paste0("chr",c(1:22,"X","Y","M")), mask.gaps = FALSE, gaps.file = NULL, mask.repeats = FALSE, repeats.file = NULL, sort.output = TRUE, is.checked = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, verbose = TRUE )
x |
regions to permute |
stratify.by.chr |
Should the permutation be happen separetely for each chromosome. That is are chromosomes exchangeable. |
species |
species |
build |
the build of the reference |
chr.names |
names of the chromosomes to use |
mask.gaps |
should the gaps (Ns) in the human reference be ignored as potential start points. This drammatically increases memory and run time but is more appropriate in almost all settings. It defaults to off |
gaps.file |
database file of gaps. Defaults to Homo sapiens Hg19 gap.txt.gz file available through UCSC |
mask.repeats |
should the repeats from repeatMasker be ignored as potential start points. This drammatically increases memory and run time but is more appropriate in almost all settings. By default it's off |
repeats.file |
database file of repeats as supplied by UCSC containing RepMasker data e.g rmsk.txt.gz |
sort.output |
should the output be sorted |
is.checked |
Has the input data already be tested for validity. This is often done once before multiple permutations. |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
verbose |
should log messages and checking take place |
1. Sampling with replacement on region length. 2. Sampling with replacement on start positions. Positions that contain Ns in the reference are set to 0 weight during sampling.
Regions that overlap the end of a chromosome or gap are trimmed.
Steps 1 and 2 are done within chromosomes if stratify.by.chr is set to true.
A region object with randomized start positions.
Daryl Waggott
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); a.perm <- permute.region(a); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); a.perm <- permute.region(a); }
process.input
process.input( input, tmpDir = NULL, include.names = TRUE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
process.input( input, tmpDir = NULL, include.names = TRUE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
input |
regions input or a file in one of the standard formats. these are bed, vcf, gff, bam, sam, csv, tsv, txt |
tmpDir |
The directory to be used for writing files |
include.names |
should the names of the input files be included in the output |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checked for integerity |
check.sort |
should the region sorting be checked |
check.merge |
should overlapping regions be checked |
verbose |
messages and checks |
list of input files
Daryl Waggott
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); a.processed <- process.input(a, verbose = FALSE) }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); a.processed <- process.input(a, verbose = FALSE) }
read a ucsc table into R
query.ucsc( x, mirror = "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database", download = TRUE, overwrite.local = FALSE, columns.keep = NULL, verbose = TRUE )
query.ucsc( x, mirror = "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database", download = TRUE, overwrite.local = FALSE, columns.keep = NULL, verbose = TRUE )
x |
a ucsc data table. Include the full path including "txt.gz" extenstion to load from a local file. Note that $HOME/bedr/data will be checked first before downloading. |
mirror |
the ucsc mirror |
download |
should the data be downloaded to $HOME/bedr/data/ |
overwrite.local |
should the local version be overwritten if it exists |
columns.keep |
what columns to load. this can help with very large tables where you only want 'chr,start,end'. defaults to all. you may have to check the sql for the actual column names. |
verbose |
more words |
tables can be found at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database
A data.frame
Daryl Wagott
## Not run: query.ucsc("refGene"); ## End(Not run)
## Not run: query.ucsc("refGene"); ## End(Not run)
Read a vcf into R and parse it for downstream analysis
read.vcf(x, split.info = FALSE, split.samples = FALSE, nrows = -1, verbose = TRUE)
read.vcf(x, split.info = FALSE, split.samples = FALSE, nrows = -1, verbose = TRUE)
x |
A vcf |
split.info |
Split the info into columns |
split.samples |
Split the sample into columns. If multiple samples then a list matrices will be created, one matrix for each element in the FORMAT tag. |
nrows |
The the number of rows to be read. Set to 0 to parse the header. |
verbose |
print progress |
The function can be slow for splitting the INFO, FORMAT for large VCFs.
VCF representation in R as a list. The first element in the list is the header, the second the body of the VCF. Every repeating tag in the header i.e. INFO, FORMAT is structured as matrix. If reading a multi-sample VCF and the split.sample = TRUE, then a matrix is added to the list for every tag in the FORMAT string.
Note that by default the vcf is returned as a data.table not a data.frame. Therefore there are some quirks i.e. subsetting via named columns a$vcf[,"CHROM", with = FALSE]. If in doubt just caset to data.frame.
Daryl Waggott
clinVar.vcf.example <- system.file("extdata/clinvar_dbSNP138_example.vcf.gz", package = "bedr") singleSample.vcf.example <- system.file("extdata/singleSampleOICR_example.vcf.gz", package = "bedr") multiSample.vcf.example <- system.file("extdata/multiSampleOICR_example.vcf.gz", package = "bedr") # read a subset of NCBI clinVar vcf. Note this has no samples. vcf1.a <- read.vcf(clinVar.vcf.example) vcf1.b <- read.vcf(clinVar.vcf.example, split.info = TRUE) ## Not run: # same as above but split multiple samples vcf1.c <- read.vcf(clinVar.vcf.example, split.info = TRUE, split.sample = TRUE) # read a single-sample VCF system.time( vcf2.a <- read.vcf(singleSample.vcf.example, split.info = TRUE, split.sample = TRUE) ) # read a multi-sample VCF vcf3.a <- read.vcf(multiSample.vcf.example, split.info = FALSE, split.sample = TRUE); # multi core example options("cores"=9); vcf2.a <- read.vcf(singleSample.vcf.example, split.info = TRUE, split.sample = TRUE) options("cores"=1); ## End(Not run)
clinVar.vcf.example <- system.file("extdata/clinvar_dbSNP138_example.vcf.gz", package = "bedr") singleSample.vcf.example <- system.file("extdata/singleSampleOICR_example.vcf.gz", package = "bedr") multiSample.vcf.example <- system.file("extdata/multiSampleOICR_example.vcf.gz", package = "bedr") # read a subset of NCBI clinVar vcf. Note this has no samples. vcf1.a <- read.vcf(clinVar.vcf.example) vcf1.b <- read.vcf(clinVar.vcf.example, split.info = TRUE) ## Not run: # same as above but split multiple samples vcf1.c <- read.vcf(clinVar.vcf.example, split.info = TRUE, split.sample = TRUE) # read a single-sample VCF system.time( vcf2.a <- read.vcf(singleSample.vcf.example, split.info = TRUE, split.sample = TRUE) ) # read a multi-sample VCF vcf3.a <- read.vcf(multiSample.vcf.example, split.info = FALSE, split.sample = TRUE); # multi core example options("cores"=9); vcf2.a <- read.vcf(singleSample.vcf.example, split.info = TRUE, split.sample = TRUE) options("cores"=1); ## End(Not run)
Calculate the relative distance between two sets of intervals
reldist( x, y, detail = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort =TRUE, check.merge = TRUE, verbose = TRUE )
reldist( x, y, detail = FALSE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort =TRUE, check.merge = TRUE, verbose = TRUE )
x |
firt region to be compared |
y |
second region to be compared |
detail |
should the relative distance be printed for every region |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
check.sort |
should regions be checked for sort order |
check.merge |
should overlapping regions be checked |
verbose |
should log messages and checking take place |
The frequency of relative distances in bins spanning 0 to 0.5
Daryl Waggott
http://bedtools.readthedocs.org/en/latest/content/tools/reldist.html
jaccard
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); reldist(a,b); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); reldist(a,b); }
Get region size
size.region( x, zero.based = TRUE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, verbose = TRUE )
size.region( x, zero.based = TRUE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, verbose = TRUE )
x |
region in vector, matrix or dataframe format |
zero.based |
whether the coordinates are zero-based or 1 |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checked for integerity |
verbose |
messages and checks |
size/length of the region
Daryl Waggott
convert2bed
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a.sizes <- bedr:::size.region(a); }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; a.sizes <- bedr:::size.region(a); }
split a vector of strings into tabular data
strsplit2matrix(x, split, fixed = FALSE, perl = FALSE)
strsplit2matrix(x, split, fixed = FALSE, perl = FALSE)
x |
a character vector |
split |
the character or regex to split on |
fixed |
fixed i.e. no regex |
perl |
per style |
Daryl Waggott
## Not run: a.bed <- strSplitToMatrix(x); ## End(Not run)
## Not run: a.bed <- strSplitToMatrix(x); ## End(Not run)
Main bedtools wrapper function.
tabix( region, file.name, params = NULL, tmpDir = NULL, deleteTmpDir = TRUE, outputDir = NULL, outputFile = NULL, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
tabix( region, file.name, params = NULL, tmpDir = NULL, deleteTmpDir = TRUE, outputDir = NULL, outputFile = NULL, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE, check.merge = TRUE, verbose = TRUE )
region |
The regions to query the tabix'd file |
file.name |
The name of the bzipped/indexed tabix file to query |
params |
A string that includes all the extra parameters and arguments to the bedtools commmand. For example if you wanted to do a left outer join you would specificy method as intersect and use params = c("-loj -header"). If you leave input and method as defaults then this is this string represents the full command. |
tmpDir |
The directory to be used for writing files |
deleteTmpDir |
Should tmp files be deleted. helpful for diagnostics. |
outputDir |
The output directory. Only used if outputFile is specified. It defaults to the current working directory. |
outputFile |
The name of the output file. If this is specified the output will be sent to a file not an R object |
check.chr |
check for chr prefix |
check.zero.based |
check for zero based coordinates |
check.valid |
do all region integrity checks |
check.sort |
check if region is sorted |
check.merge |
check if region is merged |
verbose |
Should messages be printed to screen. |
The output of command with some parsing to keep it consistent with the input.
Daryl Waggott
genomicRanges
if (check.binary("tabix")) { query.regions <- c("1:1000-100000", "1:1000000-1100000") cosmic.vcf.example <- system.file( "extdata/CosmicCodingMuts_v66_20130725_ex.vcf.gz", package = "bedr" ) cosmic.query <- tabix(query.regions, cosmic.vcf.example, check.chr = FALSE) }
if (check.binary("tabix")) { query.regions <- c("1:1000-100000", "1:1000000-1100000") cosmic.vcf.example <- system.file( "extdata/CosmicCodingMuts_v66_20130725_ex.vcf.gz", package = "bedr" ) cosmic.query <- tabix(query.regions, cosmic.vcf.example, check.chr = FALSE) }
Plot venn diagram of regions intersect
table2venn(x, var.names)
table2venn(x, var.names)
x |
intersect table of regions |
var.names |
names of the overlapping regions |
venn diagram input list
Daryl Waggott
bedr.plot.region
Compare sets of regions via jaccard and relative distance using permutation to get an empirical p-value.
test.region.similarity( x, y, n = 1000, stratify.by.chr = FALSE, species = "human", build = "hg19", mask.gaps = FALSE, mask.repeats = FALSE, gaps.file = NULL, repeats.file = NULL, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, verbose = TRUE )
test.region.similarity( x, y, n = 1000, stratify.by.chr = FALSE, species = "human", build = "hg19", mask.gaps = FALSE, mask.repeats = FALSE, gaps.file = NULL, repeats.file = NULL, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, verbose = TRUE )
x |
first region to be compared. this is the region that is permuted |
y |
second region to be compared |
n |
the number of iterations to permute |
stratify.by.chr |
Should the permutation be happen separetely for each chromosome. That is are chromosomes exchangeable. |
species |
species |
build |
the build of the reference |
mask.gaps |
should the gaps (Ns) in the human reference be ignored as potential start points. This drammatically increases memory and run time but is more appropriate in almost all settings. By default it's off. |
mask.repeats |
should the repeats from repeatMasker be ignored as potential start points. This drammatically increases memory and run time but is more appropriate in almost all settings. By default it's off. |
gaps.file |
database file of gaps. Defaults to Homo sapiens Hg19 gap.txt.gz file available through UCSC |
repeats.file |
database file of repeats as supplied by UCSC containing RepMasker data e.g rmsk.txt.gz |
check.zero.based |
should 0 based coordinates be checked |
check.chr |
should chr prefix be checked |
check.valid |
should the region be checkded for integerity |
verbose |
should log messages and checking take place |
Iteratively permutes intervals and recalculates jaccard and reldist statistics.
A list of results
Daryl Waggott
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); # a simple example test.region.similarity(a, b, n = 8) # note you can set the cores available to parallelize options(cores = 4); system.time(test.region.similarity(a, b, n = 8)); # a real example comparing the distribution of mRNA vs ncRNA genes in RefSeq ## Not run: # more core options(cores = 8); # load refgene refgene <- query.ucsc("refGene") refgene <- refgene[,c("chrom","txStart","txEnd","name","name2","strand")] # only include canonical chr chr <- paste0("chr", c(1:22,"X","Y")); refgene <- refgene[refgene$chrom # remove genes with multiple positions duplicated.gene <- duplicated(refgene$name2) | duplicated(rev(refgene$name2)); refgene <- refgene[!duplicated.gene,]; # only select pr coding genes refgene.nm <- refgene[grepl("^NM",refgene$name),]; # only select non protein coding rna genes refgene.nr <- refgene[grepl("^NR",refgene$name),]; # sort and merge refgene.nm <- bedr.snm.region(refgene.nm,check.chr = FALSE); refgene.nr <- bedr.snm.region(refgene.nr,check.chr = FALSE); test.region.similarity(refgene.nm, refgene.nr, mask.unmapped = TRUE ); option(core = 1) ## End(Not run) }
if (check.binary("bedtools")) { index <- get.example.regions(); a <- index[[1]]; b <- index[[2]]; a <- bedr(engine = "bedtools", input = list(i = a), method = "sort", params = ""); b <- bedr(engine = "bedtools", input = list(i = b), method = "sort", params = ""); # a simple example test.region.similarity(a, b, n = 8) # note you can set the cores available to parallelize options(cores = 4); system.time(test.region.similarity(a, b, n = 8)); # a real example comparing the distribution of mRNA vs ncRNA genes in RefSeq ## Not run: # more core options(cores = 8); # load refgene refgene <- query.ucsc("refGene") refgene <- refgene[,c("chrom","txStart","txEnd","name","name2","strand")] # only include canonical chr chr <- paste0("chr", c(1:22,"X","Y")); refgene <- refgene[refgene$chrom # remove genes with multiple positions duplicated.gene <- duplicated(refgene$name2) | duplicated(rev(refgene$name2)); refgene <- refgene[!duplicated.gene,]; # only select pr coding genes refgene.nm <- refgene[grepl("^NM",refgene$name),]; # only select non protein coding rna genes refgene.nr <- refgene[grepl("^NR",refgene$name),]; # sort and merge refgene.nm <- bedr.snm.region(refgene.nm,check.chr = FALSE); refgene.nr <- bedr.snm.region(refgene.nr,check.chr = FALSE); test.region.similarity(refgene.nm, refgene.nr, mask.unmapped = TRUE ); option(core = 1) ## End(Not run) }
Convert a vcf to a bed file. Currently, it needs to read into R via read.vcf
vcf2bed(x, filename = NULL, header = FALSE, other = NULL, verbose = TRUE)
vcf2bed(x, filename = NULL, header = FALSE, other = NULL, verbose = TRUE)
x |
a vcf object |
filename |
the name of file if you want it exported |
header |
indicate if the bed file has header or not when exported |
other |
fields to include apart from chr, start, end. |
verbose |
more words |
A bed styled R object or an external file
Daryl Waggott
clinVar.vcf.example <- system.file("extdata/clinvar_dbSNP138_example.vcf.gz", package = "bedr") x <- read.vcf(clinVar.vcf.example) x.bed <- vcf2bed(x)
clinVar.vcf.example <- system.file("extdata/clinvar_dbSNP138_example.vcf.gz", package = "bedr") x <- read.vcf(clinVar.vcf.example) x.bed <- vcf2bed(x)
write a vcf object
write.vcf(x, filename = NULL, verbose = TRUE)
write.vcf(x, filename = NULL, verbose = TRUE)
x |
a vcf object |
filename |
a filename |
verbose |
more words |
The input needs to be a vcf object. This
A vcf file
Daryl Waggott
vcf format specifications
vcf <- read.vcf(system.file("extdata/clinvar_dbSNP138_example.vcf.gz", package = "bedr")); vcf$header <- c(vcf$header, NOTE="vcf processed by bedr") ## Not run: write.vcf(vcf, filename = paste(tempdir(), "/bedr.example.vcf", sep = "")); ## End(Not run)
vcf <- read.vcf(system.file("extdata/clinvar_dbSNP138_example.vcf.gz", package = "bedr")); vcf$header <- c(vcf$header, NOTE="vcf processed by bedr") ## Not run: write.vcf(vcf, filename = paste(tempdir(), "/bedr.example.vcf", sep = "")); ## End(Not run)