——标题:"管理和分析大型基因组数据"输出:BiocStyle::html_document: toc: true number_sections: false vignette: > % \VignetteIndexEntry{实验室:管理和分析大型基因组数据}% \VignetteEngine{knitr::rmarkdown} \ uspackage [utf8]{inputenc}——' ' ' {r style, echo= false, results='asis'} BiocStyle::markdown()``` ```{r setup, echo=FALSE, results='hide'}库(knitr) opts_chunk$set(cache=TRUE, error=FALSE)作者:Valerie Obenchain (vobencha@fredhutch.org)、马丁·摩根(mtmorgan@fredhutch.org
本节的目的是强调编写正确、健壮和高效的R代码的实践。# #优先1。正确:与手工示例一致('同()',' all.equal() ')健壮:支持现实的输入,例如,0长度的向量,“NA”值,…3.简单:容易懂下个月;描述它对同事的影响很容易;容易发现逻辑错误;容易提高。4. Fast, or at least reasonable given the speed of modern computers. ## Strategies 1. Profile - _Look_ at the script to understand in general terms what it is doing. - _Step_ through the code to see how it is executed, and to gain an understanding of the speed of each line. - _Time_ evaluation of select lines or simple chunks of code with `system.time()` or the `r CRANpkg("microbenchmark")` package. - _Profile_ the code with a tool that indicates how much time is spent in each function call or line -- the built-in `Rprof()` function, or packages such as `r CRANpkg("lineprof")` or `r CRANpkg("aprof")` 2. Vectorize -- operate on vectors, rather than explicit loops ```{r vectorize} x <- 1:10 log(x) ## NOT for (i in seq_along) x[i] <- log(x[i]) ``` 3. Pre-allocate memory, then fill in the result ```{r pre-allocate} result <- numeric(10) result[1] <- runif(1) for (i in 2:length(result)) result[i] <- runif(1) * result[i - 1] result ``` 4. Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once - Simple, e.g., 'hoist' constant multiplications from a `for` loop - Higher-level, e.g., use `lm.fit()` rather than repeatedly fitting the same design matrix. 5. Re-use existing, tested code - Efficient implementations of common operations -- `tabulate()`, `rowSums()` and friends, `%in%`, ... - Efficient domain-specific implementations, e.g., `r Biocpkg("snpStats")` for GWAS linear models; `r Biocpkg("limma")` for microarray linear models; `r Biocpkg("edgeR")`, `r Biocpkg("DESeq2")` for negative binomial GLMs relevant to RNASeq. - Reuse others' work -- `r Biocpkg("DESeq2")`, `r Biocpkg("GenomicRanges")`, `r Biocpkg("Biostrings")`, ..., `r CRANpkg("dplyr")`, `r CRANpkg("data.table")`, `r CRANpkg("Rcpp")`回到顶部这里有一个明显低效的函数:' ' ' {r低效}f0 <- function(n, a=2) {## stopifnot(is.integer(n) && (length(n) == 1) && ## !is.na(n) && (n > 0)) result <- numeric() for (i in seq_len(n)) result[[i]] <- a * log(i) result} ' ' '使用' system.time() '来研究该算法如何随' n '扩展,关注运行时间。' ' ' {r system.time} system.time(f0(10000)) n <- 1000 * seq(1,20,2) t <- sapply(n, function(i) system.time(f0(i))[[3]]) plot(t ~ n, type="b")' ' '记住当前的'correct'值,以及一个近似的时间' ' ' {r correct-init} n <- 10000系统。时间(预期<- f0(n))修改函数,将公共乘法器“a”提升出循环。确保“优化”的结果和原来的计算是相同的。使用' r CRANpkg("microbenchmark") '包来比较两个版本' ' ' {r hoist} f1 <- function(n, a=2) {result <- numeric() for (i in seq_len(n)) result[[i]] <- log(i) a * result}同(预期的,f1(n))库(microbenchmark) microbenchmark(f0(n), f1(n), times=5)' '采用'预分配和填充'策略' ' ' {r预分配和填充}f2 <- function(n, a=2) {result <- numeric(n) for (i in seq_len(n)) result[[i]] <- log(i) a * result}同(预期,f2(n))微基准测试(f0(n), f2(n), times=5)使用' *apply() '函数来避免显式预分配,并使向量化的机会更明显。' ' ' {r use-apply} f3 <- function(n, a=2) a * sapply(seq_len(n), log) same (expected, f3(n)) microbenchmark(f0(n), f2(n), f3(n), times=10)既然代码是在一行中显示的,显然可以很容易地向量化它。 Seize the opportunity to vectorize it: ```{r use-vectorize} f4 <- function(n, a=2) a * log(seq_len(n)) identical(expected, f4(n)) microbenchmark(f0(n), f3(n), f4(n), times=10) ``` `f4()` definitely seems to be the winner. How does it scale with `n`? (Repeat several times) ```{r vectorized-scale} n <- 10 ^ (5:8) # 100x larger than f0 t <- sapply(n, function(i) system.time(f4(i))[[3]]) plot(t ~ n, log="xy", type="b") ``` Any explanations for the different pattern of response? Lessons learned: 1. Vectorizing offers a huge improvement over iteration 2. Pre-allocate-and-fill is very helpful when explicit iteration is required. 3. `*apply()` functions help avoid need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it 4. Hoisting common sub-expressions can be helpful for improving performance when explicit iteration is required.回到顶部当数据太大,无法装入内存时,我们可以以块的形式遍历文件,或按字段或基因组位置将数据划分为子集。迭代-块- '打开()',读取块(s), '关闭()'。-例如,' yieldSize '参数到' Rsamtools::BamFile() ' -框架:' genomic icfiles::reduceByYield() '限制-限制到列和/或感兴趣的行-利用领域特定的格式- BAM文件和' Rsamtools::ScanBamParam() ' - BAM文件和' Rsamtools::PileupParam() ' - VCF文件和' VariantAnnotation::ScanVcfParam() ' -使用一个数据库##练习:重复通过文件:GenomicFiles:: reduceByYield()(1)产生一个块(2)映射的输入块(3)可能改变表示减少映射块' ' ' {r reduceByYield-setup} suppressPackageStartupMessages({库(GenomicFiles)库(GenomicAlignments)图书馆(Rsamtools)图书馆(TxDb.Hsapiens.UCSC.hg19.knownGene)})收益率< - #如何输入的下一块数据函数(X,…){readGAlignments (X)} < - #如何映射到每个块功能(价值,…{olaps <- findOverlaps(VALUE, roi, type="within", ignore.strand=TRUE) count <- tabulate(subjectHits(olaps), subjectLength(olaps)) setNames(count, names(roi))} reduce <- ' + ' #如何合并映射块' ' '' ' ' {r yieldFactory} yieldFactory <- #返回一个带有局部状态函数的函数(){n_records <- 0L function(X,…){aln <- readGAlignments(X) n_records <<- n_records + length(aln) message(n_records) aln}} ' ' '感兴趣的区域,像bam文件中的染色体一样命名。' ' ' {r count-overlap -roi, eval=FALSE} exByTx <- exonsBy(TxDb.Hsapiens.UCSC.hg19。/home/ubuntu/data/vobencha/LargeData/srarchive/hg19_alias. txt <- "/home/ubuntu/data/vobencha/LargeData/srarchive/hg19_alias. txtmap0 <- read.delim(fl, header=FALSE, stringsAsFactors=FALSE) seqlevels(exByTx, force=TRUE) <- setNames(map0$V1, map0$V2)迭代bam文件的函数。' ' ' {r count-overlaps, eval=FALSE} count1 <- function(filename, roi) {message(filename) ##创建并打开BAM文件bf <- BamFile(filename, yieldSize=1000000) reduceByYield(bf, yieldFactory(), map, reduce, roi=roi)} ' ' ' In action ' ' ' {r count-overlaps-doit, eval=FALSE} BAM <- "/home/ubuntu/data/vobencha/LargeData/srarchive/SRR1039508_sorted. "count <- count (bam, exByTx) ```回到顶部#文件管理# #文件类| |型的例子使用| |包名称 | |-------|-----------------------|-----------------------------|----------------------------------| | . 床| |范围注释BedFile () ' | ' r Biocpkg(“rtracklayer”)“| | .wig | |报道“WigFile()”,“BigWigFile () ' | ' r Biocpkg(“rtracklayer”)“| | .gtf | |记录模型的GTFFile()”|“r Biocpkg(“rtracklayer ")` | | | | ` makeTxDbFromGFF () ' | ' r Biocpkg(“GenomicFeatures”)“| | .2bit | |基因组序列的TwoBitFile () ' | ' r Biocpkg(“rtracklayer”)“| | .fastq |读&品质|“FastqFile () ' | ' r Biocpkg(“ShortRead”)“| |本|一致读|“BamFile () ' | ' r Biocpkg(“Rsamtools”)“| | .tbx |索引一样|“TabixFile () ' | ' r Biocpkg(“Rsamtools”)“| | .vcf电话| |变体' VcfFile() ' | ' r Biocpkg("VariantAnnotation") ' | ' ' {r rtracklayer-file-classes} ## rtracklayer menagerie suppressPackageStartupMessages(库(rtracklayer)) names(getClass("RTLFile")@subclasses)注释-不是一个一致的接口- ' open() ', ' close() ', ' import() ' / ' yield() ' / ' read*() ' -有些:通过索引选择导入(例如,'。白”,bam指数);选择(“列”);##管理文件集合' *FileList() '类- ' reduceByYield() '——遍历单个大文件- ' bplapply() ' ('r Biocpkg("BiocParallel") ')——在并行的' genome icfiles() '中对多个文件执行独立操作- 'rows'作为基因组范围限制,'columns'作为文件-每一行x列是一个_map_从文件数据到_R_ - ' reduceByRange() ', ' reduceByFile() '中的有用表示:将映射折叠成摘要表示-参见基因组文件小插图[图1](//www.anjoumacpherson.com/packages/devel/bioc/vignettes/GenomicFiles/inst/doc/GenomicFiles.pdf)回到顶部一些注意事项——迭代/限制技术保持内存需求处于控制之下,而并行计算将计算负载分配到各个节点。请记住,并行计算仍然受到每个节点上可用内存量的限制。设置和拆除集群会产生开销,在分布式内存中进行计算时更是如此。对于小型计算,并行开销可能会超过性能没有提高的好处。从并行执行中获益最多的作业是cpu密集型作业,它们操作的数据块适合内存。Biocpkg("BiocParallel") ' '为并行评估提供了一个标准化的接口,并支持主要的并行计算风格:单台计算机上的分叉和进程,特设集群,批处理调度器和云计算。默认情况下,' r Biocpkg("BiocParallel") '选择了一个适合于操作系统的并行后端,并在Unix、Mac和Windows中得到支持。一般思想:—使用' bplapply() '而不是' lapply() '—参数' BPPARAM '影响并行计算的发生方式—' MulticoreParam() ':线程在单个(非windows)机器上—' SnowParam() ':进程在同一台或不同的机器上—' BatchJobsParam() ':资源调度程序在集群上—' DoparParam() ':使用' foreach() '并行执行###这个小示例促进了并行执行的使用,并演示了' bplapply() '如何成为' lapply '的插入项。使用' system.time() '来研究当' n '从1增加到10时执行该操作需要多长时间。 Use `identical()` and `r CRANpkg("microbenchmark")` to compare alternatives `f0()` and `f1()` for both correctness and performance. `fun` sleeps for 1 second, then returns `i`. ```{r parallel-sleep} library(BiocParallel) fun <- function(i) { Sys.sleep(1) i } ## serial f0 <- function(n) lapply(seq_len(n), fun) ## parallel f1 <- function(n) bplapply(seq_len(n), fun) ```回到顶部练习:错误处理和' bpredo() ' ' r Biocpkg("BiocParallel") ' ' "捕获并返回"错误以及成功的结果。本练习演示如何访问失败任务的“traceback()”,以及如何使用“BPREDO”重新运行失败任务。关于错误处理、日志记录和调试的详细信息请参见[错误、日志和调试](//www.anjoumacpherson.com/packages/3.2/bioc/vignettes/BiocParallel/inst/doc/Errors_Logs_And_Debugging.pdf)章节。' ' ' {r parallel-bpredo-param, eval=FALSE} param <- MulticoreParam(workers = 3) param ' ' ' '跨'X'调用' sqrt() '函数;第二个元素是一个字符,会抛出和出错。{r parallel-bpredo-bplapply, eval=FALSE} X <- list(1, "2", 3) res <- bplapply(X, sqrt, BPPARAM = param) res ' ' '与错误消息一起存储在返回的列表元素中,可以使用' attr() '访问:' ' ' {r parallel-bpredo-traceback, eval=FALSE} attr(res[[2]], "traceback")通过重复调用' bplapply() '重新运行失败的结果,这次使用更正的输入数据和'BPREDO'的部分结果。' ' ' {r parallel-bpredo, eval=FALSE} .redo <- list(1,2,3) bplapply(X。redo, sqrt, BPREDO = res)' ' '回到顶部' r Biocpkg("BiocParallel") '使用[fatile .logger](http://cran.r-project.org/web/packages/futile.logger/index.html)包进行日志记录。该包有一个灵活的系统,用于过滤具有不同严重程度阈值的消息,如INFO、DEBUG、ERROR等(有关所有阈值的列表,请参阅?bpthreshold手册页)。' r Biocpkg("BiocParallel") '捕获用futile写的消息。记录器格式以及写入stdout和stderr的消息。这个函数执行一些参数检查,并具有DEBUG、WARN和info级别的日志消息。' ' ' {r logging, eval=FALSE} FUN <- function(i) {flog.debug(paste0(" 'i'的值:",i)) if (!length(i)) {flog.debug。warn("'i'缺失")NA} else if (!is(i, "numeric")) {flog.info("coercing to numeric") as.numeric(i)} else {i}} ' ' '在参数中打开日志记录,并将阈值设置为WARN。{r logging-WARN, eval=FALSE} param <- SnowParam(3, log = TRUE, threshold = "WARN") bplapply(list(1, "2", integer()), FUN, BPPARAM = param)' ' '将阈值降低到INFO和DEBUG(例如,使用' bpthreshold<- '),以查看消息是如何根据严重程度进行过滤的。回到顶部对于长时间运行的作业或未经测试的代码,设置时间限制是很有用的。_timeout_字段是允许每个工人完成任务的时间,以秒为单位。如果任务花费的时间超过_timeout_, worker将返回一个错误。_timeout_可以在参数构造过程中设置,' ' ' {r timeout_constructor} param <- SnowParam(timeout = 20) param ' '或使用\Rcode{bptimeout} setter: ' ' {r timeout_setter} bptimeout(param) <- 2 param ' ' '使用此函数通过' bplapply() '在一个数值向量'X'值上探索不同的_timeout_s。小于_timeout_的'X'返回成功,而超过_threshold_的返回错误。' ' ' {r timeout_bplapply} fun <- function(i) {Sys.sleep(i) i} ' ' '回到顶部前面的计数示例使用了' genome icfiles::reduceByFile() ',它对单个文件进行操作,实现了yield、map、reduce范式。在这个练习中,我们将使用' genome icfiles::reduceByFile() ',它在底层使用' bplapply() '来并行操作多个文件。' reduceByFile() '的主参数是一组文件和一组范围。文件被发送到工作者,并根据范围提取数据子集。大部分工作是在_MAP_函数中完成的,可选的_REDUCE_函数将每个worker的输出结合在一起。{r co-setup} suppressPackageStartupMessages({library(BiocParallel) library(genome icfiles) library(genome icalignments) library(Rsamtools)})在Unix或Mac上,配置一个带有4个worker的MulticoreParam()。打开日志记录并将超时设置为60秒。' ' ' {r co-param} param <- MulticoreParam(4, log = TRUE, timeout = 60)在Windows上使用' SnowParam() ': ' ' ' {r co-param-snow, eval=FALSE} param <- SnowParam(4, log = TRUE, timeout = 60) ``` Point to the collection of bam files. ```{r co-bams} fls <- dir("/home/ubuntu/data/vobencha/LargeData/copynumber", ".bam$", full=TRUE) names(fls) <- basename(fls) bfl <- BamFileList(fls) ``` Defining ranges (region of interest) restricts the amount of data on the workers and keeps memory requirements under control. We'll use a set of ranges on the Major Histocompatibility Complex locus on chromosome 6. ```{r co-GRanges} ranges <- GRanges("chr6", IRanges(c(28477797, 29527797, 32448354), c(29477797, 30527797, 33448354))) ``` The _MAP_ function reads in records and counts overlaps. `readGAlignments()` reads in bam records that overlap with any portion of the ranges defined in the _scanBamParam_ (i.e., they could be overlapping the start or the end). Once we've got the records in _R_, we want to count only those that fall 'within' the ranges. ```{r co-map, eval=FALSE} MAP <- function(range, file, ...) { library(GenomicAlignments) ## readGAlignments(), ScanBamParam() param = ScanBamParam(which=range) ## restriction gal = readGAlignments(file, param=param) ## log messages flog.info(paste0("file: ", basename(file))) flog.debug(paste0("records: ", length(gal))) ## overlaps olaps <- findOverlaps(gal, range, type="within", ignore.strand=TRUE) tabulate(subjectHits(olaps), subjectLength(olaps)) } ``` Count ... ```{r co-doit, eval=FALSE} cts <- reduceByFile(ranges, fls, MAP, BPPARAM = param) ``` The result is a list the same length as the number of files. ```{r co-length, eval=FALSE} length(cts) ``` Each list element is the length of the number of ranges. ```{r co-elementlengths, eval=FALSE} elementLengths(cts) ``` Tables of counts for each range are extracted with '[[': ```{r co-tables, eval=FALSE} cts[[1]] ```回到顶部##其他资源- [Bioconductor Amazon AMI](//www.anjoumacpherson.com/help/bioconductor-cloud-ami/) -轻松'spin ' 10的实例-预配置Bioconductor包和StarCluster管理- ' r Biocpkg("GoogleGenomics") '与谷歌计算云和资源交互回到顶部#资源-劳伦斯,M和摩根,M. 2014。可扩展基因组与R和Bioconductor。统计科学2014,Vol. 29, No. 2, 214-226。http://arxiv.org/abs/1409.2864v1 - BiocParallel: //www.anjoumacpherson.com/packages/release/bioc/html/BiocParallel.html - genome icfiles: //www.anjoumacpherson.com/packages/release/bioc/html/GenomicFiles.html回到顶部