---
title: "20.1 -- Working with large data"
author: "Martin Morgan"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{20.1 -- Working with large data}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval = as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache = as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
```

Author: [Martin Morgan][]

Date: 26 July 2019
[Martin Morgan]: mailto:Martin.Morgan@RoswellPark.org

# Motivating example

Attach the [TENxBrainData][] experiment data package

```{r, message = FALSE}
library(TENxBrainData)
```

Load a very large `SummarizedExperiment`

```{r}
tenx <- TENxBrainData()
tenx
assay(tenx)
```

Basic operations are performed quickly

```{r}
log1p(assay(tenx))
```

```{r}
tenx_subset <- tenx[, 1:1000]
lib_size <- colSums(assay(tenx_subset))
hist(log10(lib_size))
```

The data are sparse, with more than 92% of cells equal to 0

```{r}
sum(assay(tenx_subset) == 0) / prod(dim(tenx_subset))
```

# Write or use efficient _R_ code

This is the most important step!

_Avoid_ unnecessary copies

```{r}
n <- 50000

## copy `res1` on each iteration
set.seed(123)
system.time({
    res1 <- NULL
    for (i in 1:n)
        res1 <- c(res1, rnorm(1))
})

## pre-allocate
set.seed(123)
system.time({
    res2 <- numeric(n)
    for (i in 1:n)
        res2[i] <- rnorm(1)
})
identical(res1, res2)

## no need to think about allocation!
set.seed(123)
system.time({
    res3 <- sapply(1:n, function(i) rnorm(1))
})
identical(res1, res3)
```

_Vectorize_ your own scripts

```{r}
n <- 2000000

## iteration: n calls to `rnorm(1)`
set.seed(123)
system.time({
    res1 <- sapply(1:n, function(i) rnorm(1))
})

## vectorize: 1 call to `rnorm(n)`
set.seed(123)
system.time(
    res2 <- rnorm(n)
)
identical(res1, res2)
```

_Reuse_ other people's efficient code

- E.g., `limma::lmFit()` to fit 10,000's of linear models very quickly

Examples in the lab this afternoon, and from Tuesday evening

# Use chunk-wise iteration

Example: nucleotide frequency of mapped reads

## An example without chunking

Working through example data; not really large...

```{r, message = FALSE}
library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
basename(fname)
Rsamtools::countBam(fname)
```

Input into a `GAlignments` object, including the `seq` (sequence) of each read.

```{r, message = FALSE}
library(GenomicAlignments)
param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)
```

Write a function to determine the GC content of reads

```{r, message = FALSE}
library(Biostrings)
gc_content <- function(galn) {
    seq <- mcols(galn)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob = TRUE)
    as.vector(gc)
}
```

Calculate and display GC content

```{r}
param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)
res1 <- gc_content(galn)
hist(res1)
```

## The same example, with chunking

Open the file for reading, specifying a 'yield' size

```{r}
bfl <- BamFile(fname, yieldSize = 100000)
open(bfl)
```

Repeatedly read chunks of data and calculate GC content

```{r}
res2 <- numeric()
repeat {
    message(".")
    galn <- readGAlignments(bfl, param = param)
    if (length(galn) == 0)
        break
    ## inefficient copy of res2, but only a few iterations...
    res2 <- c(res2, gc_content(galn))
}
```

Clean up and compare approaches

```{r}
close(bfl)
identical(res1, res2)
```

# Use (classical) parallel evaluation

Many down-sides

- More complicated code, e.g., to distribute data
- Relies on, and requires mastery of, supporting infrastructure

Maximum speed-up

- Proportional to number of parallel computations
- In reality
    - Cost of data movement from 'manager' to 'worker'
    - Additional overhead of orchestrating parallel computations

## [BiocParallel][]

```{r, echo = FALSE}
xx <- gc(); xx <- gc()
```

```{r, message = FALSE}
fun <- function(i) {
    Sys.sleep(1) # a time-consuming calculation
    i            # and then the result
}

system.time({
    res1 <- lapply(1:10, fun)
})

library(BiocParallel)
system.time({
    res2 <- bplapply(1:10, fun)
})
identical(res1, res2)
```

- 'Forked' processes (non-Windows)
    - No need to distribute data from the main thread to workers
- Independent processes
    - Classic clusters, e.g., _slurm_
- Coming: cloud-based solutions
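The back-end used by `bplapply()` can be selected explicitly through its `BPPARAM` argument, or set once with `register()`. A minimal sketch of both styles, assuming four local workers are available (the worker count and the `res_*` object names are illustrative, not from the original material):

```{r, eval = FALSE}
library(BiocParallel)

fun <- function(i) {
    Sys.sleep(1) # a time-consuming calculation
    i            # and then the result
}

## forked processes: workers see the manager's data without explicit
## distribution; not available on Windows
res_fork <- bplapply(1:10, fun, BPPARAM = MulticoreParam(workers = 4))

## independent processes: portable, but inputs must be shipped to workers
res_snow <- bplapply(1:10, fun, BPPARAM = SnowParam(workers = 4))

## register a default back-end, used when `BPPARAM` is not supplied
register(MulticoreParam(workers = 4))
res_default <- bplapply(1:10, fun)
```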
## [GenomicFiles][]

Parallel, chunk-wise iteration through genomic files.

Set up:

```{r, message = FALSE}
library(GenomicFiles)
```

Define a `yield` function that provides a chunk of data for processing

```{r}
yield <- function(x) {
    param <- ScanBamParam(what = "seq")
    readGAlignments(x, param = param)
}
```

Define a `map` function that transforms the input data to the desired result

```{r}
map <- function(x) {
    seq <- mcols(x)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob = TRUE)
    as.vector(gc)
}
```

Define a `reduce` function that combines two successive results

```{r}
reduce <- c
```

Perform the calculation, chunk-wise and in parallel

```{r}
library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
bfl <- BamFile(fname, yieldSize = 100000)
res <- reduceByYield(bfl, yield, map, reduce, parallel = TRUE)
hist(res)
```

[BiocParallel]: https://bioconductor.org/packages/BiocParallel
[GenomicFiles]: https://bioconductor.org/packages/GenomicFiles
[TENxBrainData]: https://bioconductor.org/packages/TENxBrainData

# Query specific values

# Provenance

```{r}
sessionInfo()
```