AnnotationHub
Authors: Martin Morgan [cre], Marc Carlson [ctb], Dan Tenenbaum [ctb], Sonali Arora [ctb]
Modified: Sun Jun 28 10:41:23 2015
Compiled: Tue Oct 17 19:18:58 2017

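1 Accessing Genome-Scale Data

1.1 Non-model organism gene annotations

Bioconductor offers pre-built org.* annotation packages for model organisms; their use is described in the OrgDb section of the Annotation workflow. Here we discover the OrgDb objects that are available for less commonly studied organisms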

library(AnnotationHub)
ah <- AnnotationHub()
## snapshotDate(): 2017-04-25
query(ah, "OrgDb")
## AnnotationHub with 940 records
## # snapshotDate(): 2017-04-25
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Escherichia coli, Acanthamoeba castellanii_str._Neff, Acanthisitta chloris, Acetobac...
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH53757"]]'
##
##             title
##   AH53757 | org.Ag.eg.db.sqlite
##   AH53758 | org.At.tair.db.sqlite
##   AH53759 | org.Bt.eg.db.sqlite
##   AH53760 | org.Cf.eg.db.sqlite
##   AH53761 | org.Gg.eg.db.sqlite
##   ...       ...
##   AH56652 | org.Pandoravirus_dulcis.eg.sqlite
##   AH56653 | org.Methanocaldococcus_infernus_ME.eg.sqlite
orgdb <- query(ah, "OrgDb")[[1]]
## loading from cache '/home/biocbuild//.AnnotationHub/60495'

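The object returned by AnnotationHub can be used directly with the select() interface, e.g., to discover the keytypes available for querying the object, the columns that these keytypes can map to, and finally to select the SYMBOL and GENENAME corresponding to the first 6 ENTREZID values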

keytypes(orgdb)
##  [1] "ACCNUM"       "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"
##  [7] "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"
## [13] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"       "UNIGENE"
## [19] "UNIPROT"
columns(orgdb)
##  [1] "ACCNUM"       "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"
##  [7] "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"
## [13] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"       "UNIGENE"
## [19] "UNIPROT"
egid <- head(keys(orgdb, "ENTREZID"))
select(orgdb, egid, c("SYMBOL", "GENENAME"), "ENTREZID")
## 'select()' returned 1:1 mapping between keys and columns
##   ENTREZID          SYMBOL      GENENAME
## 1  1267437 AgaP_AGAP012606 AGAP012606-PA
## 2  1267439 AgaP_AGAP012559 AGAP012559-PA
## 3  1267440 AgaP_AGAP012558 AGAP012558-PA
## 4  1267447 AgaP_AGAP012586 AGAP012586-PA
## 5  1267450 AgaP_AGAP012834 AGAP012834-PA
## 6  1267459 AgaP_AGAP012589 AGAP012589-PA
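
When only a single column is needed, a named vector is often more convenient than the data.frame returned by select(). The following is a minimal sketch, assuming the same first OrgDb record as above is retrieved from the hub (this needs network access on first use); mapIds() is the AnnotationDbi function for this kind of one-column lookup, and the variable name symbols is introduced here only for illustration.

```r
library(AnnotationHub)
library(AnnotationDbi)

ah <- AnnotationHub()
orgdb <- query(ah, "OrgDb")[[1]]          # first OrgDb record, as in the text

# Take the first six Entrez gene identifiers as lookup keys
egid <- head(keys(orgdb, "ENTREZID"))

# mapIds() returns one value per key as a named character vector;
# multiVals = "first" keeps the first match when a key maps to several values
symbols <- mapIds(orgdb, keys = egid, column = "SYMBOL",
                  keytype = "ENTREZID", multiVals = "first")
symbols
```

The names of the returned vector are the keys, so symbols[egid[1]] looks up one gene directly.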

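1.2 Roadmap Epigenomics Project

All Roadmap Epigenomics files are hosted on a public web server (see the URL in the code below). If one had to download these files by hand, one would navigate through the web interface to find useful files, and then use something like the following R code.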

library(rtracklayer)   # provides import()
url <- "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E001-H3K4me1.broadPeak.gz"
filename <- basename(url)
download.file(url, destfile=filename)
if (file.exists(filename))
    data <- import(filename, format="bed")

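This would have to be repeated for every file, and the onus would lie on the user to identify, download, import, and manage the local disk location of these files.

AnnotationHub reduces this task to just a few lines of R code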

library(AnnotationHub)
ah = AnnotationHub()
## snapshotDate(): 2017-04-25
epiFiles <- query(ah, "EpigenomeRoadMap")

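A look at the value returned by epiFiles shows us that 18248 Roadmap resources are available via AnnotationHub. Additional information about the files is also available, e.g., where the files came from (dataprovider), genome, species, sourceurl, and sourcetype.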

epiFiles
## AnnotationHub with 18248 records
## # snapshotDate(): 2017-04-25
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: BigWigFile, GRanges, data.frame
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH28856"]]'
##
##             title
##   AH28856 | E001-H3K4me1.broadPeak.gz
##   AH28857 | E001-H3K4me3.broadPeak.gz
##   AH28858 | E001-H3K9ac.broadPeak.gz
##   AH28859 | E001-H3K9me3.broadPeak.gz
##   AH28860 | E001-H3K9me3.broadPeak.gz
##   ...       ...

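A good sanity check to ensure that we have files only from the Roadmap Epigenomics project is to check that all the files in the returned, smaller hub object come from Homo sapiens and the hg19 genome

unique(epiFiles$species)
## [1] "Homo sapiens"
unique(epiFiles$genome)
## [1] "hg19"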

Broadly, one can get an idea of the different files from this project by looking at the sourcetype

table(epiFiles$sourcetype)
##
##    BED BigWig    GTF    Zip    tab
##   8298   9932      3     14      1

To get a more descriptive idea of these different files one can use:

sort(table(epiFiles$description), decreasing=TRUE)
##
## Bigwig File containing -log10(p-value) signal tracks from EpigenomeRoadMap Project
##                                                                               6881
## Bigwig File containing fold enrichment signal tracks from EpigenomeRoadMap Project
##                                                                               2947
## Narrow ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
##                                                                               2894
## Broad ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
##                                                                               2534
## Gapped ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
##                                                                               2534
## Narrow DNasePeaks for consolidated epigenomes from EpigenomeRoadMap Project
##                                                                                131
## 15 state chromatin segmentations from EpigenomeRoadMap Project
##                                                                                127
## Broad domains on enrichment for DNase-seq for consolidated epigenomes from EpigenomeRoadMap Project
##                                                                                 78
## RRBS fractional methylation calls from EpigenomeRoadMap Project
##                                                                                 51
## Whole genome bisulphite fractional methylation calls from EpigenomeRoadMap Project
##                                                                                 37
## MeDIP/MRE(mCRF) fractional methylation calls from EpigenomeRoadMap Project
##                                                                                 16
## GencodeV10 gene/transcript coordinates and annotations corresponding to hg19 version of the human genome
##                                                                                  3
## RNA-seq read count matrix for intronic protein-coding RNA elements
##                                                                                  2
## RNA-seq read counts matrix for ribosomal gene exons
##                                                                                  2
## RPKM expression matrix for ribosomal gene exons
##                                                                                  2
## Metadata for EpigenomeRoadMap Project
##                                                                                  1
## RNA-seq read counts matrix for non-coding RNAs
##                                                                                  1
## RNA-seq read counts matrix for protein coding exons
##                                                                                  1
## RNA-seq read counts matrix for protein coding genes
##                                                                                  1
## RNA-seq read counts matrix for ribosomal genes
##                                                                                  1
## RPKM expression matrix for non-coding RNAs
##                                                                                  1
## RPKM expression matrix for protein coding exons
##                                                                                  1
## RPKM expression matrix for protein coding genes
##                                                                                  1
## RPKM expression matrix for ribosomal RNAs
##                                                                                  1

The 'metadata' provided by the Roadmap Epigenomics Project is also available. Note that the information displayed for a hub with a single resource is quite different from the information displayed when the hub references more than one resource.

metadata.tab <- query(ah, c("EpigenomeRoadMap", "Metadata"))
metadata.tab
## AnnotationHub with 1 record
## # snapshotDate(): 2017-04-25
## # names(): AH41830
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: data.frame
## # $title: EID_metadata.tab
## # $description: Metadata for EpigenomeRoadMap Project
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: tab
## # $sourceurl: http://egg2.wustl.edu/roadmap/data/byFileType/metadata/EID_metadata.tab
## # $sourcesize: 18035
## # $tags: c("EpigenomeRoadMap", "Metadata")
## # retrieve record with 'object[["AH41830"]]'

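So far we have been exploring information about resources, without downloading any resource to the local cache and importing it into R. One can retrieve the resource by using [[ as indicated at the end of the show method

metadata.tab <- ah[["AH41830"]]
## loading from cache '/home/biocbuild//.AnnotationHub/47270'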

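The metadata.tab file is returned as a data.frame. The first 6 rows of the first 5 columns are shown here:

metadata.tab[1:6, 1:5]
##    EID    GROUP   COLOR          MNEMONIC                                   STD_NAME
## 1 E001      ESC #924965            ESC.I3                                ES-I3 Cells
## 2 E002      ESC #924965           ESC.WA7                               ES-WA7 Cells
## 3 E003      ESC #924965            ESC.H1                                   H1 Cells
## 4 E004 ES-deriv #4178AE ESDR.H1.BMP4.MESO H1 BMP4 Derived Mesendoderm Cultured Cells
## 5 E005 ES-deriv #4178AE ESDR.H1.BMP4.TROP H1 BMP4 Derived Trophoblast Cultured Cells
## 6 E006 ES-deriv #4178AE       ESDR.H1.MSC          H1 Derived Mesenchymal Stem Cells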

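One can keep constructing different queries using multiple arguments to trim these 18248 records down to the files one wants. For example, to get the ChIP-Seq files for consolidated epigenomes, one could use

bpChipEpi <- query(ah, c("EpigenomeRoadMap", "broadPeak", "chip", "consolidated"))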

To get all the bigWig signal files, one can query the hub using

allBigWigFiles <- query(ah, c("EpigenomeRoadMap", "BigWig"))

To access the 15 state chromatin segmentations, one can use

seg <- query(ah, c("EpigenomeRoadMap", "segmentations"))

If one is interested in getting all the files related to one sample

E126 <- query(ah, c("EpigenomeRoadMap", "E126", "H3K4ME2"))
E126
## AnnotationHub with 6 records
## # snapshotDate(): 2017-04-25
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: BigWigFile, GRanges
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH29817"]]'
##
##             title
##   AH29817 | E126-H3K4me2.broadPeak.gz
##   AH30868 | E126-H3K4me2.narrowPeak.gz
##   AH31801 | E126-H3K4me2.gappedPeak.gz
##   AH32990 | E126-H3K4me2.fc.signal.bigwig
##   AH34022 | E126-H3K4me2.pval.signal.bigwig
##   AH40177 | E126-H3K4me2.imputed.pval.signal.bigwig

Hub resources can also be selected using $, subset(), and display(); see the main AnnotationHub vignette for additional detail.
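
As a rough illustration of these alternatives, the sketch below filters the Roadmap records by a metadata column instead of by a free-text query. It assumes network access to the hub; the variable name bedFiles is introduced here for illustration only, and display() is left as a comment because it opens an interactive browser.

```r
library(AnnotationHub)

ah <- AnnotationHub()
epiFiles <- query(ah, "EpigenomeRoadMap")

# `$` extracts one metadata column as a plain vector
head(epiFiles$title)

# subset() keeps the records for which the expression, evaluated against
# the metadata columns, is TRUE
bedFiles <- subset(epiFiles, sourcetype == "BED")
length(bedFiles)

# display(epiFiles)   # interactive selection; requires an interactive session
```

Filtering on a column such as sourcetype is exact, whereas query() matches its patterns against several metadata fields at once.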

Hub resources are imported as the appropriate Bioconductor object for use in further analysis. For example, peak files are returned as GRanges objects.

peaks <- E126[['AH29817']]
## require("rtracklayer")
## loading from cache '/home/biocbuild//.AnnotationHub/35257'
seqinfo(peaks)
## Seqinfo object with 93 sequences (1 circular) from hg19 genome:
##   seqnames       seqlengths isCircular genome
##   chr1            249250621      FALSE   hg19
##   chr2            243199373      FALSE   hg19
##   chr3            198022430      FALSE   hg19
##   chr4            191154276      FALSE   hg19
##   chr5            180915260      FALSE   hg19
##   ...                   ...        ...    ...
##   chrUn_gl000245      36651      FALSE   hg19
##   chrUn_gl000246      38154      FALSE   hg19
##   chrUn_gl000247      36422      FALSE   hg19
##   chrUn_gl000248      39786      FALSE   hg19
##   chrUn_gl000249      38502      FALSE   hg19

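BigWig files are returned as BigWigFile objects. A BigWigFile is a reference to a file on disk; the data in the file can be read in using rtracklayer::import(), perhaps querying these large files for particular genomic regions of interest as described on the help page ?import.bw.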
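
To avoid downloading a large Roadmap signal file, the following sketch illustrates region-restricted import on the small test.bw file that ships in rtracklayer's tests directory (an assumption of this example, not a hub resource). The which argument restricts the import to the given ranges; the same call should apply to a BigWigFile retrieved from the hub.

```r
library(rtracklayer)

# Small BigWig file bundled with rtracklayer (stand-in for a hub resource)
test_bw <- file.path(system.file("tests", package = "rtracklayer"), "test.bw")
bw <- BigWigFile(test_bw)

# Region of interest; only data overlapping it is read from disk
roi <- GRanges("chr2", IRanges(1, 400))
signal <- import(bw, which = roi)
signal
```

The result is a GRanges with a score column, so it can be used directly with functions such as findOverlaps().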

Each record inside AnnotationHub is associated with a unique identifier. Most GRanges objects returned by AnnotationHub carry the unique AnnotationHub identifier of the resource from which the GRanges is derived. This comes in handy when one has been working with a GRanges object for a while and wants additional information about it (e.g., the name of the file in the cache, or the original sourceurl of the data underlying the resource).

metadata(peaks)
## $AnnotationHubName
## [1] "AH29817"
##
## $`File Name`
## [1] "E126-H3K4me2.broadPeak.gz"
##
## $`Data Source`
## [1] "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E126-H3K4me2.broadPeak.gz"
##
## $Provider
## [1] "BroadInstitute"
##
## $Organism
## [1] "Homo sapiens"
##
## $`Taxonomy ID`
## [1] 9606
ah[metadata(peaks)$AnnotationHubName]$sourceurl
## [1] "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E126-H3K4me2.broadPeak.gz"

1.3 Ensembl GTF and FASTA files for TxDb gene models and sequence queries

Bioconductor represents gene models using 'transcript' databases. These are available via packages such as TxDb.Hsapiens.UCSC.hg38.knownGene, or can be constructed using functions such as GenomicFeatures::makeTxDbFromBiomart().

AnnotationHub provides an easy way to work with gene models published by Ensembl. Let's see what Ensembl's Release-80 has in terms of data for pufferfish, Takifugu rubripes.

query(ah, c("Takifugu", "release-80"))
## AnnotationHub with 7 records
## # snapshotDate(): 2017-04-25
## # $dataprovider: Ensembl
## # $species: Takifugu rubripes
## # $rdataclass: FaFile, GRanges
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH47101"]]'
##
##             title
##   AH47101 | Takifugu_rubripes.FUGU4.80.gtf
##   AH47475 | Takifugu_rubripes.FUGU4.cdna.all.fa
##   AH47476 | Takifugu_rubripes.FUGU4.dna_rm.toplevel.fa
##   AH47477 | Takifugu_rubripes.FUGU4.dna_sm.toplevel.fa
##   AH47478 | Takifugu_rubripes.FUGU4.dna.toplevel.fa
##   AH47479 | Takifugu_rubripes.FUGU4.ncrna.fa
##   AH47480 | Takifugu_rubripes.FUGU4.pep.all.fa

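We see that there is a GTF file describing gene models, as well as various DNA sequences. Let's retrieve the GTF and top-level DNA sequence files. The GTF file is imported as a GRanges instance, the DNA sequence as a compressed, indexed FASTA file

gtf <- ah[["AH47101"]]
## loading from cache '/home/biocbuild//.AnnotationHub/52579'
## using guess work to populate seqinfo
dna <- ah[["AH47477"]]
## loading from cache '/home/biocbuild//.AnnotationHub/53323'
##     '/home/biocbuild//.AnnotationHub/53324'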
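head(gtf, 3)
## GRanges object with 3 ranges and 19 metadata columns:
##         seqnames         ranges strand |   source       type     score     phase            gene_id
##            <Rle>      <IRanges>  <Rle> | <factor>   <factor> <numeric> <integer>        <character>
##   [1] scaffold_1 [10422, 11354]      - |  ensembl       gene      <NA>      <NA> ENSTRUG00000003702
##   [2] scaffold_1 [10422, 11354]      - |  ensembl transcript      <NA>      <NA> ENSTRUG00000003702
##   [3] scaffold_1 [10422, 11354]      - |  ensembl       exon      <NA>      <NA> ENSTRUG00000003702
##       gene_version gene_source   gene_biotype      transcript_id transcript_version
##          <numeric> <character>    <character>        <character>          <numeric>
##   [1]            1     ensembl protein_coding               <NA>               <NA>
##   [2]            1     ensembl protein_coding ENSTRUT00000008740                  1
##   [3]            1     ensembl protein_coding ENSTRUT00000008740                  1
##       transcript_source transcript_biotype exon_number            exon_id exon_version  protein_id
##             <character>        <character>   <numeric>        <character>    <numeric> <character>
##   [1]              <NA>               <NA>        <NA>               <NA>         <NA>        <NA>
##   [2]           ensembl     protein_coding        <NA>               <NA>         <NA>        <NA>
##   [3]           ensembl     protein_coding           1 ENSTRUE00000055472            1        <NA>
##       protein_version   gene_name transcript_name
##             <numeric> <character>     <character>
##   [1]            <NA>        <NA>            <NA>
##   [2]            <NA>        <NA>            <NA>
##   [3]            <NA>        <NA>            <NA>
##   -------
##   seqinfo: 2056 sequences (1 circular) from FUGU4 genome; no seqlengths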
dna
## class: FaFile
## path: /home/biocbuild//.AnnotationHub/53323
## index: /home/biocbuild//.AnnotationHub/53324
## isOpen: FALSE
## yieldSize: NA
head(seqlevels(dna))
## [1] "scaffold_1" "scaffold_2" "scaffold_3" "scaffold_4" "scaffold_5" "scaffold_6"

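Let's identify the 25 longest DNA sequences, and keep just the annotations on these scaffolds.

keep <- names(tail(sort(seqlengths(dna)), 25))
gtf_subset <- gtf[seqnames(gtf) %in% keep]

It is trivial to make a TxDb instance of this subset (or of the entire gtf)

library(GenomicFeatures)         # for makeTxDbFromGRanges
txdb <- makeTxDbFromGRanges(gtf_subset)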

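and to use that in conjunction with the DNA sequences, e.g., to find exon sequences of all annotated genes.

library(Rsamtools)               # for getSeq,FaFile-method
exons <- exons(txdb)
length(exons)
## [1] 66219
getSeq(dna, exons)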
##   A DNAStringSet instance of length 66219
##           width seq                                                             names
##       [1]   ... ...CCTGCAGGAGAGTCTGGACGAGCTTATCCAG                              scaffold_1
##       [2]   105 ACTCAGCAGATCACCCCTCAGCTGGCTCTCC...TAATCGTGTCCGCAACCGTGTGAACTTCAGG scaffold_1
##       [3]   156 GGTTCTCTCAACCTACCGGTTCTGTGACA...                                scaffold_1
##       [4]    88 CAAACCAATCTCCTCGCTGTCTCTTCTCGTT...                              scaffold_1
##       [5]   ... ...AACACAGTGTGGAGACTTCAGAGGACGCCAC                              scaffold_1
##       ...   ... ...
##   [66215]    67 ACGACTGGATGACAACATCAGGACCGGGGGTA...                             scaffold_9
##   [66216]    50 TCTTTGGCTAATATTGACGATGTGGTAAACAAGATTCGTCTGAAGATTCG              scaffold_9
##   [66217]    81 GTATTTCCCAGCCAAGACCCGCTGGACAGGG...ATACATCAACACTGTTTCCCACCGAGCAG scaffold_9
##   [66218]    87 ATGATGGAGGAGAAGAATTTGATTGCGG...GACCCCAGAGGTGCAGCTAGCAATTGAACAG  scaffold_9
##   [66219]   ... ...GACAGCTGCTGTTCGCCTGTTTCCCCCCCCC                              scaffold_9

There is a one-to-one mapping between the genomic ranges contained in exons and the DNA sequences returned by getSeq().
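
The one-to-one correspondence can be checked on a toy example without downloading the Fugu genome. This is a minimal sketch under stated assumptions: the two scaffolds, their sequences, and the names toy_dna and toy_exons are invented for illustration; only getSeq(), FaFile(), and indexFa() come from Rsamtools, and writeXStringSet() from Biostrings.

```r
library(Rsamtools)
library(Biostrings)
library(GenomicRanges)

# Write a tiny two-scaffold FASTA file and index it, to mimic the hub's FaFile
fa <- tempfile(fileext = ".fa")
writeXStringSet(DNAStringSet(c(scaffold_1 = "ACGTACGTAC",
                               scaffold_2 = "TTTTGGGGCC")), fa)
indexFa(fa)
toy_dna <- FaFile(fa)

# Two toy "exons", one per scaffold
toy_exons <- GRanges(c("scaffold_1", "scaffold_2"),
                     IRanges(start = c(2, 5), end = c(5, 8)),
                     strand = c("+", "-"))

seqs <- getSeq(toy_dna, toy_exons)

# Element i of seqs corresponds to range i of toy_exons
length(seqs) == length(toy_exons)
all(width(seqs) == width(toy_exons))
```

Because the order is preserved, metadata columns of the ranges can be paired with the sequences by position.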

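Some difficulties arise when working with this partly assembled genome that require more advanced GenomicRanges skills; see the GenomicRanges vignettes, especially "GenomicRanges HOWTOs" and "An Introduction to GenomicRanges".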

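1.4 liftOver to map between genome builds

Suppose we want to lift features from one genome build to another, e.g., because annotations were generated for hg19 but our experimental analysis used hg38. UCSC provides 'liftover' chain files for mapping between genome builds.

In this example, we take the broadPeak GRanges for E126, which comes from the 'hg19' genome, and lift these features over to their 'hg38' coordinates.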

chainfiles <- query(ah, c("hg38", "hg19", "chainfile"))
chainfiles
## AnnotationHub with 2 records
## # snapshotDate(): 2017-04-25
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: ChainFile
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH14108"]]'
##
##             title
##   AH14108 | hg38ToHg19.over.chain.gz
##   AH14150 | hg19ToHg38.over.chain.gz

We are interested in the file that lifts features over from hg19 to hg38, so let's download it using

chain <- chainfiles[['AH14150']]
## loading from cache '/home/biocbuild//.AnnotationHub/18245'
chain
## Chain of length 25
## names(25): chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 ... chr18 chr19 chr20 chr21 chr22 chrX chrY chrM

Perform the liftOver operation using rtracklayer::liftOver():

library(rtracklayer)
gr38 <- liftOver(peaks, chain)

This returns a GRangesList; update the genome of the result to get the final result

genome(gr38) <- "hg38"
gr38
## GRangesList object of length 153266:
## [[1]]
## GRanges object with 1 range and 5 metadata columns:
##       seqnames               ranges strand |        name     score signalValue    pValue    qValue
##          <Rle>            <IRanges>  <Rle> | <character> <numeric>   <numeric> <numeric> <numeric>
##   [1]     chr1 [28667912, 28670147]      * |      Rank_1       189    10.55845  22.01316  18.99911
##
## [[2]]
## GRanges object with 1 range and 5 metadata columns:
##       seqnames               ranges strand |   name score signalValue   pValue   qValue
##   [1]     chr4 [54090990, 54092984]      * | Rank_2   188     8.11483 21.80441 18.80662
##
## [[3]]
## GRanges object with 1 range and 5 metadata columns:
##       seqnames               ranges strand |   name score signalValue   pValue   qValue
##   [1]      ...                  ...    ... | Rank_3   180     8.89834 20.97714 18.02816
##
## ...
## <153263 more elements>
## -------
## seqinfo: 23 sequences from hg38 genome; no seqlengths

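1.5 Working with dbSNP Variants

One may also be interested in working with common germline variants with evidence of medical interest. This information is available at NCBI.

Query the dbSNP files in the hub:

This returns a VcfFile which can be read in using the VariantAnnotation package; because VCF files can be large, readVcf() supports several strategies for importing only the relevant parts of the file (e.g., particular genomic locations, particular features of the variants); see ?readVcf for additional information.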
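
One such strategy is to restrict the import to a genomic region with ScanVcfParam(). The sketch below is illustrative only: it reads the small chr22.vcf.gz example file that ships with VariantAnnotation rather than the hub's dbSNP resource, and the region roi is an arbitrary choice for this example. A tabix-indexed VcfFile from the hub should accept the same param argument.

```r
library(VariantAnnotation)

# Example VCF (bgzip-compressed, tabix-indexed) bundled with VariantAnnotation
fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")

# Import only the records that overlap the region of interest
roi <- GRanges("22", IRanges(50301422, 50312106))
param <- ScanVcfParam(which = roi)
vcf_roi <- readVcf(TabixFile(fl), "hg19", param = param)

dim(vcf_roi)
```

ScanVcfParam() also takes info, geno, and fixed arguments to limit which fields are parsed, which reduces memory use further.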

library(VariantAnnotation)       # for readVcf()
variants <- readVcf(vcf, genome="hg19")
variants
## class: CollapsedVCF
## dim: 111138 0
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 58 columns: RS, RSPOS, RV, VP, GENEINFO, dbSNPBuildID, SAO, SSR, WGT, VC, PM, T...
## info(header(vcf)):
##                 Number Type    Description
##    RS           1      Integer dbSNP ID (i.e. rs number)
##    RSPOS        1      Integer Chr position reported in dbSNP
##    RV           0      Flag    RS orientation is reversed
##    VP           1      String  Variation Property.  Documentation is at ftp://ftp.ncbi.nlm.nih.g...
##    GENEINFO     1      String  Pairs each of gene symbol:gene id.  The gene symbol and id are de...
##    dbSNPBuildID 1      Integer First dbSNP Build for RS
##    SAO          1      Integer Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic...
##    SSR          1      Integer Variant Suspect Reason Codes (may be more than one value added to...
##    WGT          1      Integer Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 o...
##    VC           1      String  Variation Class
##    PM           0      Flag    Variant is Precious(Clinical,Pubmed Cited)
##    TPA          0      Flag    Provisional Third Party Annotation(TPA) (currently rs from PHARMG...
##    PMC          0      Flag    Links exist to PubMed Central article
##    S3D          0      Flag    Has 3D structure - SNP3D table
##    SLO          0      Flag    Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out
##    NSF          0      Flag    Has non-synonymous frameshift A coding region variation where one...
##    NSM          0      Flag    Has non-synonymous missense A coding region variation where one a...
##    NSN          0      Flag    Has non-synonymous nonsense A coding region variation where one a...
##    REF          0      Flag    Has reference A coding region variation where one allele in the s...
##    SYN          0      Flag    Has synonymous A coding region variation where one allele in the ...
##    U3           0      Flag    In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53
##    U5           0      Flag    In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55
##    ASS          0      Flag    In acceptor splice site FxnCode = 73
##    DSS          0      Flag    In donor splice-site FxnCode = 75
##    INT          0      Flag    In Intron FxnCode = 6
##    R3           0      Flag    In 3' gene region FxnCode = 13
##    R5           0      Flag    In 5' gene region FxnCode = 15
##    OTH          0      Flag    Has other variant with exactly the same set of mapped positions o...
##    CFL          0      Flag    Has Assembly conflict. This is for weight 1 and 2 variant that ma...
##    ASP          0      Flag    Is Assembly specific. This is set if the variant only maps to one...
##    MUT          0      Flag    Is mutation (journal citation, explicit fact): a low frequency va...
##    VLD          0      Flag    Is Validated.  This bit is set if the variant has 2+ minor allele...
##    G5A          0      Flag    >5% minor allele frequency in each and all populations
##    G5           0      Flag    >5% minor allele frequency in 1+ populations
##    HD           0      Flag    Marker is on high density genotyping kit (50K density or greater)...
##    GNO          0      Flag    Genotypes available. The variant has individual genotype (in SubI...
##    KGPhase1     0      Flag    1000 Genome phase 1 (incl. June Interim phase 1)
##    KGPhase3     0      Flag    1000 Genome phase 3
##    CDA          0      Flag    Variation is interrogated in a clinical diagnostic assay
##    LSD          0      Flag    Submitted from a locus-specific database
##    MTP          0      Flag    Microattribution/third-party annotation(TPA:GWAS,PAGE)
##    OM           0      Flag    Has OMIM/OMIA
##    NOC          0      Flag    Contig allele not present in variant allele list. The reference s...
##    WTD          0      Flag    Is Withdrawn by submitter If one member ss is withdrawn by submit...
##    NOV          0      Flag    Rs cluster has non-overlapping allele sets. True when rs set has ...
##    CAF          .      String  An ordered, comma delimited list of allele frequencies based on 1...
##    COMMON       1      Integer RS is a common SNP.  A common SNP is one that has at least one 10...
##    CLNHGVS      .      String  Variant names from HGVS.    The order of these variants correspon...
##    CLNALLE      .      Integer Variant alleles from REF or ALT columns.  0 is REF, 1 is the firs...
##    CLNSRC       .      String  Variant Clinical Chanels
##    CLNORIGIN    .      String  Allele Origin. One or more of the following values may be added: ...
##    CLNSRCID     .      String  Variant Clinical Channel IDs
##    CLNSIG       .      String  Variant Clinical Significance, 0 - Uncertain significance, 1 - no...
##    CLNDSDB      .      String  Variant disease database name
##    CLNDSDBID    .      String  Variant disease database ID
##    CLNDBN       .      String  Variant disease name
##    CLNREVSTAT   .      String  no_assertion - No assertion provided, no_criteria - No assertion ...
##    CLNACC       .      String  Variant Accession and Versions
## geno(vcf):
##   SimpleList of length 0:

rowRanges() returns information from the CHROM, POS and ID fields of the VCF file, represented as a GRanges instance

rowRanges(variants)
## GRanges object with 111138 ranges and 5 metadata columns:
##               seqnames             ranges strand | paramRangeID            REF                ALT
##                  <Rle>          <IRanges>  <Rle> |     <factor> <DNAStringSet> <DNAStringSetList>
##   rs786201005        1 [1014143, 1014143]      * |         <NA>              C                  T
##   rs672601345        1 [1014316, 1014316]      * |         <NA>              C                 CG
##   rs672601312        1 [1014359, 1014359]      * |         <NA>              G                  T
##   rs115173026        1 [1020217, 1020217]      * |         <NA>              G                  T
##   rs201073369        1 [1020239, 1020239]      * |         <NA>              G                  C
##           ...      ...                ...    ... .          ...            ...                ...
##   rs527236200       MT     [15943, 15943]      * |         <NA>              T                  C
##   rs118203890       MT     [15950, 15950]      * |         <NA>              G                  A
##   rs199474700       MT     [15965, 15965]      * |         <NA>              G                  A
##   rs199474701       MT     [15967, 15967]      * |         <NA>              G                  A
##   rs199474699       MT     [15990, 15990]      * |         <NA>              C                  T
##                    QUAL      FILTER
##               <numeric> <character>
##   rs786201005      <NA>           .
##   rs672601345      <NA>           .
##   rs672601312      <NA>           .
##   rs115173026      <NA>           .
##   rs201073369      <NA>           .
##           ...       ...         ...
##   rs527236200      <NA>           .
##   rs118203890      <NA>           .
##   rs199474700      <NA>           .
##   rs199474701      <NA>           .
##   rs199474699      <NA>           .
##   -------
##   seqinfo: 25 sequences from hg19 genome; no seqlengths

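Note that the broadPeak files follow the UCSC chromosome naming convention, whereas the VCF data follow the NCBI style. To bring both sets of ranges to the same chromosome naming convention (i.e., UCSC), we use

seqlevelsStyle(variants) <- seqlevelsStyle(peaks)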
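
The effect of this assignment is easiest to see on a few hand-made ranges. The sketch below is a toy illustration, assuming nothing beyond GenomicRanges and GenomeInfoDb; the objects toy_variants and toy_peaks are invented here and stand in for variants and peaks.

```r
library(GenomicRanges)

# NCBI-style names ("1", "22", "MT") versus UCSC-style names ("chr1", "chr22")
toy_variants <- GRanges(c("1", "22", "MT"), IRanges(c(100, 200, 300), width = 1))
toy_peaks <- GRanges(c("chr1", "chr22"), IRanges(c(50, 150), width = 100))

seqlevels(toy_variants)                 # names before conversion

# Rename the variant seqlevels to whatever style the peaks use
seqlevelsStyle(toy_variants) <- seqlevelsStyle(toy_peaks)

seqlevels(toy_variants)                 # names after conversion
```

Only the sequence names change; coordinates are untouched, so this is not a substitute for liftOver() when the two objects come from different genome builds.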

Finally, to find which variants overlap these broadPeaks we use:

overlap <- findOverlaps(variants, peaks)
overlap
## Hits object with 10904 hits and 0 metadata columns:
##           queryHits subjectHits
##           <integer>   <integer>
##       [1]        35       20333
##       [2]        36       20333
##       [3]        37       20333
##       [4]        38       20333
##       [5]        41        7733
##       ...       ...         ...
##   [10900]    110761       21565
##   [10901]    110762       21565
##   [10902]    110763       21565
##   [10903]    110764       21565
##   [10904]    110765       21565
##   -------
##   queryLength: 111138 / subjectLength: 153266

Some insight into how these results can be interpreted comes from looking at a particular peak, e.g., the 3852nd peak

idx <- subjectHits(overlap) == 3852
overlap[idx]
## Hits object with 39 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]    102896        3852
##    [2]    102897        3852
##    [3]    102898        3852
##    [4]    102899        3852
##    [5]    102900        3852
##    ...       ...         ...
##   [35]    102930        3852
##   [36]    102931        3852
##   [37]    102932        3852
##   [38]    102933        3852
##   [39]    102934        3852
##   -------
##   queryLength: 111138 / subjectLength: 153266

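There are 39 variants overlapping this peak, one per hit shown above; the coordinates of the peak and the overlapping variants are

peaks[3852]
## GRanges object with 1 range and 5 metadata columns:
##       seqnames               ranges strand |        name     score signalValue    pValue    qValue
##          <Rle>            <IRanges>  <Rle> | <character> <numeric>   <numeric> <numeric> <numeric>
##   [1]    chr22 [50622494, 50626143]      * |   Rank_3852        79     6.06768  10.18943   7.99818
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome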
rowRanges(variants)[queryHits(overlap[idx])]
## GRanges object with 39 ranges and 5 metadata columns:
##               seqnames               ranges strand | paramRangeID            REF                ALT
##                  <Rle>            <IRanges>  <Rle> |     <factor> <DNAStringSet> <DNAStringSetList>
##     rs6151429    chr22 [50625049, 50625049]      * |         <NA>              T                  C
##     rs6151428    chr22 [50625182, 50625182]      * |         <NA>              C                A,T
##   rs774153480    chr22 [50625182, 50625182]      * |         <NA>              C           CG,CGGGG
##   rs199476388    chr22 [50625204, 50625204]      * |         <NA>              C                  G
##    rs74315482    chr22 [50625213, 50625213]      * |         <NA>              G                ...
##           ...      ...                  ...    ... .          ...            ...                ...
##   rs199476369    chr22 [50625936, 50625936]      * |         <NA>              C                  G
##     rs2071421    chr22 [50625988, 50625988]      * |         <NA>              T                  C
##    rs74315475    chr22 [50626033, 50626033]      * |         <NA>              T                  A
##   rs398123419    chr22 [50626052, 50626052]      * |         <NA>              C                  A
##   rs398123418    chr22 [50626057, 50626057]      * |         <NA>              G                ...
##                    QUAL      FILTER
##               <numeric> <character>
##     rs6151429      <NA>           .
##     rs6151428      <NA>           .
##   rs774153480      <NA>           .
##   rs199476388      <NA>           .
##    rs74315482      <NA>           .
##           ...       ...         ...
##   rs199476369      <NA>           .
##     rs2071421      <NA>           .
##    rs74315475      <NA>           .
##   rs398123419      <NA>           .
##   rs398123418      <NA>           .
##   -------
##   seqinfo: 25 sequences from hg19 genome; no seqlengths

2 sessionInfo

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.2
##  [3] rtracklayer_1.36.6                VariantAnnotation_1.22.3
##  [5] SummarizedExperiment_1.6.5        DelayedArray_0.2.7
##  [7] matrixStats_0.52.2                Rsamtools_1.28.0
##  [9] Biostrings_2.44.2                 XVector_0.16.0
## [11] GenomicFeatures_1.28.5            AnnotationDbi_1.38.2
## [13] Biobase_2.36.2                    GenomicRanges_1.28.6
## [15] GenomeInfoDb_1.12.3               IRanges_2.10.5
## [17] S4Vectors_0.14.7                  AnnotationHub_2.8.3
## [19] BiocGenerics_0.22.1               BiocStyle_2.4.1
##
## loaded via a namespace (and not attached):
##  [1] lattice_0.20-35               htmltools_0.3.6               yaml_2.1.14
##  [4] interactiveDisplayBase_1.14.0 blob_1.1.0                    XML_3.98-1.9
##  [7] rlang_0.1.2                   DBI_0.7                       BiocParallel_1.10.1
## [10] bit64_0.9-7                   GenomeInfoDbData_0.99.0       stringr_1.2.0
## [13] zlibbioc_1.22.0               memoise_1.1.0                 evaluate_0.10.1
## [16] knitr_1.17                    biomaRt_2.32.1                httpuv_1.3.5
## [19] BiocInstaller_1.26.1          curl_3.0                      Rcpp_0.12.13
## [22] xtable_1.8-2                  backports_1.1.1               mime_0.5
## [25] bit_1.1-12                    digest_0.6.12                 stringi_1.1.5
## [28] shiny_1.0.5                   rprojroot_1.2                 grid_3.4.2
## [31] tools_3.4.2                   bitops_1.0-6                  magrittr_1.5
## [34] RCurl_1.95-4.8                tibble_1.3.4                  RSQLite_2.0
## [37] pkgconfig_2.0.1               Matrix_1.2-11                 rmarkdown_1.6
## [40] httr_1.3.1                    R6_2.2.2                      GenomicAlignments_1.12.2
## [43] compiler_3.4.2