Introduction

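Corces et al. measured RNA-seq and ATAC-seq in 13 distinct human primary hematopoietic cell types. Here we explore whether these data, compared with bulk GTEx blood RNA-seq, allow us to predict the regulation of NFE2. We use the same progressive approach to selecting candidate transcription factors (TFs): all annotated TFs, then all TFs with motifs, then TFs filtered by strict or relaxed binding and sequence-conservation scores.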

Corces et al., 2016: RNA-seq of FACS-sorted cells in erythropoiesis

Expression plus all known transcription factors

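The Gene Ontology project annotates 1663 human genes with the molecular function "DNA-binding transcription factor activity" (GO:0003700).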

> tfs.all <- sort(unique(select(org.Hs.eg.db, keys="GO:0003700", keytype="GOALL", columns="SYMBOL")$SYMBOL))
> length(tfs.all)   # 1663
> target.gene <- "NFE2"
> tfs <- intersect(tfs.all, rownames(mtx.corces))
> length(tfs)   # 1534
> solver <- EnsembleSolver(mtx.corces, target.gene, tfs, geneCutoff=1.0)
> tbl <- run(solver)
> dim(tbl)   # 1530 8
> new.order <- order(abs(tbl$pearsonCoeff), decreasing=TRUE)
> tbl <- tbl[new.order,]
> rownames(tbl) <- NULL
> tbl.goAll <- tbl   # [1:100,]
> head(tbl.goAll, n=20)
   gene       betaLasso    lassoPValue pearsonCoeff        rfScore    betaRidge spearmanCoeff       xgboost
 1 LMO2     0.047541125   5.249385e-20    0.8101163   8.702037e+00  0.007119946     0.6555403  1.500191e-01
 2 NFIX     0.093098353   5.384999e-20    0.8099891   6.852083e+00  0.008154300     0.7870852  5.241650e-02
 3 ZNF80   -0.066366470   6.490916e-20   -0.8096562   1.052108e+01 -0.008193182    -0.7093304  5.867285e-01
 4 GFI1B    0.226649379   2.578638e-19    0.8016940   2.483361e+00  0.009161609     0.7929971  4.310633e-08
 5 MITF     0.177632172   5.326661e-19    0.8009208   6.091928e+00  0.008086801     0.6164877  6.660835e-05
 6 MAFG     0.057656557   1.079359e-16    0.7905432          [...]        [...]     0.7665661  1.711896e-04
 7 GATA2    0.000000000   2.166205e-03    0.7691669   1.720552e+00  0.007813329     0.6851964  9.772821e-07
 8 IKZF3    0.000000000   1.000000e+00   -0.7620993   4.786622e+00 -0.006253381    -0.3980242  1.528179e-06
 9 LYL1     0.006733866 1.179134e[...]        [...]      [...]e+00  0.006740238     0.7822108  0.000000e+00
10 HOXA5    0.000000000   1.000000e+00    0.7312527   1.621898e+00  0.005573155     0.4879902  1.384144e-07
11 HOXA10   0.000000000 1.000000e[...]    0.7262372      [...]e-01  0.004681578     0.5537364  0.000000e+00
12 SP140    0.000000000          [...]   -0.7127619   1.478112e+00 -0.006094842    -0.5237590  0.000000e+00
13 ETS1    -0.020158108   4.193672e-11   -0.7070259   4.233599e+00 -0.007438296    -0.5267590  0.000000e+00
14 NR6A1    0.000000000   1.000000e+00    0.6971826   6.104783e-02  0.004443842     0.5772753  0.000000e+00
15 BCL11B   0.000000000   1.000000e+00   -0.6951305 6.063338e[...] -0.005375201    -0.4351164  0.000000e+00
16 IRF4    -0.012852384    2.79651e-10   -0.6919585          [...] -0.007734237    -0.5694599  0.000000e+00
17 CEBPA    0.012159852   2.784620e-10    0.6864443   4.485537e-01  0.007222885     0.5459188   7.62646e-04
18 MEIS1    0.000000000   1.000000e+00    0.6815312   1.375627e-01  0.003999963     0.5606262  0.000000e+00
19 CBFA2T3  0.000000000   1.000000e+00    0.6809550   3.927459e-03  0.005125665     0.5737458  4.681401e-06
20 SMAD1    0.000000000   4.842264e-02    0.6758645          [...]        [...]     0.5366757  0.000000e+00
> match(c("GATA1", "TAL1", "KLF1"), tbl.goAll$gene)
[1] 116  38  51
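The ranking step in the session above (sort by absolute Pearson correlation, then look up the rank of genes of interest with match) can be tried without trena or the Corces matrix. What follows is only a minimal base-R sketch under stated assumptions: the matrix mtx.toy, the gene names TARGET, TF.A, TF.B and TF.C, and all expression values are invented for illustration, and plain cor() stands in for the correlation columns that EnsembleSolver reports.

```r
# Toy expression matrix (hypothetical values): rows are genes, columns are samples.
mtx.toy <- rbind(
  TARGET = c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
  TF.A   = c(1.1, 2.0, 2.9, 4.2, 5.1, 5.9),   # rises with the target
  TF.B   = c(6.0, 4.0, 5.0, 3.0, 1.0, 2.0),   # falls as the target rises
  TF.C   = c(3.0, 1.0, 4.0, 1.0, 5.0, 2.0)    # little relationship
)
colnames(mtx.toy) <- paste0("sample", 1:6)

candidates <- setdiff(rownames(mtx.toy), "TARGET")

# Correlation of each candidate TF with the target gene across samples.
tbl.toy <- data.frame(
  gene = candidates,
  pearsonCoeff = sapply(candidates, function(tf)
    cor(mtx.toy[tf, ], mtx.toy["TARGET", ])),
  spearmanCoeff = sapply(candidates, function(tf)
    cor(mtx.toy[tf, ], mtx.toy["TARGET", ], method = "spearman")),
  stringsAsFactors = FALSE
)

# Strongest absolute Pearson correlation first, so that anti-correlated
# candidates rank alongside positively correlated ones.
new.order <- order(abs(tbl.toy$pearsonCoeff), decreasing = TRUE)
tbl.toy <- tbl.toy[new.order, ]
rownames(tbl.toy) <- NULL

# Rank (row position) of selected genes in the sorted table.
ranks <- match(c("TF.C", "TF.A"), tbl.toy$gene)
print(tbl.toy)
print(ranks)
```

Sorting on abs() rather than on the signed coefficient is what lets a repressor-like candidate such as TF.B sit near the top of the table next to activator-like candidates.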

Expression plus transcription factors with known motifs

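Combining the JASPAR 2018 and HOCOMOCO transcription factor compendia identifies 780 transcription factors with annotated motifs. In building the next model, candidate transcription factors are restricted to this set.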

> tfs.with.motifs <- sort(unique(mcols(query(MotifDb, c("sapiens"), c("jaspar2018", "hocomoco")))$geneSymbol))
> length(tfs.with.motifs)
[1] 780
> tfs <- intersect(tfs.with.motifs, rownames(mtx.corces))
> length(tfs)
[1] 509
> solver <- EnsembleSolver(mtx.corces, target.gene, tfs, geneCutoff=1.0, solverNames=solverNames)
> suppressWarnings(tbl <- run(solver))
> dim(tbl)   # 507 8
> new.order <- order(abs(tbl$pearsonCoeff), decreasing=TRUE)
> tbl <- tbl[new.order,]
> rownames(tbl) <- NULL
> tbl.withMotifs <- tbl   # [1:100,]
> head(tbl.withMotifs, n=20)
   gene       betaLasso    lassoPValue pearsonCoeff      rfScore     betaRidge spearmanCoeff        xgboost
 1 NFIX      0.10626036   5.375931e-20    0.8099891  10.93402127         [...]     0.7870852   5.449019e-01
 2 GFI1B     0.25679512   3.100623e-19    0.8016940   0.88338652   0.022793570     0.7929971   2.273725e-02
 3 MITF      0.20075697   2.815766e-19    0.8009208   7.83112670   0.016352198     0.6164877   1.779597e-01
 4 MAFG      0.06992969   2.477448e-17    0.7905432   5.69753582   0.017553696     0.7665661  3.3169636e-05
 5 GATA2     0.00687861   6.610933e-02    0.7691669   0.99289713   0.016963665     0.6851964   5.200143e-05
 6 HOXA5     0.00000000   1.000000e+00    0.7312527   3.05457854         [...]     0.4879902   0.000000e+00
 7 HOXA10    0.00000000   1.420628e-01    0.7262372   0.75941803   0.014015820     0.5537364   6.400621e-06
 8 ETS1     -0.06970416   3.556011e-13   -0.7070259   8.07010777  -0.018193677    -0.5267590   0.000000e+00
 9 NR6A1     0.00000000   2.461098e-01    0.6971826   0.41803310   0.011750808     0.5772753   0.000000e+00
10 IRF4     -0.02169888 1.360509e[...]   -0.6919585        [...]  -0.020261614    -0.5694599   0.000000e+00
11 CEBPA     0.03482203   9.188178e-11    0.6864443   0.69499215   0.014358728     0.5459188   1.349195e-04
12 MEIS1     0.00000000   1.000000e+00    0.6815312   0.33666404   0.012905598     0.5606262   0.000000e+00
13 SMAD1     0.00000000   4.204800e-02    0.6758645   3.70105486   0.016631861     0.5366757   1.032892e-08
14 TFEC      0.00000000   1.000000e+00    0.6740365   1.17192544   0.013276577     0.4344201      [...]e-01
15 RFX2      0.00000000   1.000000e+00    0.6670588   0.03802791   0.011815788         [...]   1.071266e-06
16 MYBL1     0.00000000   1.000000e+00   -0.6585463   1.31516074  -0.014894704    -0.4524598   0.000000e+00
17 MYCN      0.00000000   1.000000e+00    0.6554632   0.09194602   0.014752993     0.5671422  2.8409133e-06
18 TBX21     0.00000000   1.000000e+00   -0.6442284   0.28941161  -0.013417225    -0.4703482   4.371519e-08
19 FOSB      0.00000000   1.000000e+00    0.6433983   0.36776177   0.011620625     0.4334291   0.000000e+00
20 [...]          [...]          [...]    0.6353066   0.14183513   0.009995614 0.279443[...]      [...]e-07
> match(c("GATA1", "TAL1", "KLF1"), tbl.withMotifs$gene)
[1] 58 21 28

Expression plus highly conserved, high-scoring transcription factors in a 20 kb regulatory region

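We hypothesize that transcription factor binding sites with well-matched motifs, found in highly conserved regulatory regions within +/- 10 kb of the target gene's TSS, are likely to be functional. When such sites are found, and when TF and target gene expression are also correlated or anti-correlated, they are likely to be useful trena predictions that merit further consideration.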

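Here we use a precomputed table of FIMO and phast7 scores covering the 20 kb around the NFE2 transcription start site, and extract only those TFs whose motif match and conservation are both very high. With these data and assumptions, GATA1 rises to rank 8 in the model, with a Pearson correlation of 0.5. This agrees with expectations and with published findings.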

phast.score <- 0.90
tbl.fimo.strong <- subset(tbl.fimoMotifs, p.value <= fimo.score & phast7 >= phast.score)
dim(tbl.fimo.strong)
tfs <- sort(unique(tbl.fimo.strong$tf))
length(tfs)   # 52
solver <- EnsembleSolver(mtx.corces, target.gene, tfs, geneCutoff=1.0, solverNames=solverNames)
tbl <- run(solver)
dim(tbl)
new.order <- order(abs(tbl$pearsonCoeff), decreasing=TRUE)
tbl <- tbl[new.order,]
rownames(tbl) <- NULL
tbl.corces.fimo <- tbl
head(tbl.corces.fimo, n=20)
   gene       betaLasso    lassoPValue pearsonCoeff      rfScore     betaRidge spearmanCoeff       xgboost
 1 NR6A1     0.11363323   4.828668e-13    0.6971826    9.0740837   0.086916431     0.5772753  1.054292e-02
 2 IRF4     -0.20320953   7.953286e-13   -0.6919585    9.9411368  -0.119304407    -0.5694599  2.641889e-02
 3 CEBPA     0.24585247   2.816974e-12    0.6864443   14.2590463   0.091246714     0.5459188  5.745650e-01
 4 TAL1      0.03548830   1.393689e-08    0.6295745    6.9310420   0.098972619     0.6479781  2.837518e-02
 5 KLF1      0.21135723   1.836624e-10    0.6101488        [...]   0.099546351     0.7398168  5.372569e-02
 6 EGR1      0.03228123   2.758436e-06    0.5675901    4.7301588   0.072134395     0.4494879  2.970836e-03
 7 KLF4      0.03207437   1.354203e-05    0.5561067    6.2918779         [...]     0.3475569  4.553295e-04
 8 GATA1     0.00000000   8.364670e-01    0.5002867    1.2874843   0.071154365     0.5965923  2.693193e-01
 9 SPI1      0.00000000   3.861188e-03    0.4970352    1.4543629       [...]43     0.4420989  3.514533e-05
10 WT1       0.00000000   3.669754e-03    0.4620940    0.7724266   0.066273775     0.3997158  2.638663e-03
11 MAZ       0.00000000   8.521497e-01    0.4337519    1.1328829   0.037589727     0.5226550  3.047602e-03
12 KLF16     0.00000000  7.2584461e-01    0.3626252    0.9036973   0.018445763     0.4697260  1.414089e-03
13 NFIC      0.00000000   6.982975e-01    0.3594703    0.7852467   0.022511178     0.4386880  4.238493e-05
14 SP4       0.00000000   6.701522e-01   -0.3519025    0.9325721  -0.040376487    -0.2032425  3.356494e-04
15 RARA      0.00000000   6.907189e-01        [...]  [...]344835   0.012695432     0.3050455  1.760672e-06
16 KLF8      0.00000000   3.740316e-02   -0.2703237        [...]  -0.058074080    -0.1528057  8.541978e-05
17 SP1       0.00000000   3.059809e-01    0.2605265    0.3552120   0.038543500     0.3223789  1.450075e-04
18 MNT       0.00000000   4.523336e-01    0.2338399    0.5426188   0.001283042     0.3682946  6.916575e-03
19 TFCP2     0.00000000   9.849067e-01    0.2283194    0.8576896   0.026733904         [...]  1.616500e-03
20 STAT3     0.00000000   9.734015e-01    0.2222853        [...]   0.011888279     0.4048896  2.600591e-03
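The candidate-selection step used for this model (keep only motif hits that are both well matched and well conserved, then keep the TFs that also have expression data) can be sketched on its own in base R. This is a minimal illustration, not the analysis itself: the data frame tbl.hits and its rows are invented, the TF names are hypothetical, and the motif p-value cutoff of 1e-5 is an assumption, since the value of fimo.score is not shown in the session above. Only the column names tf, p.value and phast7 and the 0.90 conservation cutoff come from the text.

```r
# Toy motif-hit table (hypothetical rows); one row per predicted binding site.
tbl.hits <- data.frame(
  tf      = c("TF.A", "TF.A", "TF.B", "TF.C", "TF.D"),
  p.value = c(1e-6,   5e-4,   2e-6,   3e-5,   1e-7),
  phast7  = c(0.95,   0.60,   0.40,   0.99,   0.92),
  stringsAsFactors = FALSE
)

fimo.cutoff  <- 1e-5   # assumed motif-match threshold
phast.cutoff <- 0.90   # conservation threshold quoted in the text

# A site must pass both thresholds to count as a strong hit.
tbl.strong <- subset(tbl.hits, p.value <= fimo.cutoff & phast7 >= phast.cutoff)
tfs.strong <- sort(unique(tbl.strong$tf))

# Only TFs with a row in the expression matrix can enter the model;
# expressed.genes is a stand-in for rownames() of that matrix.
expressed.genes <- c("TF.A", "TF.B", "TF.C")
tfs.candidates <- intersect(tfs.strong, expressed.genes)
print(tfs.candidates)
```

In this toy case TF.D passes both thresholds but is dropped at the intersect step because it has no expression row, the same kind of reduction that shrinks a candidate list before it reaches the solver.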

All three of these transcription factors now appear among the top regulators in the model:

> match(c("GATA1", "TAL1", "KLF1"), tbl.corces.fimo$gene)
[1] 8 4 5

Cusanovich 2014 established that functional transcription factors tend to have

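Our heuristic selects only sites with very high conservation and sequence match, but TF binding is widely believed to be more promiscuous than that. We therefore add two columns to the model table, giving binding-site counts under strict and under relaxed motif/conservation thresholds. TFs that rank high in the model, and that have multiple binding sites by one or both measures, are given additional credence.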

tfbs.strong counts sites with a phast7 conservation score (opossum to primate) > 0.90 and a FIMO motif match < 1e-5.

tfbs.weak counts sites with phast7 > 0.5 and FIMO < 1e-4.

   gene     betaLasso lassoPValue pearsonCoeff rfScore betaRidge spearmanCoeff xgboost tfbs.strong tfbs.weak
 1 NR6A1        0.114    4.83e-13        0.697   9.074     0.087         0.577   0.011           1     [...]
 2 IRF4        -0.203    7.95e-13       -0.692   9.941    -0.119        -0.569   0.026           1         9
 3 CEBPA        0.246    2.82e-12        0.686  14.259     0.091         0.546   0.575           1         9
 4 TAL1         0.035    1.39e-08        0.630   6.931     0.099         0.648   0.028           1         2
 5 KLF1         0.211    1.84e-10        0.610   6.682     0.100         0.740   0.054           4        28
 6 EGR1         0.032    2.76e-06        0.568   4.730     0.072         0.449   0.003           2        11
 7 KLF4         0.032    1.35e-05        0.556   6.292     0.068         0.348   0.000           2         4
 8 GATA1        0.000    8.36e-01        0.500   1.287     0.071         0.597   0.269           1         4
 9 SPI1         0.000    3.86e-03        0.497   1.454     0.057         0.443   0.000           2         9
10 WT1          0.000    3.67e-03        0.462   0.772     0.066         0.400   0.003           2         4
11 MAZ          0.000    8.52e-01        0.434   1.133     0.038         0.523   0.003           1         6
12 KLF16        0.000    7.26e-01        0.363   0.904     0.018         0.470   0.001           2         7
13 NFIC         0.000    6.98e-01        0.359   0.785     0.023         0.439   0.000           4        36
14 SP4          0.000    6.70e-01       -0.352   0.933    -0.040        -0.203   0.000           1         3
15 RARA         0.000    6.91e-01        0.279   0.434     0.013         0.305   0.000           3        12
16 KLF8         0.000    3.74e-02       -0.270   1.202    -0.058        -0.153   0.000           1        10
17 SP1          0.000    3.06e-01        0.261   0.355     0.039         0.322   0.000           1         3
18 MNT          0.000    4.52e-01        0.234   0.543     0.001         0.368   0.007           1         3
19 TFCP2        0.000    9.85e-01        0.228   0.858     0.026         0.267   0.002           1         6
20 STAT3        0.000    9.73e-01        0.222   1.293     0.012         0.405   0.003           4        30
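The two count columns can be derived from a motif-hit table by applying each pair of thresholds and tallying the surviving sites per TF. The base-R sketch below shows one way to do that under stated assumptions: tbl.hits, tbl.model, the helper countSites and all values in them are hypothetical and invented for illustration, and only the thresholds (phast7 > 0.90 with FIMO < 1e-5, and phast7 > 0.5 with FIMO < 1e-4) are taken from the definitions above. It is not necessarily how the table above was produced.

```r
# Toy motif-hit table (hypothetical rows); one row per predicted binding site.
tbl.hits <- data.frame(
  tf      = c("TF.A", "TF.A", "TF.A", "TF.B", "TF.B", "TF.C"),
  p.value = c(1e-6,   5e-5,   2e-3,   3e-6,   8e-5,   9e-5),
  phast7  = c(0.95,   0.70,   0.99,   0.60,   0.55,   0.20),
  stringsAsFactors = FALSE
)

# Toy model table, already sorted by absolute correlation.
tbl.model <- data.frame(
  gene = c("TF.A", "TF.B", "TF.C"),
  pearsonCoeff = c(0.70, -0.65, 0.40),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of sites per gene passing both thresholds.
# Using factor levels keeps genes with zero qualifying sites in the result
# and returns the counts in the same order as `genes`.
countSites <- function(tbl, genes, max.p, min.phast) {
  keep <- tbl$p.value < max.p & tbl$phast7 > min.phast
  as.integer(table(factor(tbl$tf[keep], levels = genes)))
}

tbl.model$tfbs.strong <- countSites(tbl.hits, tbl.model$gene, 1e-5, 0.90)
tbl.model$tfbs.weak   <- countSites(tbl.hits, tbl.model$gene, 1e-4, 0.50)
print(tbl.model)
```

Because every strict hit also satisfies the relaxed thresholds, tfbs.weak is always at least as large as tfbs.strong, which is the pattern visible in the table above.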