咱们《生信技能树》的B站有一个LncRNA数据分析实战视频课程 ,缺乏配套笔记。年前雇佣了一个实习生 系统性跟着我的课程做了一个中国人群的肾癌队列转录组测序数据的Lnc RNA 的组装和鉴定,并且形成了完善的笔记小册子。
前面我们跑完了hisat2和stringtie流程,拿到了组装好的gtf文件。详见:LncRNA鉴定上游分析
接下来就需要对组装好的gtf文件里面的lincRNA 进行一系列的评估和过滤操作。
Gffcompare 获取转录本组装情况 我使用的代码是:
gtf=$HOME /reference/human/gtf/gencode.v37.annotation.gtf nohup gffcompare -R -r $gtf -o ./merged ../05.stringtie/02.merge_gtf/stringtie_merged.gtf > gffcompare.log 2>&1 &# 查看比对结果的准确性和预测率。 cat merged.stats#= Summary for dataset: ../05.stringtie/02.merge_gtf/stringtie_merged.gtf # Query mRNAs : 433529 in 70170 loci (403256 multi-exon transcripts) # (26951 multi-transcript loci, ~6.2 transcripts per locus) # Reference mRNAs : 232728 in 56612 loci (207676 multi-exon) # Super-loci w/ reference transcripts: 50985 #-----------------| Sensitivity | Precision | Base level: 99.6 | 25.3 | Exon level: 85.7 | 58.0 | Intron level: 99.4 | 65.9 | Intron chain level: 98.7 | 50.8 | Transcript level: 97.4 | 52.3 | Locus level: 96.0 | 71.4 | Matching intron chains: 204982 Matching transcripts: 226593 Matching loci: 54373 Missed exons: 453/638637 ( 0.1%) Novel exons: 185652/937901 ( 19.8%) Missed introns: 667/388679 ( 0.2%) Novel introns: 100294/586077 ( 17.1%) Missed loci: 0/56612 ( 0.0%) Novel loci: 19167/70170 ( 27.3%) Total union super-loci across all input datasets: 70152 433529 out of 433529 consensus transcripts written in ./merged.annotated.gtf (0 discarded as redundant)# 统计class code 类型 awk '$3!~/class/ {print $3}' merged.stringtie_merged.gtf.tmap | sort -V | uniq -c 1075 c 1 e 23378 i 95142 j 21682 k 8203 m 22690 n 7640 o 201 p 27 s 13609 u 10689 x 1087 y 228105 =
得到如下文件:
total 607M 145 2月 18 20:24 gffcompare.log 488M 2月 18 20:24 merged.annotated.gtf 15M 2月 18 20:24 merged.loci 1.4K 2月 18 20:24 merged.stats 12M 2月 18 20:24 merged.stringtie_merged.gtf.refmap 45M 2月 18 20:24 merged.stringtie_merged.gtf.tmap 49M 2月 18 20:24 merged.tracking
接下来主要的操作对象是 merged.stringtie_merged.gtf.tmap 文件。
step1:保留指定class_code的transcripts 过滤,只保留class_code="u","x","i","j","o"的 transcripts ,这个时候需要参考 stringtie官网提供的分类 :
我使用的脚本:
#过滤,只保留class_code="u","x","i","j","o"的 transcripts awk '{if ($3=="u" || $3=="x" || $3=="i" || $3=="j" || $3=="o"){print $0}}' ~/lncRNA_project/06.gffcompare/merged.stringtie_merged.gtf.tmap > filter1_by_uxijo.tmap $ wc -l filter1_by_uxijo.tmap 150458 filter1_by_uxijo.tmap#获取剩余transcripts的exons位置信息,提取序列并组装成转录本序列 # 获取剩余的transcripts的ID awk '{print $5}' filter1_by_uxijo.tmap > filter1_transcript_ID $ wc -l filter1_transcript_ID 150458 filter1_transcript_ID# 剩余的transcripts得到gtf grep -w -Ff filter1_transcript_ID -w ~/lncRNA_project/06.gffcompare/merged.annotated.gtf > filter1_transcript.gtf# 把filter2_transcript.gtf中的class_code "=" 替换为L # 去除剩余为去除的class_code "=" awk -F ';' '{if ($8!=" L"){print $0}}' filter1_transcript.gtf > filter1 mv filter1 filter1_transcript.gtf $ wc -l filter1_transcript.gtf 1596482 filter1_transcript.gtf
step2: 根据长度进行过滤 过滤,只保留exon>1并且长度>200bp的transcripts
# 过滤,只保留exon>1并且长度>200bp的transcripts awk '($6>1 && $10>=200){print$0}' ../step1/filter1_by_uxijo.tmap > filter2_by_exon_length.tmap $ wc -l filter2_by_exon_length.tmap 145635 filter2_by_exon_length.tmap awk '{print $5}' filter2_by_exon_length.tmap > filter2_transcript_ID grep -Ff filter2_transcript_ID -w ../step1/filter1_transcript.gtf > filter2_transcript.gtf # 剩余transcripts的exon组成gtf awk '($3=="exon"){print$0}' filter2_transcript.gtf > filter2_transcript_exon.gtf# 根据exon位置信息提取基因组序列,组装成转录本序列 gffread -w filter2_transcript_exon.fa -g /home/data/server/reference/genome/hg38/hg38.fa ./filter2_transcript_exon.gtf $ grep -c "^>" filter2_transcript_exon.fa 145635
这个步骤的 保留exon>1的选择是值得商榷的, 也有很多流程里面,并不会做这个操作。
step3:转录本编码能力预测 转录本编码能力预测,主要是4个软件,需要取交集:
nohup cpat.py -x ../dat/Human_Hexamer.tsv \ -d ../dat/Human_logitModel.RData \ -g ~/lncRNA_project/07.identification/step2/filter2_transcript_exon.fa \ -o ~/lncRNA_project/07.identification/step3/CPAT/cpat_result.txt > cpat.log 2>&1 & nohup Rscript ./LncFinder.R > LncFinder.log 2>&1 & conda create -n py2test python=2.7 mamba create -n py2test python=2.7 conda install biopython=1.70 nohup python CPC2.py -i ~/lncRNA_project/07.identification/step2/filter2_transcript_exon.fa -o ~/lncRNA_project/07.identification/step3/CPC2/CPC2_result.txt > cpc2.log 2>&1 & nohup python CNCI.py \ -f ~/lncRNA_project/07.identification/step2/filter2_transcript_exon.fa \ -o ~/lncRNA_project/07.identification/step3/CNCI \ -m ve \ -p 4 > cnci.log 2>&1 & nohup python PLEK.py \ -fasta ~/lncRNA_project/07.identification/step2/filter2_transcript_exon.fa \ -out ~/lncRNA_project/07.identification/step3/plek/plek \ -thread 4 > plek.log 2>&1 & less CPC2_result.txt|grep 'noncoding' |awk '{print $1}' > CPC2_id.txt $ wc -l CPC2_id.txt 55359 CPC2_id.txt less -S cpat_result.txt|awk '($6<0.364){print $1}' > cpat_id.txt sed -i '1d' file $ wc -l cpat_id.txt 51956 cpat_id.txt less lncFinder_result.txt|grep -w 'NonCoding' > lncFinder_id.txt $ wc -l lncFinder_id.txt 52442 lncFinder_id.txt less -S plek | grep -w 'Non-coding' |awk '{print $3}' |sed 's/>//g' > plek_id.txt $ wc -l plek_id.txt 55996 plek_id.txt# 4个软件取交集 cat *txt |sort |uniq -c |awk '{if( $1==4){print}}' |wc
step4:比对到Pfam据库 比对到Pfam据库,过滤 (E-value < 1e-5)
# 使用Transeq 将转录本序列翻译为6个可能的蛋白序列 transeq ../step3/intersection/filter3_by_noncoding_exon.fa filter4_protein.fa -frame=6 conda install pfam_scan nohup pfam_scan.pl -fasta ./filter4_protein.fa -dir /home/data/lihe/database/Pfam/ -out Pfam_scan.out > Pfam_scan.log 2>&1 &# 过滤 (E-value < 1e-5) grep -v '^#' Pfam_scan.out | grep -v '^\s*$' | awk '($13< 1e-5){print $1}' | awk -F "_" '{print$1}' | sort | uniq > coding.ID wc coding.ID 3751 ## 直接过滤 fastq 文件即可 grep -v -f coding.ID ../step3/intersection/filter3_transcript_ID > filter4_transcript_ID $ wc -l filter4_transcript_ID 30259 filter4_transcript_ID grep -Ff filter4_transcript_ID -w ../step3/intersection/filter3_by_noncoding.gtf > filter4_by_pfam.gtf awk '($3=="exon"){print$0}' filter4_by_pfam.gtf > filter4_by_pfam_exon.gtf gffread -w filter4_by_pfam_exon.fa -g /home/data/server/reference/genome/hg38/hg38.fa filter4_by_pfam_exon.gtf $ grep -c '^>' filter4_by_pfam_exon.fa 30259
step5:到NR数据库 diamond blastx 到NR数据库,过滤 (E-value < 1e-5)
nohup diamond blastx -e 1e-5 -d ~/database/blastDB/nr/diamond/nr -q ../step4/filter4_by_pfam_exon.fa -f 6 -o ./dna_matches.txt & $ less -S dna_matches.txt |cut -f 1 | sort -V | uniq -c |wc -l 25012# 过滤 (E-value < 1e-5) awk '($11<1e-2){print$1}' dna_matches.txt | sort | uniq > lncRNA.ID grep -v -f ../step5/lncRNA.ID -w ../step4/filter4_transcript_ID > filter5_by_nr_ID grep -Ff filter5_by_nr_ID -w ../step4/filter4_by_pfam.gtf > filter5_by_nr.gtf
step6:过滤掉低表达量的lncRNA 通过count数量或FPKM过滤掉低表达量的lncRNA
# featureCounts 统计count nohup featureCounts -T 8 -a \ ~/lncRNA_project/07. identification/step7/filter5_by_nr.gtf \ -o ./raw_count.txt -p -B -C -f -t transcript -g transcript_id \ ~/lncRNA_project/04. mapping/*.bam > transcript_featureCount.log 2 >&1 &# R 语言计算FPKM,筛选:FPKM > 0 in at least one sample ,得到lncRNA_id.txt rm(list=ls())# make count table raw_df <- read.table(file = "~/lncRNA_project/test/08.featurecounts/raw_count.txt" ,header = T ,skip = 1 ,sep = "\t" ) count_df <- raw_df[ ,c(7 :ncol(raw_df))] metadata <- raw_df[ ,1 :6 ] # 提取基因信息count数据前的几列 rownames(count_df) <- as.character(raw_df[,1 ]) colnames(count_df) <- paste0("SRR10744" ,251 :439 )# calculate FPKM countToFpkm <- function (counts, effLen) { N <- colSums(counts) exp( log(counts) + log(1e9 ) - log(effLen) - log(N) ) } options(scipen = 200 ) # 表示在200个数字以内都不使用科学计数法 fpkm = countToFpkm(count_df, metadata$Length)# View(fpkm) # FPKM > 0 in at least one sample count_df.filter <- count_df[rowSums(fpkm)>0 ,] write.table(rownames(count_df.filter,file="~/lncRNA_project/test/09.all_lncRNA/filter6_by_fpkm_id" , sep="\t" ,quote=F )# linux 里提取最终lncRNA的gtf文件 grep -Ff filter6_by_fpkm_id -w ~/lncRNA_project/07. identification/step7/filter5_by_nr.gtf > lncRNA.gtf# 根据exon位置信息提取基因组序列,组装成转录本序列 gffread -w lncRNA.fa -g /home/data/server/reference/genome/hg38/hg38.fa lncRNA.gtf $ grep -c '^>' lncRNA.fa
featureCounts对组装出的lncRNA、mRNA、other_RNA定量 接下来就可以拿到组装好的gtf文件,对原来的测序文件的比对后的bam文件进行定量操作。
# ############# lncRNA featureaCounts ###################### nohup featureCounts -t transcript -g transcript_id \ -Q 10 --primary -s 0 -p -f -T 8 \ -a ../lncRNA.gtf \ -o ./raw_count.txt \ ~/lncRNA_project/04.mapping/*.bam \> featureCounts.log 2>&1 & #########----------- protein_coding-------------------- ############# less -S gencode.v37.annotation.gtf | grep -w 'gene_type "protein_coding"' > protein_coding_gene.gtf less -S gencode.v37.annotation.gtf | grep -w 'gene_type "lncRNA"' > known_lncRNA_gene.gtf # featureCounts 定量 nohup featureCounts -t gene -g gene_id \ -Q 10 --primary -s 0 -p -f -T 8 \ -a ../protein_coding_gene.gtf \ -o ./raw_count.txt \ ~/lncRNA_project/04.mapping/*.bam \> featureCounts.log 2>&1 & #########----------- other_RNA-------------------- ############# less -S ~/lncRNA_project/10.mRNA/gencode.v37.annotation.gtf | grep -w -v 'gene_type "protein_coding"' > other_coding_gene.gtf # featureCounts 定量 nohup featureCounts -t gene -g gene_id \ -Q 10 --primary -s 0 -p -f -T 8 \ -a ../other_coding_gene.gtf \ -o ./raw_count.txt \ ~/lncRNA_project/04.mapping/*.bam \> featureCounts.log 2>&1 &
NONCODE v6_human数据库 blastn 到 NONCODE v6_human,区分组装出的lncRNA为:know_lncRAN、novel_lncRNA
#######----blastn ——> NONCODEv6_human ------------------############# cat NONCODEv6_human.fa |seqkit rmdup -s -o clean.fa -d duplicated.fa -D duplicated.detail.txt nohup makeblastdb -in clean.fa -dbtype nucl -parse_seqids -out NONCODEv6_human & # 将blastn结果中e-value<=1e-10,min-identity=80%,min-coverage=50%的序列筛选出来当作比对上的序列 blastn -db NONCODEv6_human -evalue 1e-10 -num_threads 10 -max_target_seqs 5 -query ~/lncRNA_project/09.lncRNA/test.fa -outfmt ' 6 qseqid sseqid pident qcovs length mismatch gapopen qstart qend sstart send qseq sseq evalue bitscore' -out test.txt
lncRNA的功能推断 大量lncRNA的功能是未知的,但是它们主要是cis-regulators,所以可以根据它们临近的蛋白编码基因功能来近似推断,然后表达量的相关性也可以类推到。
根据位置关系推断 使用bedtools等工具!对DEL lncRNA 提取顺式靶mRNA,产生了lncRNA基因座/蛋白质编码基因座对的列表!
bedtools window -a lncRNA_metadata.gtf -b ~/lncRNA_project/10.protein_coding_RNA/gencode.v37.annotation.gtf -l 100000 -r 10000 > test.txt
表达量的相关性 首先自己的转录组测序数据里面就有蛋白编码基因,know_lncRAN、novel_lncRNA各自的表达量矩阵,就可以计算相关性。
也可以看数据库:
比如杂志Cancer Medicine , 2020的文章《 Genome-wide DNA methylation analysis by MethylRad and the transcriptome profiles reveal the potential cancer-related lncRNAs in colon cancer》,在进行结直肠癌相关lncRNA的功能富集分析 ,就是采用LncRN2Target v2.0和StarBase分析与15个lncRNA共表达的蛋白编码基因 ,其中lncRNA HULC和ZNF667-AS1分别鉴定到28个、9个共表达的蛋白编码基因!
LncSEA数据库 LncSEA(http://bio./LncSEA/index.php)着重于收录已发表的人类各种lncRNA信息,并可以对用户提交的lncRNA集合进行注释和富集分析,提供超过40000种参考lncRNA集合,包括18个类型(miRNA,drug,disease,methylation pattern,cancer specific phenotype,lncRNA binding protein,cancer hallmark,subcellular localization,survival,lncRNA-eQTL,cell marker,enhancer,super-enhancer,transcription factor,accessible chromatin and smORF,exosome和conservation)和66个亚类,包含超过了5万条lncRNA。
LncSEA主要包括Analysis,Search,Browse,ID conversion,Download 5个功能模块!