用户名
UID
Email
密码
记住
立即注册
找回密码
只需一步,快速开始
微信扫一扫,快速登录
开启辅助访问
收藏本站
快捷导航
门户
Portal
社区
资讯
会议
市场
产品
问答
数据
专题
帮助
签到
每日签到
企业联盟
人才基地
独立实验室
产业园区
投资机构
检验科
招标动态
供给发布
同行交流
悬赏任务
共享资源
VIP资源
百科词条
互动话题
导读
动态
广播
淘贴
法规政策
市场营销
创业投资
会议信息
企业新闻
新品介绍
体系交流
注册交流
临床交流
同行交流
技术杂谈
检验杂谈
今日桔说
共享资源
VIP专区
企业联盟
投资机构
产业园区
业务合作
投稿通道
升级会员
联系我们
搜索
搜索
本版
文章
帖子
用户
小桔灯网
»
社区
›
H、检验医学区
›
室间质评
›
开箱即用版本 满分室间质评之GATK Somatic SNV+Indel+CN ...
图文播报
2025庆【网站十二周
2024庆中秋、迎国庆
2024庆【网站十一周
2023庆【网站十周年
2022庆【网站九周年
2021庆中秋、迎国庆
返回列表
查看:
8327
|
回复:
0
[分享]
开箱即用版本 满分室间质评之GATK Somatic SNV+Indel+CNV+SV(2024-04-30更新)
[复制链接]
虎威将军
虎威将军
当前离线
金桔
金币
威望
贡献
回帖
0
精华
在线时间
小时
雷达卡
发表于 2024-11-11 06:27
|
显示全部楼层
|
阅读模式
登陆有奖并可浏览互动!
您需要
登录
才可以下载或查看,没有账号?
立即注册
×
开箱即用版本 满分室间质评之GATK Somatic SNV+Indel+CNV+SV
本次更新支持基因组版本切换hg19/hg38,以及项目bed文件;方便各种项目切换和适配以及bug修复
最近准备为
sliverworkspace
图形化生信平台开发报告设计器,需要一个较为复杂的pipeline作为测试数据,就想起来把之前的 [[满分室间质评之GATK Somatic SNV+Indel+CNV+SV(下)性能优化]]翻出来用一下。跑了一遍发现还是各种问题,于是想把pipeline改造成免部署、首次运行初始化环境的版本,以便需要时候能够直接运行起来,于是有了本文。
一句话描述就是:开箱即用的pipeline,能够根据样本version_reference自动选择参考基因组版本,根据project_bed文件选择项目bed,自动初始化环境、安装所需软件、下载ref文件和数据库的版本
为了让pipeline能够直接运行,无需部署,这里使用docker容器保证运行环境的一致性:见文章:
基于docker的生信基础环境镜像构建
,这里采用的方案是带ssh服务的docker+conda环境,整个pipeline在一个通用容器中运行。
本文代码较长,可能略复杂,想直接运行的可以
下载 workflow文件
,导入sliverworkspace图形化生信平台直接运行,获取并查看图形化
分析报告
相关代码和workflow文件可以访问笔者
github项目地址
导入操作
https://www.sliverworkspace.com/docs/pipelines/gatk-somatic/import_new.gif
分析流程整体概览
docker 镜像拉取及部署配置
# 拉取docker镜像
docker pull doujiangbaozi/sliverworkspace:1.10
docker-compose.yml配置文件
version: "3"
services:
GATK:
image: doujiangbaozi/sliverworkspace:1.10
container_name: GATK
volumes:
- /home/sliver/Data/data:/opt/data:rw #挂载输入数据目录
- /home/sliver/Manufacture/gatk/envs:/root/mambaforge-pypy3/envs #挂载envs目录
- /home/sliver/Manufacture/sliver/ref:/opt/ref:rw #挂载reference目录
- /home/sliver/Manufacture/gatk/result:/opt/result:rw #挂载中间文件和结果目录
environment:
- TZ=Asia/Shanghai #设置时区
- PS=20191124 #设置ssh密码
- PT=9024 #设置ssh连接端口
docker 镜像部署运行
# 在docker-compose.yml文件同级目录下运行
docker-compose up -d
# 或者docker-compose -f docker-compose.yml所在目录
docker-compose -f somedir/docker-compose.yml up -d
# 可以通过ssh连接到docker 运行pipeline命令了,连接端口和密码见docker-compose.yml配置文件相关字段
ssh root@127.0.0.1 -p9024
测试数据
测试数据来自2017年卫计委室间质评提供的bed文件(pipeline会自动下载)和测试数据,修改命名以匹配pipeline输入端,也可以替换为自己的数据文件,因为室间质评目前参考基因组还停留在hg19版本,所以本流程仍然使用hg19(GRCH37),如果要切换到hg38,可以将version_reference变量值设置为hg38,project_bed设置为Illumina_pt2_hg38.bed。pipeline会使用hg38(GRCH38)版本和对应的bed文件。
文件名(按照需要有调整)
文件大小
MD5
B1701_R1.fq.gz
4.85G
07d3cdccee41dbb3adf5d2e04ab28e5b
B1701_R2.fq.gz
4.77G
c2aa4a8ab784c77423e821b9f7fb00a7
B1701NC_R1.fq.gz
3.04G
4fc21ad05f9ca8dc93d2749b8369891b
B1701NC_R2.fq.gz
3.11G
bc64784f2591a27ceede1727136888b9
变量名称
# 变量初始化赋值
sn=1701 #样本编号
pn=GS03 #pipeline 代号
project_bed=Illumina_pt2.bed #参考基因组hg19下的bed文件
version_reference=hg19 #人参考基因组版本hg19或者hg38
version_fastqc=0.12.1 #fastqc 版本
version_multiqc=1.21 #multiqc 版本
version_cnvkit=0.9.10 #cnvkit 版本
version_manta=1.6.0 #manta 版本
version_gatk=4.3.0.0 #gatk 版本
version_sambamba=1.0.1 #sambamba 版本
version_samtools=1.17 #samtools 版本
version_bwa=0.7.17 #bwa 版本
version_fastp=0.23.2 #fastp 版本
version_vep=108 #vep 版本
envs=/root/mambaforge-pypy3/envs #mamba envs 目录
threads=32 #最大线程数
memory=32G #内存占用
scatter=8 #Mutect2 分拆并行运行interval list 份数
event=2 #gatk FilterMutectCalls --max-events-in-region 数值
snv_tlod=16.00 #snv 过滤参数 tload 值
snv_vaf=0.01 #snv 过滤参数 丰度/突变频率
snv_depth=500 #snv 过滤参数 支持reads数/depth 测序深度
cnv_dep=1000 #cnv 过滤参数 支持reads数/depth 测序深度
cnv_min=-0.5 #cnv 过滤参数 log2最小值
cnv_max=0.5 #cnv 过滤参数 log2 最大值
sv_score=200 #sv 过滤参数 score
## 以上变量个可以具体根据需求调整表格:
变量名
变量值
备注
sn
1701
样本编号
pn
GS03
pipeline 代号 GATK Somatic 03版本
project_bed
Illumina_pt2.bed
参考基因组hg19下的bed文件
version_reference
hg19
人参考基因组版本hg19 / hg38
version_fastqc
0.12.1
fastqc 版本
version_multiqc
1.21
multiqc 版本
version_cnvkit
0.9.10
cnvkit 版本
version_manta
1.6.0
manta版本
version_gatk
4.3.0.0
gatk 版本
version_sambamba
1.0.1
sambamba 版本
version_samtools
1.17
samtools 版本
version_bwa
0.7.17
bwa 版本
version_fastp
0.23.2
fastp 版本
version_vep
108
vep 版本
envs
/root/mambaforge-pypy3/envs
mamba envs 目录
threads
32
最大线程数
memory
32G
内存占用
scatter
8
Mutect2 分拆并行运行interval list 份数
event
2
gatk FilterMutectCalls --max-events-in-region 数值
snv_tlod
16.00
snv 过滤参数 tload 值
snv_vaf
0.01
snv 过滤参数 丰度/突变频率
snv_depth
500
snv 过滤参数 支持reads数/depth 测序深度
cnv_dep
1000
cnv 过滤参数 支持reads数/depth 测序深度
cnv_min
-0.5
cnv 过滤参数 log2最小值
cnv_max
0.5
cnv 过滤参数 log2 最大值
sv_score
200
sv 过滤参数 score
Pipeline/workflow 具体步骤:
1. fastp 默认参数过滤,看下原始数据质量,clean data
#conda检测环境是否存在,首次运行不存在创建该环境并安装软件
if [ ! -d "${envs}/${pn}.fastp" ]; then
echo "Creating the environment ${pn}.fastp"
mamba create -n ${pn}.fastp -y fastp=${version_fastp} fastqc=${version_fastqc} multiqc=${version_multiqc}
fi
mamba activate ${pn}.fastp
mkdir -p ${result}/${sn}/trimmed
mkdir -p ${result}/${sn}/qc
fastqc -t ${threads}\
${data}/GS03/${sn}_tumor_R1.fq.gz \
${data}/GS03/${sn}_tumor_R2.fq.gz \
-o ${result}/${sn}/qc &
fastqc -t ${threads}\
${data}/GS03/${sn}_normal_R1.fq.gz \
${data}/GS03/${sn}_normal_R2.fq.gz \
-o ${result}/${sn}/qc &
fastp -w 16 \
-i ${data}/GS03/${sn}_tumor_R1.fq.gz \
-I ${data}/GS03/${sn}_tumor_R2.fq.gz \
-o ${result}/${sn}/trimmed/${sn}_tumor_R1_trimmed.fq.gz \
-O ${result}/${sn}/trimmed/${sn}_tumor_R2_trimmed.fq.gz \
-h ${result}/${sn}/trimmed/${sn}_tumor_fastp.html \
-j ${result}/${sn}/trimmed/${sn}_tumor_fastp.json &
fastp -w 16 \
-i ${data}/GS03/${sn}_normal_R1.fq.gz \
-I ${data}/GS03/${sn}_normal_R2.fq.gz \
-o ${result}/${sn}/trimmed/${sn}_normal_R1_trimmed.fq.gz \
-O ${result}/${sn}/trimmed/${sn}_normal_R2_trimmed.fq.gz \
-h ${result}/${sn}/trimmed/${sn}_normal_fastp.html \
-j ${result}/${sn}/trimmed/${sn}_normal_fastp.json &
wait
fastqc -t ${threads}\
${result}/${sn}/trimmed/${sn}_tumor_R1_trimmed.fq.gz \
${result}/${sn}/trimmed/${sn}_tumor_R2_trimmed.fq.gz \
-o ${result}/${sn}/qc &
fastqc -t ${threads}\
${result}/${sn}/trimmed/${sn}_normal_R1_trimmed.fq.gz \
${result}/${sn}/trimmed/${sn}_normal_R2_trimmed.fq.gz \
-o ${result}/${sn}/qc &
wait
mamba deactivate
2. normal文件fastq比对到参考基因组,sort 排序,mark duplicate 得到 marked.bam
#conda检测环境是否存在,首次运行不存在创建该环境并安装软件
if [ ! -d "${envs}/${pn}.align" ]; then
mamba create -n ${pn}.align -y bwa=${version_bwa} samtools=${version_samtools}
fi
#从github下载sambamba static 比 mamba 安装的版本速度快1倍以上.这是个很诡异的地方
if [ ! -f "${envs}/${pn}.align/bin/sambamba" ]; then
aria2c https://github.com/biod/sambamba/releases/download/v${version_sambamba}/sambamba-${version_sambamba}-linux-amd64-static.gz -d ${envs}/${pn}.align/bin
gzip -cdf ${envs}/${pn}.align/bin/sambamba-${version_sambamba}-linux-amd64-static.gz > ${envs}/${pn}.align/bin/sambamba
chmod a+x ${envs}/${pn}.align/bin/sambamba
fi
mamba activate ${pn}.align
if [ "${version_reference}" == hg19 ]; then
echo "USE reference Version : ${version_reference}"
mkdir -p /opt/ref/hg19
#如果没有检测到参考基因组序列,则下载序列并使用bwa创建索引
if [ ! -f "/opt/ref/hg19/hg19.fasta" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -d /opt/ref/hg19 -o hg19.fasta.gz
cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
else
if [ -f "/opt/ref/hg19/ucsc.hg19.fasta.gz.aria2" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -c -d /opt/ref/hg19 -o hg19.fasta.gz
cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
fi
fi
if [ ! -f /opt/ref/hg19/hg19.fasta.amb ] ||
[ ! -f /opt/ref/hg19/hg19.fasta.ann ] ||
[ ! -f /opt/ref/hg19/hg19.fasta.bwt ] ||
[ ! -f /opt/ref/hg19/hg19.fasta.pac ] ||
[ ! -f /opt/ref/hg19/hg19.fasta.sa ]; then
bwa index /opt/ref/hg19/hg19.fasta
fi
#检测samtools索引是否存在,如不存在则使用samtools创建参考基因组索引
if [ ! -f "/opt/ref/hg19/hg19.fasta.fai" ]; then
samtools faidx /opt/ref/hg19/hg19.fasta
fi
elif [ "${version_reference}" == "hg38" ]; then
echo "USE reference Version : ${version_reference}"
mkdir -p /opt/ref/hg38
#如果没有检测到参考基因组序列,则下载序列并使用bwa创建索引
if [ ! -f "/opt/ref/hg38/hg38.fasta" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz -d /opt/ref/hg38 -o hg38.fasta.gz
cd /opt/ref/hg38 && gzip -d /opt/ref/hg38/hg38.fasta.gz
else
if [ -f "/opt/ref/hg38/hg38.fasta.gz.aria2" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz -d /opt/ref/hg38 -d /opt/ref/hg38 -o hg38.fasta.gz
cd /opt/ref/hg38 && gzip -d /opt/ref/hg38/hg38.fasta.gz
fi
fi
if [ ! -f /opt/ref/hg38/hg38.fasta.amb ] ||
[ ! -f /opt/ref/hg38/hg38.fasta.ann ] ||
[ ! -f /opt/ref/hg38/hg38.fasta.bwt ] ||
[ ! -f /opt/ref/hg38/hg38.fasta.pac ] ||
[ ! -f /opt/ref/hg38/hg38.fasta.sa ]; then
bwa index /opt/ref/hg38/hg38.fasta
fi
#检测samtools索引是否存在,如不存在则使用samtools创建参考基因组索引
if [ ! -f "/opt/ref/hg38/hg38.fasta.fai" ]; then
samtools faidx /opt/ref/hg38/hg38.fasta
fi
fi
mkdir -p ${result}/${sn}/aligned
#比对基因组管道输出成bam文件,管道输出排序
bwa mem \
-t ${threads} -M \
-R "@RG\\tID:${sn}_normal\\tLB:${sn}_normal\\tPL:Illumina\\tPU:Miseq\\tSM:${sn}_normal" \
/opt/ref/${version_reference}/${version_reference}.fasta ${result}/${sn}/trimmed/${sn}_normal_R1_trimmed.fq.gz ${result}/${sn}/trimmed/${sn}_normal_R2_trimmed.fq.gz \
| sambamba view -S -f bam -l 0 /dev/stdin \
| sambamba sort -t ${threads} -m 2G --tmpdir=${result}/${sn}/aligned -o ${result}/${sn}/aligned/${sn}_normal_sorted.bam /dev/stdin
#防止linux打开文件句柄数超过限制,报错退出
ulimit -n 10240
#使用sambamba对sorted bam文件标记重复
sambamba markdup \
--tmpdir ${result}/${sn}/aligned \
-t ${threads} ${result}/${sn}/aligned/${sn}_normal_sorted.bam ${result}/${sn}/aligned/${sn}_normal_marked.bam
#修改marked bam文件索引名,gatk和sambamba索引文件名需要保持一致
mv ${result}/${sn}/aligned/${sn}_normal_marked.bam.bai ${result}/${sn}/aligned/${sn}_normal_marked.bai
#删除sorted bam文件
rm -f ${result}/${sn}/aligned/${sn}_normal_sorted.bam*
mamba deactivate
3. tumor文件fastq比对到参考基因组,排序,mark duplicate 得到 marked.bam,与第2步类似
if [ ! -d "${envs}/${pn}.align" ]; then
mamba create -n ${pn}.align -y bwa=${version_bwa} samtools=${version_samtools}
fi
#从github下载sambamba static 比 mamba 安装的版本速度快1倍以上.
if [ ! -f "${envs}/${pn}.align/bin/sambamba" ]; then
aria2c https://github.com/biod/sambamba/releases/download/v${version_sambamba}/sambamba-${version_sambamba}-linux-amd64-static.gz -d ${envs}/${pn}.align/bin
gzip -cdf ${envs}/${pn}.align/bin/sambamba-${version_sambamba}-linux-amd64-static.gz > ${envs}/${pn}.align/bin/sambamba
chmod a+x ${envs}/${pn}.align/bin/sambamba
fi
mamba activate ${pn}.align
if [ "${version_reference}" == hg19 ]; then
echo "USE reference Version : ${version_reference}"
mkdir -p /opt/ref/hg19
#如果没有检测到参考基因组序列,则下载序列并使用bwa创建索引
if [ ! -f "/opt/ref/hg19/hg19.fasta" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -d /opt/ref/hg19 -o hg19.fasta.gz
cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
else
if [ -f "/opt/ref/hg19/ucsc.hg19.fasta.gz.aria2" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -c -d /opt/ref/hg19 -o hg19.fasta.gz
cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
fi
fi
if [ ! -f /opt/ref/hg19/hg19.fasta.amb ] ||
[ ! -f /opt/ref/hg19/hg19.fasta.ann ] ||
[ ! -f /opt/ref/hg19/hg19.fasta.bwt ] ||
[ ! -f /opt/ref/hg19/hg19.fasta.pac ] ||
[ ! -f /opt/ref/hg19/hg19.fasta.sa ]; then
bwa index /opt/ref/hg19/hg19.fasta
fi
#检测samtools索引是否存在,如不存在则使用samtools创建参考基因组索引
if [ ! -f "/opt/ref/hg19/hg19.fasta.fai" ]; then
samtools faidx /opt/ref/hg19/hg19.fasta
fi
elif [ "${version_reference}" == "hg38" ]; then
echo "USE reference Version : ${version_reference}"
mkdir -p /opt/ref/hg38
#如果没有检测到参考基因组序列,则下载序列并使用bwa创建索引
if [ ! -f "/opt/ref/hg38/hg38.fasta" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz -d /opt/ref/hg38 -o hg38.fasta.gz
cd /opt/ref/hg38 && gzip -d /opt/ref/hg38/hg38.fasta.gz
else
if [ -f "/opt/ref/hg38/hg38.fasta.gz.aria2" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz -d /opt/ref/hg38 -d /opt/ref/hg38 -o hg38.fasta.gz
cd /opt/ref/hg38 && gzip -d /opt/ref/hg38/hg38.fasta.gz
fi
fi
if [ ! -f /opt/ref/hg38/hg38.fasta.amb ] ||
[ ! -f /opt/ref/hg38/hg38.fasta.ann ] ||
[ ! -f /opt/ref/hg38/hg38.fasta.bwt ] ||
[ ! -f /opt/ref/hg38/hg38.fasta.pac ] ||
[ ! -f /opt/ref/hg38/hg38.fasta.sa ]; then
bwa index /opt/ref/hg38/hg38.fasta
fi
#检测samtools索引是否存在,如不存在则使用samtools创建参考基因组索引
if [ ! -f "/opt/ref/hg38/hg38.fasta.fai" ]; then
samtools faidx /opt/ref/hg38/hg38.fasta
fi
fi
mkdir -p ${result}/${sn}/aligned
bwa mem \
-t ${threads} -M \
-R "@RG\\tID:${sn}_tumor\\tLB:${sn}_tumor\\tPL:Illumina\\tPU:Miseq\\tSM:${sn}_tumor" \
/opt/ref/${version_reference}/${version_reference}.fasta ${result}/${sn}/trimmed/${sn}_tumor_R1_trimmed.fq.gz ${result}/${sn}/trimmed/${sn}_tumor_R2_trimmed.fq.gz \
| sambamba view -S -f bam -l 0 /dev/stdin \
| sambamba sort -t ${threads} -m 2G --tmpdir=${result}/${sn}/aligned -o ${result}/${sn}/aligned/${sn}_tumor_sorted.bam /dev/stdin
ulimit -n 10240
sambamba markdup \
--tmpdir ${result}/${sn}/aligned \
-t ${threads} ${result}/${sn}/aligned/${sn}_tumor_sorted.bam ${result}/${sn}/aligned/${sn}_tumor_marked.bam
mv ${result}/${sn}/aligned/${sn}_tumor_marked.bam.bai ${result}/${sn}/aligned/${sn}_tumor_marked.bai
rm -f ${result}/${sn}/aligned/${sn}_tumor_sorted.bam*
mamba deactivate
4. 对上述bam文件生成重新校准表,为后续BQSR使用;Generates recalibration table for Base Quality Score Recalibration (BQSR)
#conda检测环境是否存在,首次运行不存在创建该环境并安装软件
if [ ! -f "${envs}/${pn}.gatk/bin/gatk" ]; then
mamba create -n ${pn}.gatk -y gatk4=${version_gatk} \
r-base=3.6.2 \
r-data.table=1.12.8 \
r-dplyr=0.8.5 \
r-getopt=1.20.3 \
r-ggplot2=3.3.0 \
r-gplots=3.0.3 \
r-gsalib=2.1 \
r-optparse=1.6.4 \
r-backports=1.1.10 \
tabix
fi
mamba activate ${pn}.gatk
if [ "${version_reference}" == hg19 ]; then
echo "USE reference Version : ${version_reference}"
#这里有个巨坑,broadinstitute ftp站点bundle目录提供的hg19版本参考文件,默认格式运行会报错,提示没有索引,使用gatk创建索引仍然报错,其实是gz格式需要使用bgzip重新压缩并且使用tabix创建索引才行
if [ ! -f "/opt/ref/hg19/dbsnp_138.hg19.vcf.gz" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz -d /opt/ref/hg19
else
if [ -f "/opt/ref/hg19/dbsnp_138.hg19.vcf.gz.aria2" ]; then
echo 'download continue...'
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz -c -d /opt/ref/hg19
fi
fi
if [ ! -f "/opt/ref/hg19/dbsnp_138.hg19.vcf.gz.tbi" ]; then
gzip -k -f -d /opt/ref/hg19/dbsnp_138.hg19.vcf.gz > /opt/ref/hg19/dbsnp_138.hg19.vcf
bgzip -f --threads ${threads} /opt/ref/hg19/dbsnp_138.hg19.vcf > /opt/ref/hg19/dbsnp_138.hg19.vcf.gz
tabix -f /opt/ref/hg19/dbsnp_138.hg19.vcf.gz
fi
if [ ! -f "/opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz -d /opt/ref/hg19
else
if [ -f "/opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz.aria2" ]; then
echo 'download continue...'
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz -c -d /opt/ref/hg19
fi
fi
if [ ! -f "/opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz.tbi" ]; then
gzip -k -f -d /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz > /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
bgzip -f --threads ${threads} /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf > /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
tabix -f /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
fi
if [ ! -f "/opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz -d /opt/ref/hg19
else
if [ -f "/opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz.aria2" ]; then
echo 'download continue...'
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz -c -d /opt/ref/hg19
fi
fi
if [ ! -f "/opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz.tbi" ]; then
gzip -k -f -d /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz > /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf
bgzip -f --threads ${threads} /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf > /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
tabix -f /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
fi
if [ ! -f "/opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz -d /opt/ref/
else
if [ -f "/opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz.aria2" ]; then
echo 'download continue...'
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz -c -d /opt/ref/hg19
fi
fi
if [ ! -f "/opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz.tbi" ]; then
gzip -k -f -d /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz > /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf
bgzip -f --threads ${threads} /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf > /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
tabix -f /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
fi
#创建参考序列hg19的dict字典文件
if [ ! -f "/opt/ref/hg19/hg19.dict" ]; then
gatk CreateSequenceDictionary -R /opt/ref/hg19/hg19.fasta -O /opt/ref/hg19/hg19.dict
fi
if [ ! -f "/opt/ref/${version_reference}/${project_bed}" ]; then
#根据下载的Illumina_pt2.bed 文件创建interval list文件,坐标转换,起始坐标0修改为1
#sed 's/chr//; s/\t/ /g' /opt/ref/hg19/Illumina_pt2.bed > /opt/ref/hg19/Illumina_pt2.processed.bed
mkdir -p /opt/ref/${version_reference}
aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/projects/${project_bed} -d /opt/ref/hg19
if [ ! -f "/opt/ref/hg19/${project_bed}.interval_list" ]; then
gatk BedToIntervalList \
-I /opt/ref/hg19/${project_bed} \
-SD /opt/ref/hg19/hg19.dict \
-O /opt/ref/hg19/${project_bed}.interval_list
fi
fi
mkdir -p ${result}/${sn}/recal
gatk BaseRecalibrator \
--known-sites /opt/ref/hg19/dbsnp_138.hg19.vcf.gz \
--known-sites /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz \
--known-sites /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz \
-L /opt/ref/hg19/${project_bed}.interval_list \
-R /opt/ref/hg19/hg19.fasta \
-I ${result}/${sn}/aligned/${sn}_tumor_marked.bam \
-O ${result}/${sn}/recal/${sn}_tumor_recal.table &
gatk BaseRecalibrator \
--known-sites /opt/ref/hg19/dbsnp_138.hg19.vcf.gz \
--known-sites /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz \
--known-sites /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz \
-L /opt/ref/hg19/${project_bed}.interval_list \
-R /opt/ref/hg19/hg19.fasta \
-I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
-O ${result}/${sn}/recal/${sn}_normal_recal.table &
wait
elif [ "${version_reference}" == "hg38" ]; then
echo "USE reference Version : ${version_reference}"
if [ ! -f "/opt/ref/hg38/dbsnp_146.hg38.vcf.gz" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz -d /opt/ref/hg38
else
if [ -f "/opt/ref/hg38/dbsnp_146.hg38.vcf.gz.aria2" ]; then
echo 'download continue...'
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz -c -d /opt/ref/hg38
fi
fi
if [ ! -f "/opt/ref/hg38/dbsnp_146.hg38.vcf.gz.tbi" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz.tbi -d /opt/ref/hg38
else
if [ -f "/opt/ref/hg38/dbsnp_146.hg38.vcf.gz.tbi.aria2" ]; then
echo 'download continue...'
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz.tbi -c -d /opt/ref/hg38
fi
fi
if [ ! -f "/opt/ref/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -d /opt/ref/hg38
else
if [ -f "/opt/ref/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.aria2" ]; then
echo 'download continue...'
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -c -d /opt/ref/hg38
fi
fi
if [ ! -f "/opt/ref/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi -d /opt/ref/hg38
else
if [ -f "/opt/ref/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi.aria2" ]; then
echo 'download continue...'
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi -c -d /opt/ref/hg38
fi
fi
if [ ! -f "/opt/ref/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz -d /opt/ref/hg38
else
if [ -f "/opt/ref/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.aria2" ]; then
echo 'download continue...'
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz -c -d /opt/ref/hg38
fi
fi
if [ ! -f "/opt/ref/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi -d /opt/ref/hg38
else
if [ -f "/opt/ref/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi.aria2" ]; then
echo 'download continue...'
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi -c -d /opt/ref/hg38
fi
fi
#创建参考序列hg38的dict字典文件
if [ ! -f "/opt/ref/hg38/hg38.dict" ]; then
gatk CreateSequenceDictionary -R /opt/ref/hg38/hg38.fasta -O /opt/ref/hg38/hg38.dict
fi
if [ ! -f "/opt/ref/${version_reference}/${project_bed}" ]; then
#根据下载的Illumina_pt2.bed 文件创建interval list文件,坐标转换,起始坐标0修改为1
#sed 's/chr//; s/\t/ /g' /opt/ref/hg38/Illumina_pt2.bed > /opt/ref/hg38/Illumina_pt2.processed.bed
mkdir -p /opt/ref/hg38
aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/projects/${project_bed} -d /opt/ref/hg38 -o ${project_bed}
if [ ! -f "/opt/ref/hg38/${project_bed}.interval_list" ]; then
gatk BedToIntervalList \
-I /opt/ref/hg38/${project_bed} \
-SD /opt/ref/hg38/hg38.dict \
-O /opt/ref/hg38/${project_bed}.interval_list
fi
fi
mkdir -p ${result}/${sn}/recal
gatk BaseRecalibrator \
--known-sites /opt/ref/hg38/dbsnp_146.hg38.vcf.gz \
--known-sites /opt/ref/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-sites /opt/ref/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-L /opt/ref/hg38/${project_bed}.interval_list \
-R /opt/ref/hg38/hg38.fasta \
-I ${result}/${sn}/aligned/${sn}_tumor_marked.bam \
-O ${result}/${sn}/recal/${sn}_tumor_recal.table &
gatk BaseRecalibrator \
--known-sites /opt/ref/hg38/dbsnp_146.hg38.vcf.gz \
--known-sites /opt/ref/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-sites /opt/ref/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-L /opt/ref/hg38/${project_bed}.interval_list \
-R /opt/ref/hg38/hg38.fasta \
-I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
-O ${result}/${sn}/recal/${sn}_normal_recal.table &
wait
fi
•
mamba deactivate
5. 使用校准表对bam碱基质量校准,因为这一步gatk效率感人,所以同时计算insertsize,拆分interval list(后续mutect2并行运行需要),运行cnvkit batch,运行samtools depth计算测序深度,samtools flagstat 统计mapping比例及质量
mkdir -p ${result}/${sn}/bqsr
mkdir -p ${result}/${sn}/stat
mkdir -p ${result}/${sn}/cnv
mkdir -p ${result}/${sn}/interval
mamba activate ${pn}.gatk
gatk ApplyBQSR \
--bqsr-recal-file ${result}/${sn}/recal/${sn}_tumor_recal.table \
-L /opt/ref/${version_reference}/${project_bed}.interval_list \
-R /opt/ref/${version_reference}/${version_reference}.fasta \
-I ${result}/${sn}/aligned/${sn}_tumor_marked.bam \
-O ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam &
•
gatk ApplyBQSR \
--bqsr-recal-file ${result}/${sn}/recal/${sn}_normal_recal.table \
-L /opt/ref/${version_reference}/${project_bed}.interval_list \
-R /opt/ref/${version_reference}/${version_reference}.fasta \
-I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
-O ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam &
•
gatk CollectInsertSizeMetrics \
-I ${result}/${sn}/aligned/${sn}_tumor_marked.bam \
-O ${result}/${sn}/stat/${sn}_tumor_insertsize_metrics.txt \
-H ${result}/${sn}/stat/${sn}_tumor_insertsize_histogram.pdf &
•
gatk CollectInsertSizeMetrics \
-I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
-O ${result}/${sn}/stat/${sn}_normal_insertsize_metrics.txt \
-H ${result}/${sn}/stat/${sn}_normal_insertsize_histogram.pdf &
rm -f ${result}/${sn}/interval/*.interval_list
gatk SplitIntervals \
-L /opt/ref/${version_reference}/${project_bed}.interval_list \
-R /opt/ref/${version_reference}/${version_reference}.fasta \
-O ${result}/${sn}/interval \
--scatter-count ${scatter} &
mamba deactivate
if [ ! -d "${envs}/${pn}.cnvkit" ]; then
mamba create -n ${pn}.cnvkit -y cnvkit=${version_cnvkit}
fi
if [ ! -f "/opt/ref/${version_reference}/refFlat.txt" ]; then
aria2c -x 16 -s 16 http://hgdownload.soe.ucsc.edu/goldenPath/${version_reference}/database/refFlat.txt.gz -d /opt/ref/${version_reference}
cd /opt/ref/${version_reference} && gzip -d refFlat.txt.gz
fi
mamba activate ${pn}.cnvkit
rm -f ${result}/${sn}/cnv/${sn}_reference.cnn
cnvkit.py batch \
${result}/${sn}/aligned/${sn}_tumor_marked.bam \
--normal ${result}/${sn}/aligned/${sn}_normal_marked.bam \
--method hybrid \
--fasta /opt/ref/${version_reference}/${version_reference}.fasta \
--targets /opt/ref/${version_reference}/${project_bed} \
--annotate /opt/ref/${version_reference}/refFlat.txt \
--output-reference ${result}/${sn}/cnv/${sn}_reference.cnn \
--output-dir ${result}/${sn}/cnv/ \
--diagram \
-p ${threads} &
mamba deactivate
mamba activate ${pn}.align
samtools depth -a -b /opt/ref/${version_reference}/Illumina_pt2.bed --threads ${threads} \
${result}/${sn}/aligned/${sn}_tumor_marked.bam > \
${result}/${sn}/stat/${sn}_tumor_marked.depth &
samtools depth -a -b /opt/ref/${version_reference}/Illumina_pt2.bed --threads ${threads} \
${result}/${sn}/aligned/${sn}_normal_marked.bam > \
${result}/${sn}/stat/${sn}_normal_marked.depth &
samtools flagstat --threads ${threads} \
${result}/${sn}/aligned/${sn}_tumor_marked.bam > \
${result}/${sn}/stat/${sn}_tumor_marked.flagstat &
samtools flagstat --threads ${threads} \
${result}/${sn}/aligned/${sn}_normal_marked.bam > \
${result}/${sn}/stat/${sn}_normal_marked.flagstat &
mamba deactivate
wait
6. 计算堆叠数据( pileup metrics )以便后续评估污染,也可以根据拆分的interval list并行处理,处理之后合并。
#官方巨坑,默认提供的small_exac_common_3_b37.vcf.gz默认染色体坐标不是以chr开头而是数字
mamba activate ${pn}.gatk
if [ "${version_reference}" == hg19 ]; then
echo "USE reference Version : ${version_reference}"
#这里有个巨坑,从broadinstitute ftp 站点bundle Mutect2目录下载的参考文件,与同样下载的参考序列基因组坐标系不一致,参考基因组参考序列是chr1这种格式,这个af-only-gnomad是1,2,3这种格式,需要编写脚本处理
if [ ! -f "/opt/ref/hg19/small_exac_common_3_b37.processed.vcf.gz" ]; then
aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/GetPileupSummaries/small_exac_common_3_b37.vcf.gz -d /opt/ref/hg19
else
if [ -f "/opt/ref/hg19/small_exac_common_3_b37.vcf.gz.aria2" ]; then
aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/GetPileupSummaries/small_exac_common_3_b37.vcf.gz -c -d /opt/ref/hg19
fi
fi
if [ ! -f "/opt/ref/${version_reference}/small_exac_common_3_b37.processed.vcf.gz" ]; then
if [ ! -f "${envs}/VcfProcessUtil.py" ]; then
aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
chmod a+x ${envs}/VcfProcessUtil.py
fi
mamba activate ${pn}.cnvkit
${envs}/VcfProcessUtil.py \
-f /opt/ref/${version_reference}/small_exac_common_3_b37.vcf.gz \
-o /opt/ref/${version_reference}/small_exac_common_3_b37.processed.vcf
mamba deactivate
mamba activate ${pn}.gatk
cd /opt/ref/${version_reference}
bgzip -f --threads ${threads} small_exac_common_3_b37.processed.vcf
tabix -f small_exac_common_3_b37.processed.vcf.gz
mamba deactivate
fi
mamba activate ${pn}.gatk
for i in `ls ${result}/${sn}/interval/*.interval_list`;
do
echo $i
gatk GetPileupSummaries \
-R /opt/ref/${version_reference}/${version_reference}.fasta \
-I ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam \
-O ${i%.*}-tumor-pileups.table \
-V /opt/ref/${version_reference}/small_exac_common_3_b37.processed.vcf.gz \
-L $i \
--interval-set-rule INTERSECTION &
gatk GetPileupSummaries \
-R /opt/ref/${version_reference}/${version_reference}.fasta \
-I ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam \
-O ${i%.*}-normal-pileups.table \
-V /opt/ref/${version_reference}/small_exac_common_3_b37.processed.vcf.gz \
-L $i \
--interval-set-rule INTERSECTION &
done
wait
mamba deactivate
elif [ "${version_reference}" == "hg38" ]; then
echo "USE reference Version : ${version_reference}"
if [ ! -f "/opt/ref/${version_reference}/small_exac_common_3.hg38.vcf.gz" ]; then
aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/GetPileupSummaries/small_exac_common_3.hg38.vcf.gz -d /opt/ref/${version_reference}
else
if [ -f "/opt/ref/hg19/small_exac_common_3_b37.vcf.gz.aria2" ]; then
aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/GetPileupSummaries/small_exac_common_3.hg38.vcf.gz -c -d /opt/ref/${version_reference}
fi
fi
if [ ! -f "/opt/ref/${version_reference}/small_exac_common_3.hg38.vcf.gz.tbi" ]; then
aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/GetPileupSummaries/small_exac_common_3.hg38.vcf.gz.tbi -d /opt/ref/${version_reference}
else
if [ -f "/opt/ref/hg19/small_exac_common_3_b37.vcf.gz.tbi.aria2" ]; then
aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/GetPileupSummaries/small_exac_common_3.hg38.vcf.gz.tbi -c -d /opt/ref/${version_reference}
fi
fi
mamba activate ${pn}.gatk
for i in `ls ${result}/${sn}/interval/*.interval_list`;
do
echo $i
gatk GetPileupSummaries \
-R /opt/ref/${version_reference}/${version_reference}.fasta \
-I ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam \
-O ${i%.*}-tumor-pileups.table \
-V /opt/ref/${version_reference}/small_exac_common_3.hg38.vcf.gz \
-L $i \
--interval-set-rule INTERSECTION &
gatk GetPileupSummaries \
-R /opt/ref/${version_reference}/${version_reference}.fasta \
-I ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam \
-O ${i%.*}-normal-pileups.table \
-V /opt/ref/${version_reference}/small_exac_common_3.hg38.vcf.gz \
-L $i \
--interval-set-rule INTERSECTION &
done
wait
mamba deactivate
fi
mamba activate ${pn}.gatk
tables=
for i in `ls ${result}/${sn}/interval/*-tumor-pileups.table`;
do
tables="$tables -I $i"
done
gatk GatherPileupSummaries \
--sequence-dictionary /opt/ref/${version_reference}/${version_reference}.dict \
$tables \
-O ${result}/${sn}/stat/${sn}_tumor_pileups.table
nctables=
for i in `ls ${result}/${sn}/interval/*-normal-pileups.table`;
do
nctables="$nctables -I $i"
done
gatk GatherPileupSummaries \
--sequence-dictionary /opt/ref/${version_reference}/${version_reference}.dict \
$nctables \
-O ${result}/${sn}/stat/${sn}_normal_pileups.table
mamba deactivate
7. 使用GetPileupSummaries计算结果评估跨样本污染,结果用于后面 FilterMutectCall 过滤Mutect2输出结果
mamba activate ${pn}.gatk
gatk CalculateContamination \
-matched ${result}/${sn}/stat/${sn}_normal_pileups.table \
-I ${result}/${sn}/stat/${sn}_tumor_pileups.table \
-O ${result}/${sn}/stat/${sn}_contamination.table
mamba deactivate
8. Mutect2 call 突变,使用拆分的interval list,结束后将结果合并;同时并行运行manta call sv突变
mkdir -p ${result}/${sn}/sv
mkdir -p ${result}/${sn}/snv
mamba activate ${pn}.gatk
if [ "${version_reference}" == hg19 ]; then
echo "USE reference Version : ${version_reference}"
#这里有个巨坑,从broadinstitute ftp 站点bundle Mutect2目录下载的参考文件,与同样下载的参考序列基因组坐标系不一致,参考基因组参考序列是chr1这种格式,这个af-only-gnomad是1,2,3这种格式,需要编写脚本处理;hg38貌似没有这个问题,hg19的数据都不维护了么?
if [ ! -f "/opt/ref/hg19/af-only-gnomad.raw.sites.b37.vcf.gz" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/af-only-gnomad.raw.sites.b37.vcf.gz -d /opt/ref/hg19
else
if [ -f "/opt/ref/hg19/af-only-gnomad.raw.sites.b37.vcf.gz.aria2" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/af-only-gnomad.raw.sites.b37.vcf.gz -c -d /opt/ref/hg19
fi
if [ ! -f "${envs}/VcfProcessUtil.py" ]; then
aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
chmod a+x ${envs}/VcfProcessUtil.py
fi
mamba activate ${pn}.cnvkit
${envs}/VcfProcessUtil.py \
-f /opt/ref/hg19/af-only-gnomad.raw.sites.b37.vcf.gz \
-o /opt/ref/hg19/af-only-gnomad.raw.sites.b37.processed.vcf
mamba deactivate
mamba activate ${pn}.gatk
cd /opt/ref/hg19
bgzip -f --threads ${threads} af-only-gnomad.raw.sites.b37.processed.vcf
tabix -f af-only-gnomad.raw.sites.b37.processed.vcf.gz
mamba deactivate
fi
elif [ "${version_reference}" == "hg38" ]; then
echo "USE reference Version : ${version_reference}"
if [ ! -f "/opt/ref/${version_reference}/af-only-gnomad.hg38.vcf.gz" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/af-only-gnomad.hg38.vcf.gz -d /opt/ref/${version_reference}
else
if [ -f "/opt/ref/${version_reference}/af-only-gnomad.hg38.vcf.gz.aria2" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/af-only-gnomad.hg38.vcf.gz -c -d /opt/ref/${version_reference}
fi
fi
if [ ! -f "/opt/ref/${version_reference}/af-only-gnomad.hg38.vcf.gz.tbi" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/af-only-gnomad.hg38.vcf.gz.tbi -d /opt/ref/${version_reference}
else
if [ -f "/opt/ref/${version_reference}/af-only-gnomad.hg38.vcf.gz.tbi.aria2" ]; then
aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/af-only-gnomad.hg38.vcf.gz.tbi -c -d /opt/ref/${version_reference}
fi
fi
fi
•
mamba activate ${pn}.gatk
if [ ! -f "/opt/ref/${version_reference}/${project_bed}.gz" ]; then
bgzip -f -c /opt/ref/${version_reference}/${project_bed} > /opt/ref/hg19/${project_bed}.gz
tabix -f -p bed /opt/ref/${version_reference}/${project_bed}.gz
else
if [ ! -f "/opt/ref/${version_reference}/${project_bed}.gz.tbi" ]; then
tabix -f -p bed /opt/ref/${version_reference}/${project_bed}.gz
fi
fi
mamba deactivate
if [ ! -d "${envs}/${pn}.manta" ]; then
mamba create -n ${pn}.manta -y manta=1.6.0
fi
mamba activate ${pn}.manta
rm -f ${result}/${sn}/sv/runWorkflow.py*
configManta.py \
--normalBam ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam \
--tumorBam ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam \
--referenceFasta /opt/ref/${version_reference}/${version_reference}.fasta \
--exome \
--callRegions /opt/ref/${version_reference}/${project_bed}.gz \
--runDir ${result}/${sn}/sv
rm -rf ${result}/${sn}/sv/workspace
python ${result}/${sn}/sv/runWorkflow.py -m local -j ${threads} &
mamba deactivate
mamba activate ${pn}.gatk
if [ "${version_reference}" == hg19 ]; then
echo "USE reference Version : ${version_reference}"
rm -f ${result}/${sn}/snv/vcf-file.list
touch ${result}/${sn}/snv/vcf-file.list
for i in `ls ${result}/${sn}/interval/*.interval_list`;
do
rm -f ${i%.*}_bqsr.vcf.gz
gatk Mutect2 \
-R /opt/ref/${version_reference}/${version_reference}.fasta \
-I ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam -tumor ${sn}_tumor \
-I ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam -normal ${sn}_normal \
-L $i \
-O ${i%.*}_bqsr.vcf.gz \
--max-mnp-distance 10 \
--germline-resource /opt/ref/${version_reference}/af-only-gnomad.raw.sites.b37.processed.vcf.gz \
--native-pair-hmm-threads ${threads} &
echo ${i%.*}_bqsr.vcf.gz >> ${result}/${sn}/snv/vcf-file.list
done
elif [ "${version_reference}" == "hg38" ]; then
echo "USE reference Version : ${version_reference}"
rm -f ${result}/${sn}/snv/vcf-file.list
touch ${result}/${sn}/snv/vcf-file.list
for i in `ls ${result}/${sn}/interval/*.interval_list`;
do
rm -f ${i%.*}_bqsr.vcf.gz
gatk Mutect2 \
-R /opt/ref/${version_reference}/${version_reference}.fasta \
-I ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam -tumor ${sn}_tumor \
-I ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam -normal ${sn}_normal \
-L $i \
-O ${i%.*}_bqsr.vcf.gz \
--max-mnp-distance 10 \
--germline-resource /opt/ref/${version_reference}/af-only-gnomad.hg38.vcf.gz \
--native-pair-hmm-threads ${threads} &
echo ${i%.*}_bqsr.vcf.gz >> ${result}/${sn}/snv/vcf-file.list
done
fi
wait
•
rm -f ${result}/${sn}/snv/${sn}_bqsr.vcf.gz.stats
stats=
for z in `ls ${result}/${sn}/interval/*_bqsr.vcf.gz.stats`;
do
stats="$stats -stats $z"
done
gatk MergeMutectStats $stats \
-O ${result}/${sn}/snv/${sn}_bqsr.vcf.gz.stats
gatk MergeVcfs \
-I ${result}/${sn}/snv/vcf-file.list \
-O ${result}/${sn}/snv/${sn}_bqsr.vcf.gz
mamba deactivate
9. FilterMutectCalls 对Mutect结果突变过滤
mamba activate ${pn}.gatk
gatk FilterMutectCalls \
--max-events-in-region ${event} \
--contamination-table ${result}/${sn}/stat/${sn}_contamination.table \
-R /opt/ref/${version_reference}/${version_reference}.fasta \
-V ${result}/${sn}/snv/${sn}_bqsr.vcf.gz \
-O ${result}/${sn}/snv/${sn}_filtered.vcf.gz
mamba deactivate
10. 使用Vep注释过滤结果
#conda检测环境是否存在,首次运行不存在创建该环境并安装软件
if [ ! -d "${envs}/${pn}.vep" ]; then
echo "Creating the environment ${pn}.vep"
mamba create -n ${pn}.vep -y ensembl-vep=${version_vep}
fi
mkdir -p /opt/result/${sn}/vcf
if [ "${version_reference}" == hg19 ]; then
echo "USE reference Version : ${version_reference}"
#检测vep注释数据库是否存在如果不存在则先下载
if [ ! -d "/opt/ref/vep-cache/homo_sapiens/${version_vep}_GRCh37" ]; then
aria2c -x 16 -s 48 https://ftp.ensembl.org/pub/release-${version_vep}/variation/indexed_vep_cache/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -d /opt/ref/
tar -zxvf /opt/ref/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
elif [ -f "/opt/ref/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz.aria2" ]; then
aria2c -x 16 -s 48 https://ftp.ensembl.org/pub/release-${version_vep}/variation/indexed_vep_cache/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -c -d /opt/ref/
tar -zxvf /opt/ref/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
fi
if [ ! -d "/opt/ref/vep-cache/homo_sapiens_refseq/${version_vep}_GRCh37" ]; then
aria2c -x 16 -s 48 http://ftp.ensembl.org/pub/release-${version_vep}/variation/vep/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -d /opt/ref/
tar -zxvf /opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
elif [ -f "/opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz.aria2" ]; then
aria2c -x 16 -s 48 http://ftp.ensembl.org/pub/release-${version_vep}/variation/vep/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -c -d /opt/ref/
tar -zxvf /opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
fi
mamba activate ${pn}.vep
mkdir -p ${result}/${sn}/annotation
vep \
-i ${result}/${sn}/snv/${sn}_filtered.vcf.gz \
-o ${result}/${sn}/annotation/${sn}_filtered_vep.tsv \
--offline \
--cache \
--cache_version ${version_vep} \
--everything \
--dir_cache /opt/ref/vep-cache/ \
--dir_plugins /opt/ref/vep-cache/Plugins \
--species homo_sapiens \
--assembly GRCh37 \
--fasta /opt/ref/${version_reference}/${version_reference}.fasta \
--refseq \
--force_overwrite \
--format vcf \
--tab \
--shift_3prime 1 \
--offline
mamba deactivate
elif [ "${version_reference}" == "hg38" ]; then
echo "USE reference Version : ${version_reference}"
#检测vep注释数据库是否存在如果不存在则先下载
if [ ! -d "/opt/ref/vep-cache/homo_sapiens/${version_vep}_GRCh38" ]; then
aria2c -x 16 -s 48 https://ftp.ensembl.org/pub/release-${version_vep}/variation/indexed_vep_cache/homo_sapiens_vep_${version_vep}_GRCh38.tar.gz -d /opt/ref/
tar -zxvf /opt/ref/homo_sapiens_vep_${version_vep}_GRCh38.tar.gz -C /opt/ref/vep-cache/
elif [ -f "/opt/ref/homo_sapiens_vep_${version_vep}_GRCh38.tar.gz.aria2" ]; then
aria2c -x 16 -s 48 https://ftp.ensembl.org/pub/release-${version_vep}/variation/indexed_vep_cache/homo_sapiens_vep_${version_vep}_GRCh38.tar.gz -c -d /opt/ref/
tar -zxvf /opt/ref/homo_sapiens_vep_${version_vep}_GRCh38.tar.gz -C /opt/ref/vep-cache/
fi
if [ ! -d "/opt/ref/vep-cache/homo_sapiens_refseq/${version_vep}_GRCh38" ]; then
aria2c -x 16 -s 48 http://ftp.ensembl.org/pub/release-${version_vep}/variation/vep/homo_sapiens_refseq_vep_${version_vep}_GRCh38.tar.gz -d /opt/ref/
tar -zxvf /opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh38.tar.gz -C /opt/ref/vep-cache/
elif [ -f "/opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh38.tar.gz.aria2" ]; then
aria2c -x 16 -s 48 http://ftp.ensembl.org/pub/release-${version_vep}/variation/vep/homo_sapiens_refseq_vep_${version_vep}_GRCh38.tar.gz -c -d /opt/ref/
tar -zxvf /opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh38.tar.gz -C /opt/ref/vep-cache/
fi
mamba activate ${pn}.vep
mkdir -p ${result}/${sn}/annotation
vep \
-i ${result}/${sn}/snv/${sn}_filtered.vcf.gz \
-o ${result}/${sn}/annotation/${sn}_filtered_vep.tsv \
--offline \
--cache \
--cache_version ${version_vep} \
--everything \
--dir_cache /opt/ref/vep-cache/ \
--dir_plugins /opt/ref/vep-cache/Plugins \
--species homo_sapiens \
--assembly GRCh37 \
--fasta /opt/ref/${version_reference}/${version_reference}.fasta \
--refseq \
--force_overwrite \
--format vcf \
--tab \
--shift_3prime 1 \
--offline
mamba deactivate
fi
11. 使用脚本处理注释结果和过滤vcf结果,输出和室间质评要求格式的数据表格
mamba activate ${pn}.cnvkit
if [ ! -f "${envs}/MatchedSnvVepAnnotationFilter.py" ]; then
aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/MatchedSnvVepAnnotationFilter.py -d ${envs}/
chmod a+x ${envs}/MatchedSnvVepAnnotationFilter.py
fi
${envs}/MatchedSnvVepAnnotationFilter.py \
-e normal_artifact \
-e germline \
-i strand_bias \
-i clustered_events \
--min-vaf=${snv_vaf} \
--min-tlod=${snv_tlod} \
--min-depth=${snv_depth} \
-v ${result}/${sn}/snv/${sn}_filtered.vcf.gz \
-a ${result}/${sn}/annotation/${sn}_filtered_vep.tsv \
-o ${result}/${sn}/annotation/${sn}.result.SNV.tsv
mamba deactivate
12. 使用cnvkit提供工具输出分布图和热图
mamba activate ${pn}.cnvkit
cnvkit.py scatter ${result}/${sn}/cnv/${sn}_tumor_marked.cnr \
-s ${result}/${sn}/cnv/${sn}_tumor_marked.cns \
-i ' ' \
-n ${sn}_normal \
-o ${result}/${sn}/cnv/${sn}_cnv_scatter.png -t &&
cnvkit.py heatmap ${result}/${sn}/cnv/${sn}_tumor_marked.cns \
-o ${result}/${sn}/cnv/${sn}_cnv_heatmap.png
mamba deactivate
13. 使用cnvkit call 根据cnvkit batch输出结果推算拷贝数
mamba activate ${pn}.cnvkit
cnvkit.py call ${result}/${sn}/cnv/${sn}_tumor_marked.cns \
-o ${result}/${sn}/cnv/${sn}_tumor_marked.call.cns
mamba deactivate
14. 编写脚本处理cnvkit输出,计算cnv基因,exon位置,gain/lost,cn数
mamba activate ${pn}.cnvkit
if [ ! -f "${envs}/CnvAnnotationFilter.py" ]; then
aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/CnvAnnotationFilter.py -d ${envs}/
chmod a+x ${envs}/CnvAnnotationFilter.py
fi
if [ ! -f "/opt/ref/${version_reference}/refGene.txt" ]; then
aria2c -x 16 -s 16 http://hgdownload.cse.ucsc.edu/goldenPath/${version_reference}/database/refGene.txt.gz -d /opt/ref/${version_reference} -o refGene.txt.gz
cd /opt/ref/${version_reference} && gzip -d refGene.txt.gz
fi
python ${envs}/CnvAnnotationFilter.py \
-r /opt/ref/${version_reference}/refGene.txt \
-i ${cnv_min} \
-x ${cnv_max} \
-D ${cnv_dep} \
-f ${result}/${sn}/cnv/${sn}_tumor_marked.call.cns \
-o ${result}/${sn}/cnv/${sn}.result.CNV.tsv
mamba deactivate
15. 编写脚本处理manta的输出,获取最终sv输出结果,起始位置,基因、频率等
mamba activate ${pn}.cnvkit
if [ ! -f "${envs}/SvAnnotationFilter.py" ]; then
aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/SvAnnotationFilter.py -d ${envs}/
chmod a+x ${envs}/SvAnnotationFilter.py
fi
if [ ! -f "/opt/ref/${version_reference}/refGene.txt" ]; then
aria2c -x 16 -s 16 http://hgdownload.cse.ucsc.edu/goldenPath/${version_reference}/database/refGene.txt.gz -d /opt/ref/${version_reference} -o refGene.txt.gz
cd /opt/ref/${version_reference} && gzip -d refGene.txt.gz
fi
${envs}/SvAnnotationFilter.py \
-r /opt/ref/${version_reference}/refGene.txt \
-s ${sv_score} \
-f ${result}/${sn}/sv/results/variants/somaticSV.vcf.gz \
-o ${result}/${sn}/sv/${sn}.result.SV.tsv
mamba deactivate
16. 根据之前fastp,samtools depth,samtools flagstat,gatk CollectInsertSizeMetrics等输出,给出综合 QC数据
mamba activate ${pn}.cnvkit
if [ ! -f "${envs}/MatchedQcProcessor.py" ]; then
aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/MatchedQcProcessor.py -d ${envs}/
chmod a+x ${envs}/MatchedQcProcessor.py
fi
${envs}/MatchedQcProcessor.py --bed /opt/ref/${version_reference}/${project_bed} \
--out ${result}/${sn}/stat/${sn}.result.QC.tsv \
--sample-fastp=${result}/${sn}/trimmed/${sn}_tumor_fastp.json \
--sample-depth=${result}/${sn}/stat/${sn}_tumor_marked.depth \
--sample-flagstat=${result}/${sn}/stat/${sn}_tumor_marked.flagstat \
--sample-insertsize=${result}/${sn}/stat/${sn}_tumor_insertsize_metrics.txt \
--normal-fastp=${result}/${sn}/trimmed/${sn}_normal_fastp.json \
--normal-depth=${result}/${sn}/stat/${sn}_normal_marked.depth \
--normal-flagstat=${result}/${sn}/stat/${sn}_normal_marked.flagstat \
--normal-insertsize=${result}/${sn}/stat/${sn}_normal_insertsize_metrics.txt
mamba deactivate
mamba activate ${pn}.fastp
multiqc ${result}/${sn}/ -f -o ${result}/${sn}/qc
mamba deactivate
最终输出
文件名
备注
report.pdf
由sliverworkspace设计的图形化分析报告打印输出为PDF
1701/qc/multiqc_report.html
multiqc聚合报告
1701/1701.result.SNV.tsv
SNV最终突变结果
1701/1701/cnv/1701_cnv_heatmap.png
CNV结果热图
1701/cnv/1701_cnv_scatter.png
CNV结果分布图
1701/cnv/1701.result.CNV.tsv
CNV最终结果
1701.result.SV.tsv
SV最终结果
1701.result.QC.tsv
最终质控结果
1701/qc/multiqc_report.html
multiqc报告
原文地址:https://zhuanlan.zhihu.com/p/659838203
楼主热帖
小桔灯网业务合作须知!
如何注册小桔灯网VIP会员?
ce是3c认证吗? 两大体系区别一览
[
CE注册
]
流式细胞术原理、实验步骤及应用概览
[
流式细胞技术
]
FDA 注册 (第四篇):FDA 510K 产品注册
[
FDA注册
]
本科动物医学 硕士微生物/生物医学可以考检验技师或微生物技术吗?
[
微生物检验
]
干货分享 | 免疫组化实验
[
免疫组化技术
]
执业药师大家都是怎么复习的?
[
执业药师
]
从硬件谈质谱之信号篇-4
[
质谱技术
]
2025年全球宠物血液分析仪市场规模约1.2亿美元 便携式产品需求将持续增长[图]
[
血型分析仪
]
回复
使用道具
举报
提升卡
返回列表
发表回复
高级模式
B
Color
Image
Link
Quote
Code
Smilies
您需要登录后才可以回帖
登录
|
立即注册
本版积分规则
发表回复
回帖后跳转到最后一页
浏览过的版块
中标结果
关闭
官方推荐
/3
【扫描左侧二维码关注微信】参与交流!
网站定期开展行业相关话题互动交流活动!对认真参与讨论的桔友将有金桔奖励!欢迎参与。
查看 »
IVD业界薪资调查(月薪/税前)
长期活动,投票后可见结果!看看咱们这个行业个人的前景如何。请热爱行业的桔友们积极参与!
查看 »
小桔灯网视频号开通了!
扫描二维码,关注视频号!
查看 »
返回顶部
快速回复
返回列表
客服中心
搜索
官方QQ群
洽谈合作
关注微信
微信扫一扫关注本站公众号
个人中心
个人中心
登录或注册
业务合作
-
投稿通道
-
友链申请
-
手机版
-
联系我们
-
免责声明
-
返回首页
Copyright © 2008-2024
小桔灯网
(https://www.iivd.net) 版权所有 All Rights Reserved.
免责声明: 本网不承担任何由内容提供商提供的信息所引起的争议和法律责任。
Powered by
Discuz!
X3.5 技术支持:
宇翼科技
浙ICP备18026348号-2
浙公网安备33010802005999号
快速回复
返回顶部
返回列表