Following these steps, you can load a VCF file into R and run a PCA on the samples to check how closely they are related.
step1: load the vcf file into R
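    ## You can download a VCF file from the 1000 Genomes Project.
    library(vcfR)
    ## Run this block once to parse the VCF and cache it; change FALSE to TRUE to run it.
    if (FALSE) {
      vcf_file = '/gatk/germline/merge.vcf'
      ### Just read the merged population gVCF file directly
      vcf <- read.vcfR(vcf_file, verbose = FALSE)
      save(vcf, file = 'example_vcf.Rdata')
    }
    ## Load the cached vcfR object and take a quick look at it.
    f = 'example_vcf.Rdata'
    load(file = f)
    vcf@fix[1:4, 1:4]    # fixed fields: CHROM, POS, ID, REF
    vcf@gt[1:14, 1:4]    # FORMAT column plus the first three samples
    colnames(vcf@gt)     # sample names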
step2: load the selected SNP positions into R
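    bed = read.table('SNPbeds/SNP_GRCh38_hg38_wChr.bed', header = F, stringsAsFactors = F)
    # Convert the coordinate columns to trimmed character strings, so that
    # apply() does not pad the numbers with spaces when building the keys.
    bed[,2] = trimws(bed[,2])
    bed[,3] = trimws(bed[,3])
    # Please check which column of the BED file holds the position that matches POS in your VCF file.
    # In this case it is the second column, but in your case it might be the third column
    # (standard BED start coordinates are 0-based, while VCF POS is 1-based).
    # If so, use bed[,c(1,3)] instead of bed[,c(1,2)].
    need_pos = apply(bed[,1:2], 1, function(x) paste0(x, collapse = '-'))
    all_pos = apply(vcf@fix[,1:2], 1, function(x) paste0(x, collapse = '-'))
    # How many variants in the VCF are in the selected SNP list?
    table(all_pos %in% need_pos)
    # Keep only the variants at the selected positions.
    filter_vcf = vcf[all_pos %in% need_pos]
    filter_vcf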
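The matching logic in step 2 can be tried on toy data before pointing it at a real VCF. The sketch below uses base R only. The objects `bed_toy` and `fix_toy` and the helper `make_key` are hypothetical stand-ins for `bed` and `vcf@fix`, not part of the original workflow. It assumes standard BED coordinates (0-based start, 1-based end), in which case the POS of a SNP lines up with the BED end column.

```r
# Toy stand-in for the BED file: chromosome, 0-based start, 1-based end.
# Coordinates are integers so that as.character() never uses scientific notation.
bed_toy <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  start = c(99L, 499L, 1999L),
  end   = c(100L, 500L, 2000L),
  stringsAsFactors = FALSE
)

# Toy stand-in for vcf@fix[, 1:2]: a character matrix of CHROM and POS.
fix_toy <- cbind(
  CHROM = c("chr1", "chr1", "chr2", "chr2"),
  POS   = c("100", "250", "2000", "3000")
)

# Hypothetical helper: build "chrom-pos" keys with vectorised paste().
make_key <- function(chrom, pos) {
  paste(trimws(chrom), trimws(as.character(pos)), sep = "-")
}

all_pos_toy  <- make_key(fix_toy[, "CHROM"], fix_toy[, "POS"])
key_by_start <- make_key(bed_toy$chrom, bed_toy$start)
key_by_end   <- make_key(bed_toy$chrom, bed_toy$end)

# Count how many variants each BED column recovers.
hits_start <- sum(all_pos_toy %in% key_by_start)
hits_end   <- sum(all_pos_toy %in% key_by_end)

# Logical index of the variants to keep, as used to subset the vcfR object.
keep <- all_pos_toy %in% key_by_end
```

Comparing the two hit counts on your own data is a quick way to decide which BED column to use: the column that matches the VCF positions should recover most of the listed SNPs, while the other should recover few or none.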
step3: get the genotype information matrix for all the samples
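As a minimal sketch of what this step produces, the snippet below turns genotype strings into a numeric matrix of alternate-allele counts, with variants in rows and samples in columns. It uses base R on a toy matrix `gt_toy` shaped like `vcf@gt` (a FORMAT column followed by one column per sample). The sample names and the helper `gt_to_dosage` are hypothetical, and the sketch assumes biallelic diploid calls with GT as the first FORMAT field.

```r
# Toy stand-in for vcf@gt: FORMAT column plus one column per sample.
gt_toy <- cbind(
  FORMAT  = c("GT:DP", "GT:DP", "GT:DP"),
  sampleA = c("0/0:12", "0/1:9", "1/1:15"),
  sampleB = c("0/1:8",  "./.:0", "1|1:20")
)

# Hypothetical helper: convert GT strings to alternate-allele counts (0, 1, 2).
gt_to_dosage <- function(gt) {
  # Drop the FORMAT column and keep only the sample columns.
  calls <- gt[, colnames(gt) != "FORMAT", drop = FALSE]
  # Keep only the GT field, i.e. everything before the first colon.
  calls[] <- sub(":.*$", "", calls)
  # Start from NA so that missing or unrecognised calls stay missing.
  dosage <- matrix(NA_integer_, nrow(calls), ncol(calls),
                   dimnames = dimnames(calls))
  dosage[calls %in% c("0/0", "0|0")] <- 0L
  dosage[calls %in% c("0/1", "1/0", "0|1", "1|0")] <- 1L
  dosage[calls %in% c("1/1", "1|1")] <- 2L
  dosage
}

dosage_toy <- gt_to_dosage(gt_toy)
dosage_toy
```

A matrix like this, transposed so that samples are in rows and with missing calls filtered or imputed, is the kind of numeric input that `prcomp()` expects for the PCA mentioned at the start.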