2. QC Filtering

1. Background
2. Produce input data
2. Individual call rate
3. Marker metrics and filtering
- 3.1 Calculate genotype counts, call rate and maf
- 3.2 Call Rate

1. Background

In the file 1_DataPrep_Stacks I produced a catalogue of loci from raw fastQ files for all Litoria serrata/L. myola samples we have sequences thus far.

Here I will assess the Henrietta Creek (HC) linkage family presented in the paper, looking at quality metrics for the different markers. I will then filter both individuals and markers by Call Rate, and then filter markers by MAF.

2. Produce input data

The first necessary step is to export family specific data from the catalog. I do this using the populations module of stacks, ran within the script 3_5_populations_linMap.sh.

#!/bin/bash

#PBS -N populations-array
#PBS -l ncpus=20
#PBS -l mem=10g

##### Change Directory #####
cd ~/ProjectDirectory/1_DataPrep

##### Load stacks #####
module load stacks/2.55

log_file=3_5_populations_linMap_HC.oe
files_dir=3_calling/1_ustacks/
out_dir=3_calling/2_outputs/LM_HC

##### run populations #####
populations -P $files_dir -M 0_metadata/3_5_popfile_linMap_HC --out-path $out_dir -t 20 -R 0.5 --min-maf 0.02 --v

Here, the populations command retains only markers present in more than 50% of samples, and with a minor allele frequency larger than 0.02, or 2% (i.e., the minor allele has to be present in at least 2% of samples.)

For the HC family we removed markers where the minor allele was present in less than 7-13 individuals (317 individuals total, 634 alleles).

Note that the range depends on whether the minor allele is present only as homozygote genotypes (lower bound, 14 counts for the rare allele in 7 individuals) or only in heterozygote genotypes (upper bound, 14 count for the rare allele in 14 individuals).

I then submitted the job with:

qsub -v "FAM=HC" ./0_scripts/3_5_populations_linMap.sh

Then I want to export to filter further, to retain only bi-allelic alleles, and markers with mean read depth between 7 and 50, as well as silencing individual genotype calls outside these bounds.

I do so with vcftools with the following script

#!/bin/bash

#PBS -N vcftools-filtering
#PBS -l ncpus=1
#PBS -l mem=10g

##### Change Directory #####
cd ~/ProjectDirectory/1_DataPrep

##### Load vcftools #####
##### note, you need to load conda and then activate the vcftools environment 
module load conda3
source $CONDA_PROF/conda.sh
conda activate vcftools-0.1.16
module load vcftools

log_file=3_6_vcftools_LM_HC.oe
files_dir=./3_calling/2_outputs/LM_HC

##### run vcftools #####
vcftools --vcf $files_dir/populations.snps.vcf --out $files_dir/3_6_vcftools_HC --min-alleles 2 --max-alleles 2 --min-meanDP 5 --max-meanDP 50 --minDP 5 --maxDP 50  --012 &> 0_logfiles/$log_file

Submitted with the command:

qsub -v "FAM=HC" 0_scripts/3_6_vcftools_LM.sh

The -v "FAM=HC" part is required as I prepared these scripts to work with more families not presented in this study.

This produces 3 files for the HC family:

3_6_vcftools_HC.012: contains genotype data in a matrix in 012 format, where the 012 indicate the number of snp alleles.
3_6_vcftools_HC.012.indv: contains individual ids
3_6_vcftools_HC.012.pos: contains the marker ids

Let’s turn these 3 files into a single dataframe. I produced a shell script that does this, using the 3 files mentioned above as input.

#obtain FAM argument from first argument provided to script
GENO="$1"
INDV=${GENO}.indv
LOCI=${GENO}.pos

# remove first column
cut -d$'\t' -f2- $GENO > temp1

# add row names - sample ids
paste -d$'\t' $INDV temp1 > temp2

# replace tabs with underscores
sed -e 's/\t/_/g' $LOCI > temp3

# add 'sampleID' on top of locus id
echo "sampleID" > temp4
cat temp4 temp3 > temp5

#transpose locus ids by replacing newline with tab, then replace last tab with newline
awk 'BEGIN { ORS = "\t" } { print }' temp5 | sed 's/\t$/\n/' > temp6

# add locus id to 012 file
cat temp6 temp2 > ${GENO}.all

I submitted it with the below command. The for loop is not necessary, I just use it to iterate between the four families. The key is that the three .012 files for each population/family need to have the same prefix.

./0_AddNamesTo012.sh 3_6_vcftools_HC.012

It produces a file 3_6_vcftools_HC.012.all. I will now read this into R to assess locus and individual statistics.

2. Individual call rate

Here I read the input data from vcftools, and assess call rate across individuals. I will remove individuals with too low call rate, as well as retaining only one replicate per replicate pair, retaining the one with the highest call rate.

library(tidyverse)

## Warning: package 'tidyr' was built under R version 4.0.5

library(janitor)
library(stringr)
library(gridExtra)
library(ggpubr)

my.df <- read_delim("../2_lepmap/1_filtering/3_6_vcftools_HC.012.all",
                    delim = "\t")

Let’s check how many individuals and markers there are

dim(my.df)

## [1]   389 17799

There are 389 individuals (this includes replicates) and 17,799 markers. Now, let’s calculate call rate for each individual

CallRate <- 1 - rowSums(my.df[,2:ncol(my.df)]=="-1")/(ncol(my.df)-1)

CallRate <- tibble("sampleID"=my.df$sampleID,
                 "callRate"=as.numeric(CallRate))

Plot call rate by individual

p1 <- ggplot(CallRate) +
  geom_boxplot(aes(y=callRate)) +
  ylim(0.25,1) + theme_classic() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

p1

Most individuals sit at a call rate between 0.75 and 0.9. HC family has a few individuals below 0.5. We would probably want to remove these. Let’s look at which individuals fall below a call rate of 0.6.

CallRate %>% filter(callRate < 0.6) %>% pull(sampleID)

##  [1] "LS_LM_HC_090"   "LS_LM_HC_118_b" "LS_LM_HC_123_a" "LS_LM_HC_139_b"
##  [5] "LS_LM_HC_148_b" "LS_LM_HC_152_b" "LS_LM_HC_153_b" "LS_LM_HC_154_b"
##  [9] "LS_LM_HC_265_b" "LS_LM_HC_274_b" "LS_LM_HC_275_b" "LS_LM_HC_278"  
## [13] "LS_LM_HC_311"

Most (10/13) of the individuals with call rate < 0.6 are replicates (as denoted by the *_b* or *_a* at the end). Let’s then have a closer look at replicates. First we need to extract whether samples are replicates or not, by using the *_a* and *_b* denotation at the end of sample IDs.

CallRate <- CallRate %>%
    mutate(replica=str_extract(sampleID, "[^_]+$")) %>%
    mutate(replica=case_when(replica == "a" ~ "a",
                             replica == "b" ~ "b",
                             TRUE ~ "norep"))

Now, let’s plot call rate for *_a* and *_b* ids.

p2 <- ggplot(CallRate) +
    geom_boxplot(aes(y=callRate, x=replica)) +
    ylim(0,1)

p2

It seems that, in general, *_b* replicates have a lower call rate. But do they have a lower call rate than their respective *_a* replicate? Let’s assess.

CallRate2 <- CallRate %>%
    filter(replica != "norep") %>%
    mutate(sampleID = str_remove(sampleID, "_a$|_b$")) %>%
    pivot_wider(names_from = replica, values_from = callRate)

#also create a vector of all individuals without replicates
#which we want to retain
inds.to.retain <- CallRate %>%
    filter(replica == "norep") %>%
    pull(sampleID)

Now let’s plot a vs b replicates

p3 <- ggplot(CallRate2) +
    geom_point(aes(y=a, x=b)) +
    ylim(0,1) +
    xlim(0,1) +
    geom_abline(intercept = 0, slope = 1)

p3

## Warning: Removed 10 rows containing missing values (geom_point).

Well, apparently it’s not always the case that _a replicates have higher call rate than _b replicates. So, let’s assess which one is higher for each sample, and then create a vector containing only those samples with a higher call rate.

CallRate2 <- CallRate2 %>%
  mutate(best = case_when(is.na(a) == TRUE ~ "b",
                          is.na(b) == TRUE ~ "a",
                          a >= b ~ "a",
                          a < b ~ "b")) %>%
  mutate(sampleID = paste0(sampleID,"_",best))

inds.to.retain2 <- CallRate2 %>% pull(sampleID)

Merge individual IDs for samples with and without replicates that we want to retain

inds.to.retain <- sort(c(inds.to.retain,inds.to.retain2))

rm(inds.to.retain2)

And finally filter accordingly

my.df <- my.df[(my.df$sampleID %in% inds.to.retain),]

temp <- as_tibble(CallRate[(CallRate$sampleID %in% inds.to.retain),])

Now, let’s look at the distribution of call rate once more

p4 <- ggplot(temp) +
    geom_boxplot(aes(y=callRate)) +
    ylim(0,1)

p4

The last thing that is left is to filter samples with a call rate well outside the expected range. Let’s look at how many individuals would still be filtered with a call rate threshold of 0.6

temp %>% filter(callRate <= 0.6)

## # A tibble: 3 × 3
##   sampleID     callRate replica
##   <chr>           <dbl> <chr>  
## 1 LS_LM_HC_090    0.567 norep  
## 2 LS_LM_HC_278    0.476 norep  
## 3 LS_LM_HC_311    0.369 norep

We will use 0.6 as a filter.

inds.to.retain <- CallRate %>%
    filter(callRate >= 0.6) %>%
    pull(sampleID)

my.df <- my.df[(my.df$sampleID %in% inds.to.retain),]

Finally, save the resulting file.

#saveRDS(my.df, file = #"../1_DataPrep/2_1_FilteredIndividuals_HC.RData")

3. Marker metrics and filtering

Now that we have filtered out by individual, we can move onto assessing call rate and maf across markers. The first thing to do is to transpose to have individuals as columns instead of rows.

Read in the filtered data from the previous step (I have a lot of these intermediary steps to speed up re-running the code from a certain spot and for insurance in case anything goes wrong. There are probably better ways to do it but this is how I know to do it).

my.df <- readRDS("../1_DataPrep/2_1_FilteredIndividuals_HC.RData")

Transpose

my.df <- as_tibble(t(my.df),
                   rownames = "row_names")

## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Turn first row into column names.

my.df <- my.df %>%
  row_to_names(row_number = 1)

Remove all whitespaces produced by t.

my.df <- as_tibble(apply(my.df, 2, str_remove_all, " "))

Convert all remaining columns (which include 0,1,2 and -1) to numeric, except for the sampleID column (which actually contains markerIDs)

my.df <- my.df %>%
    mutate_at(vars(-("sampleID")),as.numeric) %>%
    rename(markerID = sampleID)

Save the resulting file

#saveRDS(my.df, file = #"../1_DataPrep/2_2_InputMarkerData_HC.RData")

Now, let’s calculate a few key metrics. In particular, we want to calculate Call Rate, MAF, and count the different genotypes.

We will add the calculated values to a second list of dataframes, so that we can retain one dataframe with only genotypes, and include all relevant metrics in the second dataframe.

3.1 Calculate genotype counts, call rate and maf

Clear the environment

rm(list = ls())

Read in trasposed data filtered by individual

my.df <- readRDS(file = "../1_DataPrep/2_2_InputMarkerData_HC.RData")

We want to calculate this on the offspring genotype only, as we will use the genotype counts to assess deviation from the expected patterns of mendelian inheritance. So, as a first step, we subset each list to contain only the offspring. Parents are denoted by FAT and MOT, so we can use the dplyr::select to exclude such columns.

my.off <- my.df %>% select(!matches("FAT|MOT"))

markerID <- my.off %>% select(markerID)
Zero <- rowSums(my.off=="0") # reference homozygote
One <- rowSums(my.off=="1") # heterozygote
Two <- rowSums(my.off=="2") # snp homozygote
Mis <- rowSums(my.off=="-1") #missing data
CallRate <- 1 - rowSums(my.off=="-1")/(ncol(my.off)-1)
MAF <- (rowSums(my.off=="1")+(rowSums(my.off=="2")*2))/(((ncol(my.off)-1)*2)-rowSums(my.off=="-1"))

metadata <- as_tibble(cbind(markerID,Zero,One,Two,Mis,CallRate,MAF))

3.2 Call Rate

p1 <- ggplot(metadata) +
  geom_histogram(aes(x=CallRate)) +
  theme_classic()

p1

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see that most markers have a Call Rate > 0.5%. While for a pop gen dataset we might want a higher threshold, for these families we have 100-300 individuals per family. Thus, even if only 50% of offspring were genotyped successfully, we would still retain the marker IF the genotypes of the offspring matched the expected mendelian segregation (will check in next step).

The markers having a Call Rate much lower than 0.5 are likely a result of silencing individual genotypes with vcftools when they had less than 7 reads. We will remove those markers. But first, for informativeness, let’s count how many fall below that threshold for each family.

nrow(metadata %>% filter(CallRate < 0.6))

## [1] 1337

Let’s display that as a proportion of total markers genotyped for each family.

nrow((metadata %>% filter(CallRate < 0.6)))/nrow(metadata)

## [1] 0.0751208

When silencing genotype calls with less than 5 reads, and with a call rate threshold of 0.6, the proportion is 7.5%.

When we silenced genotype calls with less than 7 reads (prel analyses), the proportion was around 8% of markers when using 0.5 as a threshold, or around 22% when using 0.6. Later inspection of segregation distortion showed how many problematic markers had a call rate between 0.5 and 0.6, thus I decide to go for 0.6.

We will then add a FilterStatus column to our metadata dataframes, indicating whether the marker should be retained (retain) or filtered. For markers to be filtered we will add the name of the filter that removed them (e.g., for Call Rate we will add callrate, for MAF we will ad maf).

metadata <- metadata %>%
    mutate(filterStatus = case_when(
      CallRate >= 0.6 ~ "retain",
      CallRate < 0.6 ~ "callrate")) %>%
  mutate(filterStatus = case_when(
    MAF < 0.02 ~ "maf",
    TRUE ~ filterStatus))

Now let’s produce list of names of the markers to be retained.

mrks.to.retain <- metadata %>%
    filter(filterStatus == "retain") %>%
    pull(markerID)

Let’s see what percentage of markers we would retain with the current filters

length(mrks.to.retain)/nrow(my.df)*100

## [1] 91.83616

And how many markers would be left

length(mrks.to.retain)

## [1] 16345

Let’s use those lists to filter our dataframes.

my.df <- my.df[(my.df$markerID %in% mrks.to.retain),]

my.off <- my.off[(my.off$markerID %in% mrks.to.retain),]

metadata <- metadata[(metadata$markerID %in% mrks.to.retain),]

Let’s see how many markers are left for each family now

nrow(my.df)

## [1] 16345

This should match between my.list (includes parents) and my.off (only offsprings).

nrow(my.df)==nrow(my.off)

## [1] TRUE

This is the dataframe we will use to assess segregation distortion and thus deviation from the expected patterns of mendelian inheritance.

Let’s save all three files

#saveRDS(my.df,file = #"../1_DataPrep/2_3_FilteredMarkerData_withParents_HC.RData")
#
#saveRDS(my.off,file = #"../1_DataPrep/2_3_FilteredMarkerData_withoutParents_HC.RData")
#
#saveRDS(metadata,file = #"../1_DataPrep/2_3_FilteredMarkerData_metadata_HC.RData")