1. Background

As part of a larger project, C. J. Hoskin and M. Higgie sampled a pair of Green-eyed Treefrogs in amplexus from Henrietta Creek. They placed the pair in an enclosed container set up to mimic their breeding conditions and waited until they laid a clutch of eggs. They then collected tissue samples from both parents, using sterile surgical scissors to remove a single toe pad from toe II (i.e., the second innermost) of the right foot, which was preserved in 95% ethanol. The clutch was retained until the tadpoles hatched and developed to stage X, at which point 300 tadpoles were euthanized and preserved in 95% ethanol, and the remaining tadpoles were released at the collection site.

1.1 Information on the sequencing run

All samples were sequenced at Diversity Arrays Technology (DArT) in Canberra, ACT, Australia with the DArTseq protocol. DArTseq is a double-digest genome complexity reduction method related to RADseq and GBS. Briefly, the extracted DNA is digested and ligated in a single step: two methylation-sensitive restriction enzymes cut the DNA, and barcoded adapters carrying the primer sequences are ligated to the restriction-enzyme (RE) overhangs. For Litoria, the largest genus of Australian treefrogs, the enzymes PstI and SphI are used. Samples then undergo PCR to amplify the input DNA, and are sequenced on an Illumina HiSeq. An average of 1.25 million reads per sample are sequenced.

The expected RE overhangs should then be:

  • PstI: TGCAG, located on the 5’ end
  • SphI: CATG, located on the 3’ end

Furthermore, Illumina adapters are also ligated. For this project, the following universal Illumina adapter was used:

  • AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG

The fastq files are provided by DArT already demultiplexed (i.e., one file per individual) but with the barcodes still present. The structure of each individual read will then be:

  [barcode] [PstI overhang: TGCAG] [genomic fragment] [SphI overhang: CATG] [Illumina adapter]

Note that the last two elements, the SphI overhang and the Illumina adapter, will be present in only ~25% of the reads. That is because the restriction fragments vary in size from ~20 bp to longer than the read length.
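As a quick optional sanity check of that expectation, the fraction of read-through reads can be counted directly from a fastq file (a minimal sketch; sample.fq.gz is a placeholder name, and the pattern is simply the CATG overhang followed by the start of the adapter listed above):

#count sequence lines, then those containing the SphI overhang plus adapter start
total=$(zcat sample.fq.gz | awk 'NR % 4 == 2' | wc -l)
readthrough=$(zcat sample.fq.gz | awk 'NR % 4 == 2' | grep -c 'CATGAGATCGGAAGAGC')
echo "fraction with overhang+adapter: $(echo "scale=3; $readthrough / $total" | bc)"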

Because we do not have a genome assembly yet for this species (or any hylid frog for that matter), we will be performing filtering of fastq files and calling of genotypes using stacks v2.

2. Directory structure

The directory structure for this project included one parent ProjectDirectory folder. Within this I had three main subfolders:

  • 1_DataPrep for the stacks and filtering steps
  • 2_lepmap for the linkage mapping steps
  • 3_sex for the sex-linkage analyses

Within each of these I usually have the following default folders:

  • 0_scripts
  • 0_metadata
  • 0_logfiles
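A minimal sketch of how this tree could be set up in one go (the folder names match those used throughout this Supplementary, but the exact set of defaults is an assumption):

#create the three analysis subfolders, each with the default support folders
mkdir -p ~/ProjectDirectory/{1_DataPrep,2_lepmap,3_sex}/{0_scripts,0_metadata,0_logfiles}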

For the sake of simplicity, I include all scripts within this Supplementary itself (see examples below).

3. Stripping barcodes and adapter sequences, and removing low quality sequences

This step was conducted with the process_radtags module, with the following script:

#!/bin/bash

#PBS -N processRadtags
#PBS -l nodes=1:ppn=2
#PBS -l mem=4gb
#PBS -l walltime=99:00:00

#set working directory
#this should be set to the folder containing all the fastq files from the Illumina run
cd ~/ProjectDirectory/1_DataPrep

#load stacks module
#The JCU HPC used to work with modules
#It now switched to singularities. Adjust accordingly if replicating this step
module load stacks

#set adapter sequence
adapter=AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG

#run process_radtags module
process_radtags -p ./path/to/folder/of/fastq/files/ -o ./output/folder/ -b ./path/to/barcodes/file -c -q -r \
--inline_null --renz_1 pstI --renz_2 sphI --adapter_1 $adapter --adapter_mm 1 \
&> ./path/to/logfile.oe

4. Stacks population files

DArT produces replicates of ~15% of submitted samples. These are technical replicates: the DNA extraction is the same, but library preparation is done independently for the two replicates (i.e., a different barcode for each library prep of the same replicated individual). For our linkage mapping parents, we did up to 4 replicates per parent. This ensures high read depth for the parents, which in turn increases the accuracy of our linkage maps. To take advantage of this, I merged the replicates for the parents with a simple cat command (concatenated gzip files are themselves valid gzip files):

#father
cat LS_LM_HC_FAT_M1_* > LS_LM_HC_FAT_M1.fq.gz
#mother
cat LS_LM_HC_MOT_F1_* > LS_LM_HC_MOT_F1.fq.gz

I then removed the individual replicates with rm *_FAT_M1_* and rm *_MOT_F1_*.
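Because concatenated gzip streams are valid gzip files, an optional integrity check can be run on the merged files (ideally before the rm; gzip -t tests an archive without writing anything):

#verify the merged files are valid gzip archives
gzip -t LS_LM_HC_FAT_M1.fq.gz LS_LM_HC_MOT_F1.fq.gz && echo "merged files OK"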

Because of my “complex” design (i.e., 1 linkage map family and ~40 adult individuals) I decided not to use the denovo_map.pl module, as it would include all individuals in building the catalogue. Instead, following the stacks manual, section 4.4.3, I decided to build the catalogue with all adult individuals and the two linkage mapping cross parents, but without the offspring. From the manual:

Since the parents of a mapping cross contain all possible alleles that can occur in the progeny, it makes sense to build the catalog from only the parents (thus excluding error from the progeny).

To achieve this, I first have to produce six population maps:

For the one containing all samples, I simply used the below command, which lists all fq.gz files in the 1_cleaned directory, removes the extension and replaces it with all. The pop identifier all and the sample ID are separated by a tab.

echo 1_cleaned/*fq.gz | sed 's/.fq.gz /\tall\n/g' | sed 's/.fq.gz/\tall/g' > 0_metadata/0_popfile_all

I then manually removed the unnecessary samples from the other pop files. Additionally, for the pop file containing only the linkage mapping family I changed all to progeny for all offspring, and to parent for the two parents of that cross. A popfile for an individual cross would then look like:

LS_LM_HC_307    progeny
LS_LM_HC_308    progeny
LS_LM_HC_309    progeny
LS_LM_HC_310    progeny
LS_LM_HC_311    progeny
LS_LM_HC_312    progeny
LS_LM_HC_314    progeny
LS_LM_HC_315    progeny
LS_LM_HC_FAT_M1 parent
LS_LM_HC_MOT_F1 parent

Finally, following the standard stacks pipeline, before calling genotypes on the whole set of individuals we will run the pipeline on a representative subset of individuals and across a number of values for the M, m and n parameters. This will allow us to assess the number of loci and SNPs identified across parameter space, thus gaining insight into the best set of parameter values. Additionally, we will make use of the DArT replicates, and estimate the error rate using the replicates method of Tiger.

I will thus need subsets of these population files. I simply removed unwanted individuals and saved each pop file with *_error* appended to its name, within the same metadata folder. To have a representative subsample, I kept:

  • all linkage map parents
  • 5 individuals per linkage map family, including only individuals that had replicates
  • 1 individual per sex per sampling locality, including individuals that had replicates where possible

This resulted in a total of 96 fastq files, and 66 individuals (i.e., 30 individuals were replicated, 20 from linkage mapping families and 10 from adult sexed individuals).

5. Fine tuning stacks

There are 3 main parameters we want to investigate: M, m and n. The first two relate to the ustacks module of stacks, and refer to the maximum distance allowed between stacks and the minimum depth of coverage required to build a stack, respectively. The last one refers to the maximum number of mismatches allowed between sample loci when building the catalogue. We want to explore the following values for M, m and n respectively: 1-10, 2-15 and 1-5. We will do so while keeping the other values at their defaults (2, 3 and 1 respectively). I decided not to investigate an m value of 1, as it was taking a very long time to run and it is not a very sensible value anyway.

All output from this fine-tuning section is stored within the 2_error folder, and each parameter combination is saved within the subfolder parTest.$par$value, where par is either M, m or n, and value is one of the values investigated for that parameter. For instance, for the n parameter we have 5 folders: parTest.n1, parTest.n2, parTest.n3, parTest.n4, parTest.n5.
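As far as I know, stacks does not create missing output directories, so these parTest folders need to exist before the jobs below run; a minimal sketch (paths as above):

#create one output folder per parameter value tested
for M in {1..10}; do mkdir -p 2_error/2_ustacks/parTest.M$M; done
for m in {2..15}; do mkdir -p 2_error/2_ustacks/parTest.m$m; done
for n in {1..5}; do mkdir -p 2_error/2_ustacks/parTest.n$n; done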

5.1. ustacks

The first step will be to run ustacks. Because of the number of individuals and combinations of M and m, I decided to submit a job for each individual and parameter combination (e.g., Ind1 M1, Ind1 M2, … , Ind1 m2, …, IndN m15). To achieve this I used job arrays.

The first step was to produce files of all possible combinations, one line per combination, in the format sampleID<tab>value. I produced one file for M and one for m. For n we didn’t need to do this sample by sample: the only step where samples are addressed individually is the ustacks module, and for the differing values of n we used the default values of M and m there.

First I produce files containing the parameters required for the jobs within the job array, which are the sample ID and the value of M/m.

#loop through sample ids in popfile metadata file (see above)
#this is retrieved from first column, f1, of file
for SAMPLE_ID in $(cut -f1 0_metadata/0_popfile_all_error)
do
#loop through values of M
  for M in 1 2 3 4 5 6 7 8 9 10
    do
      printf "%s\t%s\n" $SAMPLE_ID $M
    done
    #write to file
done > 0_metadata/2_2_ustacks_array_M

for SAMPLE_ID in $(cut -f1 0_metadata/0_popfile_all_error)
do
  for m in 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    do
      printf "%s\t%s\n" $SAMPLE_ID $m
    done
done > 0_metadata/2_2_ustacks_array_m

Then, because the JCU high-performance research cluster (HPRC) has a limit of 200 jobs per array, I had to split my job arrays; this would not be necessary on a cluster without such a limit. I split the files I just produced with the below commands (with -d and -a 4, split numbers the output files 0000, 0001, and so on), then submitted one job for each of the resulting files.

split -da 4 -l 192 0_metadata/2_2_ustacks_array_M 0_metadata/2_2_ustacks_array_M_ --additional-suffix=".log"

split -da 4 -l 180 0_metadata/2_2_ustacks_array_m 0_metadata/2_2_ustacks_array_m_ --additional-suffix=".log"

For the ustacks module the job array script looked like this:

#!/bin/bash

#PBS -N ustacks-array
#PBS -l ncpus=1
#PBS -l mem=2g
#PBS -J 1-192

##### Change Directory to stacks working directory #####
cd ~/ProjectDirectory/1_DataPrep

##### Obtain parameters from par file #####
parameters=`sed -n "${PBS_ARRAY_INDEX} p" 0_metadata/2_2_ustacks_array_0001.log`

parameterArray=($parameters)

SAMPLE_ID=${parameterArray[0]}
M=${parameterArray[1]}

##### Load stacks #####
module load stacks

##### Set log file, out dir and unique id for the sample and sample output #####
log_file=2_2_ustacks.$SAMPLE_ID.$M.oe
out_dir=2_error/2_ustacks/parTest.M$M
unique_id=$((${PBS_ARRAY_INDEX} + 192)) # for first script I didn't add, for second I added 192, for third I added 192*2, etc.

##### run ustacks #####
ustacks -f 2_error/1_input/$SAMPLE_ID.fq.gz -i $unique_id -o $out_dir -M $M -m 3 -N 4 &> ./0_logfiles/$log_file

These scripts are all saved with the naming scheme 2_2_ustacks_error_M_0001.sh, where the M part represents the parameter being tested (M or m), and the 0001 represents which set of parameter and individual combinations is being run. For instance, for M we had 960 combinations, but a 200-job limit per array, so I divided them into 5 jobs of 192 each.

Once everything has run, as recommended by Rochette and Catchen, I always check for errors in my logfiles with the command grep -iE "\b(err|e:|warn|w:|fail|abort)" *.oe. As you can see in the above array script, I save all my logfiles within the 0_logfiles directory, so either run the command from that directory or include the path in the command. If nothing is returned, then it means no errors were reported in the log files.

5.2 cstacks to gstacks

Now we can run the modules from cstacks to gstacks. There are a couple of things, though, that we have to remember. First of all, we have to ensure we use the right pop files, following the stacks manual section on using stacks with genetic crosses. Secondly, while for the M and m parameters we can just run the remaining modules, for the n parameter we will also have to run the cstacks module with each of the 5 values we are exploring.

Once again, we will take advantage of array jobs. Thus the first step is to produce the file we will be using. In this case we want a file with one line per combination, in the format parameter<tab>value, where parameter is either M, m or n.

for M in {1..10}
do
  printf "%s\t%s\n" M $M
done > 0_metadata/2_3_cstacks_array

for m in {2..15}
do
  printf "%s\t%s\n" m $m
done >> 0_metadata/2_3_cstacks_array

for n in {1..5}
do
  printf "%s\t%s\n" n $n
done >> 0_metadata/2_3_cstacks_array

Then we can submit the below job array script with qsub:

#!/bin/bash

#PBS -N cstacks-array
#PBS -l ncpus=8
#PBS -l mem=4g
#PBS -J 1-30

##### Change Directory #####
cd ~/ProjectDirectory/1_DataPrep

##### Obtain parameters from par file #####
parameters=`sed -n "${PBS_ARRAY_INDEX} p" 0_metadata/2_3_cstacks_array`

parameterArray=($parameters)

PAR=${parameterArray[0]}
VALUE=${parameterArray[1]}

##### Load stacks #####
module load stacks/2.55

log_file=2_3_cstacks.$PAR.$VALUE.oe
files_dir=2_error/2_ustacks/parTest.$PAR$VALUE/

##### run cstacks #####
############################################################################
# Note here how, if the parameter is n we change the value of n in cstacks #
# otherwise we keep it at its default of 1. Furthermore, note that here    #
# the pop file doesn't include the offspring, while from sstacks onwards   #
# it does.                                                                 #
############################################################################
if [ "$PAR" == 'n' ]
then
    cstacks -n $VALUE -P $files_dir -M 0_metadata/0_popfile_parents_error -p 8
else
    cstacks -n 1 -P $files_dir -M 0_metadata/0_popfile_parents_error -p 8
fi

##### run sstacks #####
sstacks -P $files_dir -M 0_metadata/0_popfile_all_error -p 8

##### run tsv2bam #####
tsv2bam -P $files_dir -M 0_metadata/0_popfile_all_error -t 8

##### run gstacks #####
gstacks -P $files_dir -M 0_metadata/0_popfile_all_error -t 8

We have now completed running the exploratory stacks analyses assessing a range of M, m and n values. Next, we will move on to calculating the error rate for these different runs.

5.3 Calculate error rate with Tiger

A great thing about DArTseq data is that ~15% of samples from each sequencing order are replicated. This is done by conducting library preparation and sequencing twice for each of those replicated samples. If the order is large enough (roughly 188 samples or more), those replicates are also run on a separate flowcell.

We can use this type of data to estimate the genotyping error rate (or, more accurately, the mismatch rate between two replicates) using Tiger. Tiger is described in its accompanying paper, and documentation is available on the software wiki.

To run Tiger, we need a txt file containing a header, followed by 3 columns:

  • filename
  • replicate group
  • batch group (i.e. OrderCellLane)

I produced this by hand, and saved it as 2_6_tiger_metadata.txt, within the folder 2_error/6_tiger/.

It looks like this:

cat ../1_stacks/2_error/6_tiger/2_6_tiger_metadata.txt | head
## genotype genoUnique  OrderCellLane
## LM_LM_KI_008_a   LM_LM_KI_008    DFr16-2132-C9C2CANXX-7
## LM_LM_KI_008_b   LM_LM_KI_008    DFr16-2132-C9C2CANXX-7
## LM_LM_KI_055_a   LM_LM_KI_055    DLit16-2508-CA95VANXX-6
## LM_LM_KI_055_b   LM_LM_KI_055    DLit16-2508-C9P32ANXX-3
## LM_LM_KI_071_a   LM_LM_KI_071    DLit16-2508-CA95VANXX-6
## LM_LM_KI_071_b   LM_LM_KI_071    DLit16-2508-C9P32ANXX-3
## LM_LM_KI_134_a   LM_LM_KI_134    DLit16-2508-CA95VANXX-6
## LM_LM_KI_134_b   LM_LM_KI_134    DLit16-2508-C9P32ANXX-3
## LM_LM_KI_141_a   LM_LM_KI_141    DLit16-2508-CA95VANXX-6

Then we need an input vcf file. So far we have only gotten to the gstacks module of stacks. So we will need to use the populations module to obtain a vcf file.

I ran this step on a stacks catalogue containing 3 species, from 4 linkage mapping crosses and 2 population genomic studies [paper to come soon]. Using the standard -r 0.8 filter of populations would thus likely remove many loci that are present in only one species, or only one family. To circumvent this, we will run populations with -R 0.05 and -r 0.6, and then use vcftools to further filter our vcf. In particular, we will retain only bi-allelic markers with a minimum mean depth of 5 and a maximum mean depth of 50.

The populations script I used is called 2_4_populations_error.sh, and it looks like this:

#!/bin/bash

#PBS -N populations-array
#PBS -l ncpus=8
#PBS -l mem=4g
#PBS -J 1-30

##### Change Directory #####
cd ~/ProjectDirectory/1_DataPrep

##### Obtain parameters from par file #####
parameters=`sed -n "${PBS_ARRAY_INDEX} p" 0_metadata/2_3_cstacks_array`

parameterArray=($parameters)

PAR=${parameterArray[0]}
VALUE=${parameterArray[1]}

##### Load stacks #####
module load stacks/2.55

log_file=2_4_populations.$PAR.$VALUE.oe
files_dir=2_error/2_ustacks/parTest.$PAR$VALUE/

##### run populations #####
populations -P $files_dir -M 0_metadata/0_popfile_pops_error --out-path $files_dir -t 8 -R 0.05 -r 0.6 --vcf --genepop

After exporting vcf files with the above filters, I then filter with vcftools to retain only bi-allelic markers with a minimum mean read depth of 5 and a maximum mean read depth of 50. Ideally I would look at the distribution of read depth to decide those thresholds. I have done so multiple times before for various subsets of this data (i.e., the same exact fastq files) and have found 5 and 50 to be good lower and upper bounds: not too conservative, but removing major outliers and the long tail.
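For completeness, a minimal sketch of how such a depth distribution could be inspected with vcftools (--site-mean-depth writes per-site mean depths to a .ldepth.mean file; the file names here are placeholders):

#write per-site mean depth (columns: CHROM POS MEAN_DEPTH VAR_DEPTH)
vcftools --vcf populations.snps.vcf --site-mean-depth --out depth_check

#summarise the MEAN_DEPTH column: median and 99th percentile
tail -n +2 depth_check.ldepth.mean | cut -f3 | sort -n | \
  awk '{a[NR]=$1} END {print "median:", a[int(NR/2)]; print "p99:", a[int(NR*0.99)]}'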

The script for filtering vcf files is 2_5_vcftools_error.sh, and contains the following:

#!/bin/bash

#PBS -N vcftools-filtering
#PBS -l ncpus=4
#PBS -l mem=4g
#PBS -J 1-29

##### Change Directory #####
cd ~/ProjectDirectory/1_DataPrep

##### Obtain parameters from par file #####
parameters=`sed -n "${PBS_ARRAY_INDEX} p" 0_metadata/2_3_cstacks_array_NOm1`

parameterArray=($parameters)

PAR=${parameterArray[0]}
VALUE=${parameterArray[1]}

##### Load stacks #####
module load vcftools

log_file=2_5_vcftools.$PAR.$VALUE.oe
files_dir=./2_error/2_ustacks/parTest.$PAR$VALUE

##### run vcftools #####
vcftools --vcf $files_dir/populations.snps.vcf --out $files_dir/vcftools_$PAR$VALUE --min-alleles 2 --max-alleles 2 --min-meanDP 5 --max-meanDP 50 --recode

Now I have all that I need to run tiger. VCF input files are still within the 2_error/2_ustacks/parTest.$PAR$VALUE folders (e.g., parTest.M1 for an M value of 1). The tiger metadata 3-column file is within the tiger folder 2_error/6_tiger. I run tiger with the script 2_6_tiger_error.sh.

#!/bin/bash

#PBS -N Tiger-genoError
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -l mem=10gb
#PBS -l walltime=999:00:00
#PBS -m ae 
#PBS -M lorenzo.bertola@my.jcu.edu.au
#PBS -J 1-29

##### Change Directory #####
cd ~/ProjectDirectory/1_DataPrep

##### Obtain parameters from par file #####
parameters=`sed -n "${PBS_ARRAY_INDEX} p" 0_metadata/2_3_cstacks_array_NOm1`

parameterArray=($parameters)

PAR=${parameterArray[0]}
VALUE=${parameterArray[1]}

###### load the tiger module #####
module load tiger

##### cd into output directory #####
cd ./2_error/6_tiger/parTest.$PAR$VALUE

##### define vcf file #####
vcf=../../2_ustacks/parTest.$PAR$VALUE/vcftools_$PAR$VALUE.recode.vcf
meta=../2_6_tiger_metadata.txt

tiger task=estimateIndReps vcf=$vcf groups=$meta groupCol=2 batches=$meta batchesCol=3 verbose > ../../../0_logfiles/2_6_tiger_$PAR$VALUE.logfile

Furthermore, because for many lanes I had only 1-3 individuals in this set, and because in previous analyses I had not encountered differential error rates across lanes, I decided to remove the lane info and estimate the error rate on the whole set. I simply changed the 3rd column of the 2_6_tiger_metadata.txt file to one single value (i.e., mixed) and re-ran tiger.
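A one-liner sketch of that column change (assuming the metadata file is tab-separated; the output name is a placeholder):

#set the 3rd column (OrderCellLane) to "mixed" for every row except the header
awk 'BEGIN{FS=OFS="\t"} NR==1 {print; next} {$3="mixed"; print}' \
  2_6_tiger_metadata.txt > 2_6_tiger_metadata_noLane.txt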

At this point I have one file per parameter set, for a total of 29 files. Each file has 7 columns, namely:

  • Batch ID: in this case “mixed” for all samples
  • minDepth
  • maxDepth
  • Size: as in, how many calls with that depth
  • Error rate
  • Error rate homozygotes
  • Error rate heterozygotes

Let’s look at one file, for the default stacks parameters (i.e., M=2, m=3, n=1)

library(tidyverse)
my.df <- read_delim("../1_stacks/2_error/6_tiger/2_6_tiger_error_M2_noLane_errorRates.txt", delim = "\t")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Batch = col_character(),
##   minDepth = col_double(),
##   maxDepth = col_double(),
##   Size = col_double(),
##   ErrorRate = col_double(),
##   ErrorRateHom = col_double(),
##   ErrorRateHet = col_double()
## )
ggplot(data = my.df) +
  geom_line(aes(x=minDepth, y=ErrorRate),colour="black") +
  geom_line(aes(x=minDepth, y=ErrorRateHom), colour="red") +
  geom_line(aes(x=minDepth, y=ErrorRateHet), colour="blue") +
  xlab("Genotype read depth") +
  theme_bw()

We can see that past a read depth of 50 the error rate really skyrockets, jumping between 0 and 0.5. I filtered the stacks output for a max-meanDP of 50, but I haven’t filtered for maxDP yet. The former filters sites by their average depth, calculated across all individuals, while the latter filters individual genotypes by their depth. When moving forward with the pipeline we will also apply the latter. For now, let’s look at how many genotypes we have across depths.

ggplot(data=my.df) + geom_col(aes(x=minDepth, y=Size)) +
  ylab("Count") + xlab("Genotype read depth") + theme_bw()

This shows us that most genotype calls have a depth between 0 and 25. It also points to the fact that the high error rate after a read depth of 50 is probably due to the lack of calls with that depth. Thus while looking at error rate we can ignore genotypes with a depth > 50, as we will likely filter those out.

ggplot(data=my.df %>% filter(minDepth <= 40)) +
  geom_col(aes(x=minDepth, y=Size)) +
  ylab("Count") + xlab("Genotype read depth") + theme_bw()

Let us then filter our dataframe for genotypes with a max read depth of 50, and replot the original graph.

my.df <- my.df %>% filter(minDepth <= 50)

ggplot(data = my.df) +
  geom_line(aes(x=minDepth, y=ErrorRate),colour="black") +
  geom_line(aes(x=minDepth, y=ErrorRateHom), colour="red") +
  geom_line(aes(x=minDepth, y=ErrorRateHet), colour="blue") +
  xlab("Genotype read depth") +
  #geom_hline(aes(yintercept=0.02)) + 
  theme_bw()

We can see that the error rate for homozygotes (red) stays low across all depths, while the error rate for heterozygotes is much higher at read depths below 5-10. Unfortunately, between 5 and 10 is also where most of our genotype calls fall. To achieve an error rate < 5% across our dataset we will probably have to remove all genotype calls with a read depth below at least 7, and maybe 10. But now, let’s look at the error rate across all parameter sets for stacks.

First we read all the error txt files produced by tiger, which were named according to the parameters used; we can extract the parameter information from their filenames.

parList <- list()

namesList <- list.files(path = "../1_stacks/2_error/6_tiger/", pattern = "2_6_tiger_error_*")
namesList <- lapply(namesList, function(x){
  x <- str_match(x, '([^_]+)(?:_[^_]+){2}$')[,2]
  return(x)
})

parList <- list.files(path = "../1_stacks/2_error/6_tiger/", pattern = "2_6_tiger_error_*", full.names = TRUE, include.dirs = TRUE)
names(parList) <- unlist(namesList)

parList <- lapply(parList, read_delim, delim = "\t")
rm(namesList)

Then we will filter all of them to retain only results for depth between 0 and 50, as we saw that after that the error rate increases substantially.

parList <- lapply(parList, function(x){
  x <- x %>% filter(minDepth <= 50)
  return(x)
})

Now, as a first step we calculate the average error rate for each parameter set: overall, for homozygotes, and for heterozygotes.

#calculate mean errors for each dataset
errList <- lapply(parList, function(x){
  temp <- tibble(error = mean(x$ErrorRate),
                 errHom = mean(x$ErrorRateHom),
                 errHet = mean(x$ErrorRateHet))
  return(temp)
})

#add parset name
errList <- Map(function(X,Y) {
  X$parset <- Y
  X
}, X=errList,Y=names(parList))

#bind all to a dataframe
err.df <- bind_rows(errList)
err.df$parset <- factor(err.df$parset,
                        levels = c("lM1","lM2","lM3","lM4","lM5",
                                   "lM6","lM7","lM8","lM9","lM10",
                                   "m1","m2","m3","m4","m5",
                                   "m6","m7","m8","m9","m10",
                                   "m11","m12","m13","m14","m15",
                                   "n1","n2","n3","n4","n5"))
ggplot(data = err.df) +
  geom_point(aes(x=parset, y=error), colour="black") +
  geom_point(aes(x=parset, y=errHom), colour="red") +
  geom_point(aes(x=parset, y=errHet), colour="blue") +
  ylab("Error rate") + xlab(element_blank()) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

From this alone it looks like M = 2 and m = 2 achieve the lowest error, while different values of n do not affect the error significantly.

Let’s look at how this differs when we remove calls with a read depth of 7 or below.

parList2 <- lapply(parList, function(x){
  x <- x %>% filter(minDepth > 7)
  return(x)
})

Again, we calculate the average error rate for each parameter set: overall, for homozygotes, and for heterozygotes.

#calculate mean errors for each dataset
errList2 <- lapply(parList2, function(x){
  temp <- tibble(error = mean(x$ErrorRate),
                 errHom = mean(x$ErrorRateHom),
                 errHet = mean(x$ErrorRateHet))
  return(temp)
})

#add parset name
errList2 <- Map(function(X,Y) {
  X$parset <- Y
  X
}, X=errList2,Y=names(parList2))

#bind all to a dataframe
err.df2 <- bind_rows(errList2)
err.df2$parset <- factor(err.df2$parset,
                        levels = c("lM1","lM2","lM3","lM4","lM5",
                                   "lM6","lM7","lM8","lM9","lM10",
                                   "m1","m2","m3","m4","m5",
                                   "m6","m7","m8","m9","m10",
                                   "m11","m12","m13","m14","m15",
                                   "n1","n2","n3","n4","n5"))
ggplot(data = err.df2) +
  geom_point(aes(x=parset, y=error), colour="black") +
  geom_point(aes(x=parset, y=errHom), colour="red") +
  geom_point(aes(x=parset, y=errHet), colour="blue") +
  ylab("Error rate") + xlab(element_blank()) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

This shows us that part of the pattern observed on the whole dataset is the result of stochasticity in genotype calls with low read depth. This would indicate values of 4 for both M and m as better candidates for our parameters.

FROM THIS WE TAKE THAT: M=4 and m=4 are good values, and n we leave at its default (we can’t assess the error rate across n values using replicates, as n acts after stacks are assembled within individuals). We also need to filter out genotype calls with read depth < 7 using vcftools, as below this depth we get a high mismatch rate for heterozygotes, probably due to too few reads to successfully call heterozygotes.
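A minimal sketch of that genotype-level filter (file names are placeholders): unlike the --min-meanDP site filter used above, vcftools’ --minDP recodes individual genotype calls below the threshold as missing.

#silence genotype calls with read depth < 7
vcftools --vcf populations.snps.vcf --minDP 7 --recode --recode-INFO-all \
  --out populations.snps.minDP7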

5.4 Number of retrieved markers across parameter sets

The populations module reports in its log files the number of total loci, variant sites and polymorphic sites for each population. We have one populations.log for each parameter set explored, for a total of 29 log files. I extracted the relevant information with the script 2_7_extractSitesCounts.sh, which is as follows:

# create the report file with header
echo -e "parTest pop all variant polymorphic" > 2_7_error_sites.txt

#loop through the parameters you explored. In this case M, m and n
for par in M m n
do
#loop through the different values. Note each parameter set is in a folder named as follows: parTest.$par$i, where par comes from the first loop and i from the second
for i in ../2_error/2_ustacks/parTest.$par*
do
KI=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==1 {print $10}'`
WC=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==2 {print $10}'`
LM=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==3 {print $10}'`
FC=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==4 {print $10}'`
LS=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==5 {print $10}'`
HC=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==6 {print $10}'`
echo -e $i" KI "$KI | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 2_ustacks parTest.//g' >> 2_7_error_sites.txt
echo -e $i" WC "$WC | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 2_ustacks parTest.//g' >> 2_7_error_sites.txt
echo -e $i" LM "$LM | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 2_ustacks parTest.//g' >> 2_7_error_sites.txt
echo -e $i" FC "$FC | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 2_ustacks parTest.//g' >> 2_7_error_sites.txt
echo -e $i" LS "$LS | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 2_ustacks parTest.//g' >> 2_7_error_sites.txt
echo -e $i" HC "$HC | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 2_ustacks parTest.//g' >> 2_7_error_sites.txt
done
done

Now let’s import the resulting file, and see how many markers we obtain across parameter space.

library(tidyverse)
sites.df <- read_delim("../1_stacks/2_error/7_sites/2_7_error_sites.txt",
                       delim = " ")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   parTest = col_character(),
##   pop = col_character(),
##   all = col_double(),
##   variant = col_double(),
##   polymorphic = col_double()
## )
glimpse(sites.df)
## Rows: 174
## Columns: 5
## $ parTest     <chr> "M1", "M1", "M1", "M1", "M1", "M1", "M10", "M10", "M10", "…
## $ pop         <chr> "KI", "WC", "LM", "FC", "LS", "HC", "KI", "WC", "LM", "FC"…
## $ all         <dbl> 2363195, 2191553, 2405801, 3856944, 1846673, 1956140, 1962…
## $ variant     <dbl> 48567, 46270, 53092, 41060, 42056, 42627, 59960, 57125, 65…
## $ polymorphic <dbl> 14947, 14388, 33161, 16621, 20596, 16494, 20689, 19373, 43…

We first divide the parTest column into a par column (i.e., M, m and n) and a value column, with values ranging from 1 to 15.

sites.df <- sites.df %>%
  mutate(par = substr(x = parTest, 1, 1)) %>%
  #nchar() gives the string length; length() would return the length of the vector
  mutate(value = as.double(substr(x = parTest, 2, nchar(parTest))))

Now let’s plot:

  • the total number of loci recovered
  • the number of variant sites
  • the number of polymorphic sites

5.4.1 Total number of loci

p1 <- ggplot(sites.df) + 
  geom_line(aes(x=value, y=all, group=par, color=par),
            size = .2) +
  geom_point(aes(x=value, y=all, group=par, color=par),
             size = .4) +
  scale_x_continuous(breaks = seq(1,15,2)) +
  facet_wrap(~pop) +
  ylab("Total number of loci shared by 60% of samples") +
  xlab(element_blank()) + theme_bw() +
  theme(legend.title = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.line = element_line(colour = "black"))
p1

5.4.2 Number of variant loci

p2 <- ggplot(sites.df) + 
  geom_line(aes(x=value, y=variant, group=par, color=par),
            size = .2) +
  geom_point(aes(x=value, y=variant, group=par, color=par),
             size = .4) +
  scale_x_continuous(breaks = seq(1,15,2)) +
  facet_wrap(~pop) +
  ylab("N. of variant loci shared by 60% of samples") +
  xlab(element_blank()) + theme_bw() +
  theme(legend.title = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.line = element_line(colour = "black"))
p2

5.4.3 Number of polymorphic loci

p3 <- ggplot(sites.df) + 
  geom_line(aes(x=value, y=polymorphic, group=par, color=par),
            size = .2) +
  geom_point(aes(x=value, y=polymorphic, group=par, color=par),
             size = .4) +
  scale_x_continuous(breaks = seq(1,15,2)) +
  facet_wrap(~pop) +
  ylab("N. of polymorphic loci shared by 60% of samples") +
  xlab(element_blank()) + theme_bw() +
  theme(legend.title = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.line = element_line(colour = "black"))
p3

5.5 Plot summary statistics following Paris

Here we plot the same summary statistics used by Paris et al. in their roadmap paper for stacks.

5.5.1 Mean coverage

First, I look at mean coverage across values of m. I do not look at it across values of M and n, as those mostly do not affect coverage.
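The summary_mean_cov file read below was collated from the logs of the individual parameter runs; a sketch of how it could be assembled from each parTest folder’s gstacks.log.distribs, mirroring the sed extraction used in Section 7 (the exact number of header lines to drop is an assumption):

#collate per-sample effective coverage across parameter sets, tagging rows with their parTest value
for d in 2_error/2_ustacks/parTest.*
do
  par=${d##*parTest.}
  sed -n '/^BEGIN effective_coverages_per_sample/,/END effective_coverages_per_sample/p' \
    $d/gstacks.log.distribs | sed '1,3d;$d' | awk -v p=$par 'BEGIN{OFS="\t"} {print $0, p}'
done > 2_error/8_summaryMetrics/summary_mean_cov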

my.cov <- read_delim("../1_stacks/2_error/8_summaryMetrics/summary_mean_cov", delim = "\t",
                     col_names = c("sampleID","nLoci","nUsedReads",
                                   "mean_cov","mean_cov_ns","par"))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   sampleID = col_character(),
##   nLoci = col_double(),
##   nUsedReads = col_double(),
##   mean_cov = col_double(),
##   mean_cov_ns = col_double(),
##   par = col_character()
## )
my.cov <- my.cov %>% filter(str_detect(par, "m"))

my.cov$par <- factor(my.cov$par, levels = c("m2","m3","m4","m5",
                                            "m6","m7","m8","m9","m10",
                                            "m11","m12","m13","m14","m15"))


glimpse(my.cov)
## Rows: 1,344
## Columns: 6
## $ sampleID    <chr> "LM_LM_KI_008_a", "LM_LM_KI_008_b", "LM_LM_KI_055_a", "LM_…
## $ nLoci       <dbl> 17006, 14427, 7825, 8926, 8722, 10560, 9834, 11454, 9509, …
## $ nUsedReads  <dbl> 375237, 314165, 151647, 176250, 163265, 209939, 194265, 23…
## $ mean_cov    <dbl> 22.065, 21.776, 19.380, 19.746, 18.719, 19.881, 19.754, 20…
## $ mean_cov_ns <dbl> 25.741, 24.945, 20.380, 21.125, 19.942, 21.580, 21.154, 22…
## $ par         <fct> m10, m10, m10, m10, m10, m10, m10, m10, m10, m10, m10, m10…
ggplot(my.cov) + geom_boxplot(aes(x=par, y=mean_cov, group=par)) +
  xlab(element_blank()) +
  ylab("Mean coverage") +
  theme_bw()

We can see a steady increase, but no plateau like Paris and colleagues observed in some of their datasets.

5.5.2 Loci present in >= 80% of samples (r80 loci)

While previously I was a bit less strict, filtering only for loci present in 60% of samples per population and 5% of samples in total, here, to ensure that we obtain the same pattern using the r80 method suggested by Paris et al., I re-ran the populations module with the following script, 2_7_populations_error_r80.sh.

#!/bin/bash

#PBS -N populations-array
#PBS -l ncpus=8
#PBS -l mem=4g
#PBS -J 1-29

##### Change Directory #####
cd ~/ProjectDirectory/1_DataPrep

##### Obtain parameters from par file #####
parameters=`sed -n "${PBS_ARRAY_INDEX} p" 0_metadata/2_3_cstacks_array`

parameterArray=($parameters)

PAR=${parameterArray[0]}
VALUE=${parameterArray[1]}

##### Load stacks #####
module load stacks/2.55

log_file=2_7_populations_r80.$PAR.$VALUE.oe
files_dir=2_error/2_ustacks/parTest.$PAR$VALUE/
out_dir=2_error/7_sites/parTest.$PAR$VALUE

##### run populations #####
populations -P $files_dir -M 0_metadata/0_popfile_pops_error --out-path $out_dir -t 8 -r 0.8 --vcf --genepop

Let’s extract the number-of-sites info as we did above:

# create the report file with header
echo -e "parTest pop all variant polymorphic" > 2_7_error_sites_r80.txt

#loop through the parameters you explored. In this case M, m and n
for par in M m n
do
#loop through the different values. Note each parameter set is in a folder named as follows: parTest.$par$i, where par comes from the first loop and i from the second
for i in ../2_error/7_sites/parTest.$par*
do
KI=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==1 {print $10}'`
WC=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==2 {print $10}'`
LM=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==3 {print $10}'`
FC=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==4 {print $10}'`
LS=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==5 {print $10}'`
HC=`grep 'all/variant/polymorphic sites:' $i/populations.log | awk 'NR==6 {print $10}'`
echo -e $i" KI "$KI | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 7_sites parTest.//g' >> 2_7_error_sites_r80.txt
echo -e $i" WC "$WC | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 7_sites parTest.//g' >> 2_7_error_sites_r80.txt
echo -e $i" LM "$LM | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 7_sites parTest.//g' >> 2_7_error_sites_r80.txt
echo -e $i" FC "$FC | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 7_sites parTest.//g' >> 2_7_error_sites_r80.txt
echo -e $i" LS "$LS | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 7_sites parTest.//g' >> 2_7_error_sites_r80.txt
echo -e $i" HC "$HC | sed 's/\// /g' | sed 's/;//g' | sed 's/.. 2_error 7_sites parTest.//g' >> 2_7_error_sites_r80.txt
done
done

Now let’s import the resulting file, and see how many markers we obtain across parameter space.

r80.df <- read_delim("../1_stacks/2_error/7_sites/2_7_error_sites_r80.txt",
                       delim = " ")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   parTest = col_character(),
##   pop = col_character(),
##   all = col_double(),
##   variant = col_double(),
##   polymorphic = col_double()
## )
glimpse(r80.df)
## Rows: 174
## Columns: 5
## $ parTest     <chr> "M1", "M1", "M1", "M1", "M1", "M1", "M10", "M10", "M10", "…
## $ pop         <chr> "KI", "WC", "LM", "FC", "LS", "HC", "KI", "WC", "LM", "FC"…
## $ all         <dbl> 1771160, 1651971, 1617345, 2928809, 1329280, 1379102, 1482…
## $ variant     <dbl> 34448, 32658, 34604, 29790, 28802, 28790, 42390, 40587, 42…
## $ polymorphic <dbl> 10679, 10230, 21461, 12622, 14635, 11153, 14440, 13747, 28…

Divide the parTest column into a par column (i.e., M, m and n) and a value column, with values ranging from 1 to 15.

r80.df <- r80.df %>%
  mutate(par = substr(x = parTest, 1, 1)) %>%
  mutate(value = as.double(substr(x = parTest, 2, nchar(parTest))))

We then plot, as we did before:

  • the total number of loci
  • the number of variant loci
  • the number of polymorphic loci

5.6 Deciding a parameter set for the main run

Now, using the information from Tiger in combination with the amount of loci retrieved across parameter space we can decide what parameters to use for our main stacks pipeline:

5.6.1 Number of mismatches between 2 alleles (M)

This parameter controls the number of mismatches (in base pairs) allowed between 2 alleles. Ideally, for a sequence with only 1 true SNP there would be only 1 mismatch. When multiple SNPs are present on a single sequence - let’s say 2 - there can be up to 2 real mismatches. On top of this we have to account for sequencing errors, which will increase the number of mismatches. So, while at first glance this is a simple decision, these layers can make it harder.
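As a worked illustration, consider two alleles of one locus that differ at two positions; a single sequencing error then produces a third apparent mismatch, which would require M >= 3 for the reads to be merged into one locus:

  allele 1:              ACGTACGTACGT
  allele 2:              ACCTACGAACGT  (2 true mismatches)
  allele 2 with 1 error: ACCTACGAACTT  (3 apparent mismatches vs allele 1)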

Looking at the number of variant and polymorphic loci across values of M, we can see that the steepest increase is always observed between 1 and 3; after that, increasing the value of M increases the number of loci only slightly. This is an indication that our ideal value should be somewhere around 3. The number of total loci decreases roughly linearly with M, so this particular metric doesn’t provide much support for either value.

Looking at the error rate from Tiger, we can see that with a genotype read depth < 5 the error rate is nearly always around 0.5 (very high). For M values between 1 and 4, however, the error rate is never above 0.05 for read depths > 5. Because of our average read depth (~10x) per individual, it is likely that we will silence genotypes with a read depth < 5-7, which will take care of those genotypes with a higher error rate. But it also means we can’t/shouldn’t increase our value of M much past 3-4. This is clear when comparing error rates across parameter space before (Fig. x) and after (Fig. x) silencing genotypes with a read depth < 7. Thus, this is another indication that our ideal value of M is somewhere around 3-4. Because the increase in the number of variant/polymorphic loci between 3 and 4 is of small magnitude, we will use the more conservative M value of 3.

5.6.2 Minimum read depth per allele (m)

This parameter defines the minimum number of identical sequences required to create a stack, which will then represent an allele. The default for this parameter is 3, which has been shown to work well across a range of scenarios. Here we see that for m between 2 and 4, the error rate for het calls with a minimum read depth > 5 stays below 0.05, while it increases to 0.5 for depths of 1 to 5. Furthermore, we see that when removing genotype calls with read depth < 7, the error rate stays constant between 2 and 4. Four is a bit more conservative, while still not a huge deviation from the default value of 3. For this reason we use 4 as the value for m in this run.

5.6.3 Number of mismatches between any 2 alleles of the population (n)

This parameter defines the number of mismatches (in base pairs) allowed between any 2 alleles of the population. The standard value is 1. The error rate from Tiger is uninformative here, as the error is calculated within individuals, and changes in calling across individuals are not expected to cause major changes in within-individual error rates (Fig. x). On the other hand, we can see that the number of variant and polymorphic markers increases steadily with larger values of n, as expected, and there doesn’t seem to be any sign of a plateau for polymorphic loci. The rate of increase for variant loci, however, decreases from 3 to 4, while it is at its highest between 1 and 2, and between 2 and 3 (Fig. x). It is important to remember that this dataset includes 3 lineages, with 2 lineages being genetically similar (1% divergence at mitochondrial loci) and the 3rd being up to 15% divergent from the other two at the same loci. It is therefore reasonable to expect more than 1 mismatch within a locus between the more divergent lineages. Thus, we will use an n value of 3.

6. Running stacks with the final parameter set with gapped alignments

Now that we have settled on a parameter set (M=3, m=4, n=3, with everything else left at its default) we can run stacks. As part of a larger project I did so for more linkage map families and individuals than are included in the current study, but here I report only the one Litoria serrata linkage map family (HC family) and a subset of sexed Litoria serrata individuals from McKnight et al. (2019).

6.1. ustacks

The first thing to do is to run ustacks on all samples. To do so, I produced the script 3_1_ustacks_M3m4.sh.

#!/bin/bash

#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -l mem=4gb
#PBS -l walltime=999:00:00

#load the stacks module
module load stacks/2.55

#cd into the desired directory
cd ~/ProjectDirectory/1_DataPrep

sample_name=`echo $sample | sed -e 's#1_cleaned/##g' -e 's/.fq.gz//g'`
log_file=./0_logfiles/3_1_ustacks/${sample_name}.ustacks.oe

ustacks -f $sample -i $sample_index -o ./3_calling/1_ustacks/ -M 3 -m 4 &> $log_file

I then submitted it with the following command from the 1_DataPrep directory.

sample_index=0
for sample in ./1_cleaned/*
do
  sample_index=$(expr $sample_index + 1)
  qsub -v "sample=$sample, sample_index=$sample_index" ./0_scripts/3_1_ustacks_M3m4.sh
done

This script, and the way it is submitted, will produce one job per individual. Even when running ~1500 individuals, submitted at night with only ~10 other jobs running on the JCU HPC, all the jobs complete in around 4-5 hours.
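Before scanning the logs, a quick optional check that every sample produced output: ustacks writes three files per sample (.tags.tsv.gz, .snps.tsv.gz and .alleles.tsv.gz), so the two counts below should match.

#count ustacks outputs and fastq inputs; the numbers should be equal
ls 3_calling/1_ustacks/*.tags.tsv.gz | wc -l
ls 1_cleaned/*.fq.gz | wc -l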

Finally, to ensure that all jobs ran successfully, I use the below command to scan for error messages in the log files. Just make sure that you are looping through all your ustacks logfiles, in my case found within the 0_logfiles/3_1_ustacks folder.

for log in 0_logfiles/3_1_ustacks/*
do
grep -iE "\b(err|e:|warn|w:|fail|abort)" $log
done

If no message is printed to the console it means that no error message was found in the logfiles. You are then good to proceed to the cstacks step.

6.2. cstacks

For cstacks, we have selected n=3, with everything else left at its default. Thus, for this experiment, our selected value of n matches the value of M, which is what the stacks developers recommend; this further supports our chosen values of M and n. Now, because we are including a linkage family, and because we have a huge number of samples, we will use only a subset of these samples to build the catalogue. For the linkage family we will use only the parents, while for the adults we will use a representative 10% of samples.

First, let’s list all samples with the following, and save to a file called 3_2_popfile_cstacks.

echo 1_cleaned/*fq.gz | sed -e 's#1_cleaned/##g' -e 's/.fq.gz /\tall\n/g' -e 's/.fq.gz/\tall/g' > 0_metadata/3_2_popfile_cstacks

Then I manually removed the individuals that I did not want to retain. File size and number of reads are pretty consistent within batches and groups, so choosing 10% of samples per group, and only the parents for linkage maps, should produce a representative catalogue with minimal error.
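I did this selection by hand; purely as an illustrative alternative, a reproducible draw could look like the sketch below (the _FAT_/_MOT_ and _PG_ name patterns are assumptions based on the sample IDs used in this project):

#keep all linkage-map parents
grep -E '_(FAT|MOT)_' 0_metadata/3_2_popfile_cstacks > 0_metadata/3_2_popfile_cstacks_parents
#add a random 10% of the population-genomics adults
grep '_PG_' 0_metadata/3_2_popfile_cstacks | shuf | \
  head -n $(( $(grep -c '_PG_' 0_metadata/3_2_popfile_cstacks) / 10 )) \
  >> 0_metadata/3_2_popfile_cstacks_parents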

We then run the 3_2_cstacks_n3.sh script:

#!/bin/bash

#PBS -N cstacks
#PBS -l ncpus=30
#PBS -l mem=15g
#PBS -l walltime=999:00:00

##### Change Directory #####
cd ~/ProjectDirectory/1_DataPrep

##### Load stacks #####
module load stacks/2.55

log_file=3_2_cstacks.n3.oe
files_dir=3_calling/1_ustacks/

##### run cstacks #####
cstacks -n 3 -P $files_dir -M 0_metadata/3_2_popfile_cstacks_parents -o $files_dir -p 30 > ./0_logfiles/$log_file

6.3. sstacks to gstacks

Finally, we can run the stacks modules sstacks, tsv2bam and gstacks. To do so I use the script 3_3_sstacks.sh first.

#!/bin/bash

#PBS -N sstacks
#PBS -l ncpus=40
#PBS -l mem=20g
#PBS -l walltime=999:00:00

##### Change Directory #####
cd ~/ProjectDirectory/1_DataPrep

##### Load stacks #####
module load stacks/2.55

log_file=3_3_sstacks.oe
files_dir=3_calling/1_ustacks/

##### run sstacks #####
sstacks -P $files_dir -M 0_metadata/3_3_popfile_sstacks_all -p 40

##### run tsv2bam #####
tsv2bam -P $files_dir -M 0_metadata/3_3_popfile_sstacks_all -t 40

And then the 3_4_gstacks.sh script.

#!/bin/bash

#PBS -N gstacks
#PBS -l ncpus=40
#PBS -l mem=100g
#PBS -l walltime=999:00:00

##### Change Directory #####
cd ~/ProjectDirectory/1_DataPrep

##### Load stacks #####
module load stacks/2.55

log_file=3_4_gstacks.oe
files_dir=3_calling/1_ustacks/

##### increase max number of files that can be opened #####
ulimit -n 4096

##### run gstacks #####
gstacks -P $files_dir -M 0_metadata/3_3_popfile_sstacks_all -t 40

7. Catalogue summary and conclusion

I now have a catalogue of loci, which includes samples from all Litoria serrata/myola individuals sequenced so far.

From this catalogue, gstacks genotyped 197663 loci, with an effective per-sample mean coverage of 12.8x (stdev=2.2x, min=7.0x, max=34.6x). Let’s look at the coverage per sample in more detail. I extracted the per-sample coverage using the following command:

sed -n '/^BEGIN effective_coverages_per_sample/,/END effective_coverages_per_sample/p' ./3_calling/1_ustacks/gstacks.log.distribs  | sed '1d;$d' > ./3_calling/3_summary/3_4_gstacks_effectiveCoveragePerSample.txt

Now let’s import the resulting file and plot per sample effective coverage

library(tidyverse)

meanCov <- read_delim("../1_stacks/3_calling/3_summary/3_4_gstacks_effectiveCoveragePerSample.txt", delim = "\t", skip = 2)
glimpse(meanCov)
## Rows: 1,463
## Columns: 5
## $ sample          <chr> "LM_LM_KI_001", "LM_LM_KI_002", "LM_LM_KI_003", "LM_LM…
## $ n_loci          <dbl> 36708, 36489, 36916, 32214, 37535, 38246, 39080, 38244…
## $ n_used_fw_reads <dbl> 492304, 486476, 498638, 385777, 512550, 540758, 553622…
## $ mean_cov        <dbl> 13.411, 13.332, 13.507, 11.975, 13.655, 14.139, 14.166…
## $ mean_cov_ns     <dbl> 16.161, 16.083, 16.323, 13.889, 16.604, 17.328, 17.442…

Let’s add a column describing the project each sample comes from:

meanCov <- meanCov %>%
  mutate("proj"=substr(sample,4,5))

Let’s see which samples have the highest coverage.

meanCov %>%
  select(-n_used_fw_reads) %>%
  arrange(-mean_cov_ns) %>%
  head(n=10) %>%
  print.data.frame()
##              sample n_loci mean_cov mean_cov_ns proj
## 1   LM_LM_KI_MOT_F1  59765   22.481      34.589   LM
## 2   LN_LM_FC_MOT_F1  81895   20.126      33.383   LM
## 3   LS_LM_HC_MOT_F1  56357   21.047      31.064   LM
## 4   LS_LM_HC_FAT_M1  54319   19.129      27.570   LM
## 5   LM_LM_WC_FAT_M1  48028   17.575      23.875   LM
## 6   LM_LM_KI_FAT_M1  49160   17.119      23.496   LM
## 7   LM_LM_WC_MOT_F1  43286   17.328      22.536   LM
## 8      LS_LM_HC_311  20995   17.399      21.487   LM
## 9   LN_LM_FC_FAT_M1  54163   14.728      19.607   LM
## 10 LM_PG_WC_27_M1_b  37286   15.535      19.269   PG

We can see that 8 out of the 10 samples with the highest coverage are linkage mapping parents, for whom we merged the independent DArT replicate sequencing results precisely to obtain higher coverage and thus higher confidence in their genotypes. For the offspring, confidence will come from their large numbers per family, by looking at Mendelian segregation patterns and whether they match the parent genotypes. Thus, the high ‘outliers’ in terms of mean effective coverage should not be an issue. Now let’s look at the other end of the spectrum.

meanCov %>%
  select(-n_used_fw_reads) %>%
  filter(proj == "LM") %>%
  arrange(mean_cov_ns) %>%
  head(n=10) %>%
  print.data.frame()
##            sample n_loci mean_cov mean_cov_ns proj
## 1  LS_LM_HC_148_b  14058    8.927       9.382   LM
## 2    LS_LM_HC_041  19164    9.029       9.635   LM
## 3  LS_LM_HC_275_b  15477    9.172       9.637   LM
## 4  LS_LM_HC_274_b  17889    9.050       9.649   LM
## 5  LS_LM_HC_272_b  18566    9.159       9.774   LM
## 6  LS_LM_HC_118_b  16514    9.221       9.800   LM
## 7  LM_LM_KI_021_b  21563    9.016       9.809   LM
## 8  LS_LM_HC_270_b  19578    9.212       9.909   LM
## 9  LS_LM_HC_263_b  19023    9.377      10.020   LM
## 10 LS_LM_HC_271_b  19051    9.372      10.042   LM

The linkage mapping samples all have good coverage and number of loci. Let’s look at the correlation between genotyped loci and coverage.

meanCov %>%
  arrange(n_loci) %>%
  filter(proj == "DM" | proj == "LM") %>%
  ggplot() +
  geom_point(aes(x=n_loci, y=mean_cov_ns, colour=proj)) +
  geom_segment(aes(x=30000,xend=60000,y=22,yend=12),colour="red") +
  ylab("Mean coverage") +
  xlab("Number of genotyped loci") +
  theme_classic()

The DM samples are from McKnight et al. and are used for the sex-linkage analyses. The cluster on the bottom right is from another linkage mapping family, for a different lineage of Litoria serrata, not used in this study. Those samples were sequenced ~2 years after the others, so the difference in their number-of-loci vs coverage ratio is probably a result of changes in the technology. As we are not using them for this study, we don’t have to worry about it. The 8 outliers on the top right of the red line are all linkage mapping parents, which were sequenced in 2+ libraries and thus have both higher coverage and a higher number of loci. TL;DR: all looks good here.

Let’s produce summary statistics for our linkage mapping family only and the sexed individuals for the sex-linkage analyses.

paperCov <- meanCov %>%
  filter(str_detect(sample, "LS_LM_HC|LS_DM_"))

Then plot it again as before

paperCov %>%
  arrange(n_loci) %>%
  filter(proj == "DM" | proj == "LM") %>%
  ggplot() +
  geom_point(aes(x=n_loci, y=mean_cov_ns, colour=proj)) +
  geom_segment(aes(x=30000,xend=60000,y=22,yend=12),colour="red") +
  ylab("Mean coverage") +
  xlab("Number of genotyped loci") +
  theme_classic()

And calculate mean coverage statistics

summary(paperCov %>% filter(proj=="LM") %>% pull(mean_cov_ns))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.382  12.379  13.093  13.320  14.129  31.064
sd(paperCov %>% filter(proj=="LM") %>% pull(mean_cov_ns))
## [1] 1.997792
summary(paperCov %>% filter(proj=="DM") %>% pull(mean_cov_ns))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.619  10.631  10.987  11.382  11.491  16.209
sd(paperCov %>% filter(proj=="DM") %>% pull(mean_cov_ns))
## [1] 1.465272

We have thus concluded the catalogue building with stacks, after exploring parameter space and the resulting error rates. In the next step we will conduct QC filtering (mostly Call Rate and MAF).