In the file 1_DataPrep_Stacks.Rmd I produced a catalogue of loci from raw fastQ files for all Litoria serrata/myola samples we have sequences thus far.
In the file 2_DataPrep_QCFiltering.Rmd I filtered individuals by call rate, retained only one individual per replicate pair, and then filtered markers by call rate.
Here I will assess segregation distortion in the linkage mapping family only. I will assign a segregation to each marker by comparing the genotype ratios observed in each offspring to those expected for different segregation types. Then, using the known parent genotypes and the segregation observed and assigned in the offspring I will further filter the markers using mendelian segregation expectations.
We have counted 0s, 1s and 2s (i.e., AA, AB, BB) genotypes for each SNP to be able to compare expected to observed genotype proportions.
Now we want to define all possible segregations. For bi-allelic markers, six possible combination of parent getnotypes are possible:
To assess what segregation each marker comes from, I will compare the observed genotype proportions to the expected proportions using a chi-square test.
First, I need to define the proportion of genotypes for all 7 possible segregations. Because chi-square cannot take 0s, I added a modest possible error rate (from 0.005 to 0.2, according to the segregation). I identified these proportions from a lot of attempts, seeing what markers would be successfully identified.
Let’s load a few packages we will need throughout
library(tidyverse)
library(knitr)
library(pander)
Basically, for each marker I calculate a p-value to state whether the observed segregation is significantly different from the expected segregation. I did so for all possible 7 segregation types for each marker.
So, for instance, let’s assume I have a marker, markerID_1, which was successfully genotyped in 120 offspring. Let’s assume that this marker has 30 AA offspring, 60 AB offspring and 30 BB offspring, with the respective proportions of 0.25 AA, 0.50 AB and 0.25 BB. I would expect p-values as follows:
# simple table creation here | MarkerID | AA | AB | BB | AAxAA | … | ABxAB | ABxAA | |—————|:—:|:—:|:—:|:——:|:—:|:——:|:——:| | markerID_1 | 30 | 60 | 30 | < 0.01 | … | 0.06 | < 0.01 | | markerID_2 | 60 | 60 | 0 | < 0.01 | … | < 0.01 | 0.06 |
This is an idealised example, but you’ll see that it is usually pretty clear cut. I will use as input a table with markerID, and counts of the three possible genotypes (AA, AB, BB) and add a column for each segregation with the correspective p-value. I would then expect p-values below a certain threshold when observed and expected counts are very different, or above a certain threshold when they are very similar and thus likely the same segregation. For instance for markerID_1 we are pretty confident that it comes from an ABxAB segregation.
I run this step on the HPC as I need a lot of cores for it to run in less than many days. To do so I use an R script called 3_AssignSegregation.R, which i submit through qsub
with the shell script 3_AssignSegregation.sh.
This is the R script:
#read in required packages
library(tidyverse)
library(parallel)
#set working directory
setwd("~/ProjectDirectory/1_DataPrep/")
#read in input RData file
metadata <- readRDS("1_3_FilteredMarkerData_metadata.RData")
#select columns of interest
temp <- lapply(metadata, function(x){x %>% select(markerID,Zero:Two)})
#create cluster of 40 cores
cl <- parallel::makeCluster(40)
#calculate chi.square for each possible segregation
fam_seg <- lapply(temp, function(d) {
AAxBB <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.005,0.99,0.005),simulate.p.value = T)$p.value)
AAxAA <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.99,0.005,0.005),simulate.p.value = T)$p.value)
BBxBB <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.005,0.005,0.99),simulate.p.value = T)$p.value)
NULLS <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.49, 0.02, 0.49),simulate.p.value = T)$p.value)
ABxAB <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.25, 0.5, 0.25),simulate.p.value = T)$p.value)
ABxAA <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.49, 0.49, 0.02),simulate.p.value = T)$p.value)
ABxBB <- parApply(cl, d[,2:4], 1, function(x) chisq.test(x=x,p=c(0.02, 0.49, 0.49),simulate.p.value = T)$p.value)
d <- cbind(d,AAxBB,AAxAA,BBxBB,NULLS,ABxAB,ABxAA,ABxBB)
return(d)
})
#stop the cluster
parallel::stopCluster(cl)
#save the results of chi square
saveRDS(fam_seg,file = "3_1_FilteredMarkerData_metadata.RData")
And this is the shell script I use to submit the R script:
#!/bin/bash
#PBS -N segCount
#PBS -l ncpus=40
#PBS -l mem=20g
#PBS -l walltime=24:00:00
#load R 4.0.3
#R 4.0.3 is available through a conda3 environment at the JCU HPC
module load anaconda3
source $CONDA_PROF/conda.sh
conda activate R-403
#set directory to linkage mapping folder
cd ~/ProjectDirectory/1_DataPrep/
#run R script
#this script will save the result within the 1_filtering folder
R -f ./0_Rscripts/3_FilteringByMendelianInheritance.R
#close conda environment
conda deactivate
Now, using the calculated chi-square p-value, I will assign a segregation to each marker. I will do so by assigning a segregation only if one of the 7 comparisons returned a non-significant result, as in, the assessed marker is statistically indistinguishable from only 1 segregation. To do this though, I need to identify an ideal alpha value.
First, I make sure I have a clear working environment
rm(list = ls())
Load packages
#read in required packages
library(tidyverse)
Then, I can read in the resulting file from the previous step: 3_1_FilteredMarkerData_metadata.RData.
metadata <- readRDS("../1_DataPrep/3_1_FilteredMarkerData_metadata.RData")
I also read initially the old metadata files to check everything worked fine
old.met <- readRDS("../1_DataPrep/2_3_FilteredMarkerData_metadata_HC.RData")
Let’s compare the number of rows to check they match.
nrow(metadata)==nrow(old.met)
## [1] TRUE
Now, let’s see if there’s many SNPs with more than 1 non-significant p-values. Those SNPs would be very similar to more than 1 segregation, and it would thus be hard/not possible to reliably assign them to only 1 segregation.
temp <- metadata %>%
mutate(Above = rowSums(select(., AAxBB:ABxBB)>0.0005))
table(temp$Above)
##
## 0 1
## 3808 12537
There are no markers that would be assigned to more than 1 segregation, while 3808 are assigned to no segregation at all.
Let’s look at markers with no assigned segregation.
head(temp %>%
select(markerID:Two,Above) %>%
filter(Above == 0))
## markerID Zero One Two Above
## 1 44_45 71 74 52 0
## 2 49_9 91 70 33 0
## 3 49_43 63 134 0 0
## 4 72_21 68 72 70 0
## 5 107_17 137 63 0 0
## 6 110_40 106 87 41 0
Visually inspecting these, they seem to be mostly highly distorted markers (e.g., 2:2:1|1:1:1|1:2:0, which is not expected from any possible parent pair). To try and get a better feel for it, I will check a few extra things.
To do this, we first have to create a column, Segregation, assigning the segregation, and giving a “NoSeg” when no segregation was assigned.
#select only column with p-values from chi-square
y <- temp %>% select(AAxBB:ABxBB)
#assign segregation at different alpha values
Seg0005 <- apply(y, 1, function(x) {
ifelse(any(x > 0.0005), names(x)[x > 0.0005], "NoSeg") })
Seg001 <- apply(y, 1, function(x) {
ifelse(any(x > 0.001), names(x)[x > 0.001], "NoSeg") })
Seg005 <- apply(y, 1, function(x) {
ifelse(any(x > 0.005), names(x)[x > 0.005], "NoSeg") })
Seg01 <- apply(y, 1, function(x) {
ifelse(any(x > 0.01), names(x)[x > 0.01], "NoSeg") })
Seg05 <- apply(y, 1, function(x) {
ifelse(any(x > 0.05), names(x)[x > 0.05], "NoSeg") })
#bind all together into one dataframe
fam_seg_all <- cbind(temp,Seg0005,Seg001,Seg005,Seg01,Seg05)
#remove unnecessary files
rm(y)
Now, let’s look at the extremes, in terms of genotype counts, for each segregation.
First we look at ABxAA markers with the most BB genotypes.
head(fam_seg_all %>%
filter(Seg0005 == "ABxAA") %>%
arrange(desc(Two))
)
## markerID Zero One Two AAxBB AAxAA BBxBB NULLS
## 1 15369_60 142 140 16 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 2 22715_26 136 134 16 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 3 737_23 144 132 15 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 4 8483_29 131 96 15 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 5 17054_56 115 88 14 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 6 27174_57 135 90 14 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## ABxAB ABxAA ABxBB Above Seg0005 Seg001 Seg005 Seg01
## 1 0.0004997501 0.0009995002 0.0004997501 1 ABxAA NoSeg NoSeg NoSeg
## 2 0.0004997501 0.0014992504 0.0004997501 1 ABxAA ABxAA NoSeg NoSeg
## 3 0.0004997501 0.0019990005 0.0004997501 1 ABxAA ABxAA NoSeg NoSeg
## 4 0.0004997501 0.0009995002 0.0004997501 1 ABxAA NoSeg NoSeg NoSeg
## 5 0.0004997501 0.0014992504 0.0004997501 1 ABxAA ABxAA NoSeg NoSeg
## 6 0.0004997501 0.0009995002 0.0004997501 1 ABxAA NoSeg NoSeg NoSeg
## Seg05
## 1 NoSeg
## 2 NoSeg
## 3 NoSeg
## 4 NoSeg
## 5 NoSeg
## 6 NoSeg
There are a maximum of 16 BB genotypes out of ~300 individuals, for an individual error rate of ~5%. At this stage I consider this acceptable.
Let’s now look at how many of the ABxAA markers were assigned to more than 1 segregation with an alpha value of 0.0001.
head(fam_seg_all %>%
mutate(Above = rowSums(
select(., AAxBB:ABxBB)>0.0001)) %>%
filter(Seg0005 == "ABxAA" & Above > 1) %>%
arrange(Two)
)
## markerID Zero One Two AAxBB AAxAA BBxBB NULLS
## 1 1_21 143 140 0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 2 42_37 145 157 0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 3 42_58 158 143 0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 4 42_63 158 143 0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 5 55_31 110 102 0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## 6 63_62 133 151 0 0.0004997501 0.0004997501 0.0004997501 0.0004997501
## ABxAB ABxAA ABxBB Above Seg0005 Seg001 Seg005 Seg01 Seg05
## 1 0.0004997501 0.05347326 0.0004997501 7 ABxAA ABxAA ABxAA ABxAA ABxAA
## 2 0.0004997501 0.03048476 0.0004997501 7 ABxAA ABxAA ABxAA ABxAA NoSeg
## 3 0.0004997501 0.02498751 0.0004997501 7 ABxAA ABxAA ABxAA ABxAA NoSeg
## 4 0.0004997501 0.03398301 0.0004997501 7 ABxAA ABxAA ABxAA ABxAA NoSeg
## 5 0.0004997501 0.10344828 0.0004997501 7 ABxAA ABxAA ABxAA ABxAA ABxAA
## 6 0.0004997501 0.03048476 0.0004997501 7 ABxAA ABxAA ABxAA ABxAA NoSeg
We can see that if we were to use an alpha value of 0.0001 markers that clearly come from an ABxAA segregation would not be identified. Let’s try and alpha value of 0.0005.
head(fam_seg_all %>%
mutate(Above = rowSums(
select(., AAxBB:ABxBB)>0.0005)) %>%
filter(Seg0005 == "ABxAA" & Above > 1) %>%
arrange(Two)
)
## [1] markerID Zero One Two AAxBB AAxAA BBxBB NULLS
## [9] ABxAB ABxAA ABxBB Above Seg0005 Seg001 Seg005 Seg01
## [17] Seg05
## <0 rows> (or 0-length row.names)
With this alpha value no markers are assigned to more than 1 segregation. This is a good result, as it means that for this category of informative markers we can identify them reliably with an alpha value of 0.0005.
But, let’s look if the markers we identified as ABxAA have similar to the expected genotype counts (i.e., 1AA:1AB:0BB). Let’s look at ABxAA markers more likely to be wrong, those with the lowest p-values.
head(fam_seg_all %>%
mutate(Above = rowSums(
select(., AAxBB:ABxBB)>0.0005)) %>%
filter(Seg0005 == "ABxAA") %>%
filter(Above > 0) %>%
arrange(ABxAA) %>%
select(markerID:Two,ABxAA)
)
## markerID Zero One Two ABxAA
## 1 525_57 126 66 2 0.0009995002
## 2 527_20 144 76 0 0.0009995002
## 3 744_50 147 87 3 0.0009995002
## 4 1006_41 78 127 0 0.0009995002
## 5 2013_49 153 79 0 0.0009995002
## 6 2187_61 144 80 11 0.0009995002
We can see that there is some segregation distortion (e.g., 525_57 has a ratio of 2AA:1AB:0BB), with some erroneous individuals. Because of the low number of erroneous individuals we will retain this markers for now, as they are likely to be the result of segregation distortion more than sequencing and calling error.
Nevertheless, let’s see what they would look like if we used a more conservative alpha value.
head(fam_seg_all %>%
mutate(Above = rowSums(
select(., AAxBB:ABxBB)>0.001)) %>%
filter(Seg0005 == "ABxAA") %>%
filter(Above > 0) %>%
arrange(ABxAA) %>%
select(markerID:Two,ABxAA)
)
## markerID Zero One Two ABxAA
## 1 107_40 74 119 0 0.00149925
## 2 604_27 160 100 4 0.00149925
## 3 613_53 141 92 0 0.00149925
## 4 1038_40 169 116 11 0.00149925
## 5 1114_15 106 154 0 0.00149925
## 6 2067_56 82 143 1 0.00149925
The ratio of AA to AB is closer to 1:1, but we have a similar amount of erroneous genotypes (BB individuals).
For these markers 0.001 might be too conservative. Let’s see how many fall below that threshold.
nrow(fam_seg_all %>%
filter(Seg0005 == "ABxAA") %>%
filter(Above > 0) %>%
filter(ABxAA <= 0.001)
)
## [1] 129
Not too many. 0.001 might be an all right threshold for ABxAA markers to remove markers that are too distorted while retaining the most markers possible. Let’s look at how many markers we would lose with a more conservative alpha value of 0.005.
nrow(fam_seg_all %>%
filter(Seg0005 == "ABxAA") %>%
filter(Above > 0) %>%
filter(ABxAA <= 0.005)
)
## [1] 641
Nearly 6 times as much, so possibly too conservative.
Null alleles are those where one of the two parents doesn’t get sequenced (possibly because of a mutation at the cut site). The other parent in question has an AB genotype. All offspring end up as either AA or BB. These markers can still be used for linkage mapping, if we consider the – parent homozygote. To make it easy, and because all information comes from the heterozygote parent anyway, we can consider the – parent as AA, and thus all A- offspring become AA, and all B- offspring become AB.
Let’s look at markers identified as NULLS with an alpha value of 0.0005, ordering them by how many incorrect AB individuals there are. NOTE that for these markers the AB individuals might in fact be correct and be the ones for which both alleles could be sequenced despite the mutation at the restriction site (if this is the mechanism originating them.)
head(fam_seg_all %>%
mutate(Above = rowSums(
select(., AAxBB:ABxBB)>0.0005)) %>%
filter(Seg0005 == "NULLS") %>%
arrange(desc(One))
)
## markerID Zero One Two AAxBB AAxAA BBxBB NULLS
## 1 8401_50 126 15 117 0.0004997501 0.0004997501 0.0004997501 0.0009995002
## 2 56458_19 143 12 105 0.0004997501 0.0004997501 0.0004997501 0.0019990005
## 3 7399_44 122 10 116 0.0004997501 0.0004997501 0.0004997501 0.0584707646
## 4 21073_27 144 10 106 0.0004997501 0.0004997501 0.0004997501 0.0064967516
## 5 5670_52 114 9 105 0.0004997501 0.0004997501 0.0004997501 0.0814592704
## 6 78356_61 124 9 104 0.0004997501 0.0004997501 0.0004997501 0.0659670165
## ABxAB ABxAA ABxBB Above Seg0005 Seg001 Seg005 Seg01
## 1 0.0004997501 0.0004997501 0.0004997501 1 NULLS NoSeg NoSeg NoSeg
## 2 0.0004997501 0.0004997501 0.0004997501 1 NULLS NULLS NoSeg NoSeg
## 3 0.0004997501 0.0004997501 0.0004997501 1 NULLS NULLS NULLS NULLS
## 4 0.0004997501 0.0004997501 0.0004997501 1 NULLS NULLS NULLS NoSeg
## 5 0.0004997501 0.0004997501 0.0004997501 1 NULLS NULLS NULLS NULLS
## 6 0.0004997501 0.0004997501 0.0004997501 1 NULLS NULLS NULLS NULLS
## Seg05
## 1 NoSeg
## 2 NoSeg
## 3 NULLS
## 4 NoSeg
## 5 NULLS
## 6 NULLS
We have one marker with an ‘error rate’ (used loosely here) of 5%, while all other markers have an error rate < 3-4 %.
Furthermore, it seems that if we used an alpha of 0.005 instead of 0.0005 we would filter markers with more than 10 individuals being genotyped as AB (of whome we would expect 0 for these markers).
head(fam_seg_all %>%
mutate(Above = rowSums(
select(., AAxBB:ABxBB)>0.005)) %>%
filter(Seg0005 == "NULLS") %>%
filter(NULLS > 0.005) %>%
arrange(desc(One))
)
## markerID Zero One Two AAxBB AAxAA BBxBB NULLS
## 1 7399_44 122 10 116 0.0004997501 0.0004997501 0.0004997501 0.058470765
## 2 21073_27 144 10 106 0.0004997501 0.0004997501 0.0004997501 0.006496752
## 3 5670_52 114 9 105 0.0004997501 0.0004997501 0.0004997501 0.081459270
## 4 78356_61 124 9 104 0.0004997501 0.0004997501 0.0004997501 0.065967016
## 5 13527_8 143 7 112 0.0004997501 0.0004997501 0.0004997501 0.110444778
## 6 21772_25 115 7 99 0.0004997501 0.0004997501 0.0004997501 0.257371314
## ABxAB ABxAA ABxBB Above Seg0005 Seg001 Seg005 Seg01
## 1 0.0004997501 0.0004997501 0.0004997501 1 NULLS NULLS NULLS NULLS
## 2 0.0004997501 0.0004997501 0.0004997501 1 NULLS NULLS NULLS NoSeg
## 3 0.0004997501 0.0004997501 0.0004997501 1 NULLS NULLS NULLS NULLS
## 4 0.0004997501 0.0004997501 0.0004997501 1 NULLS NULLS NULLS NULLS
## 5 0.0004997501 0.0004997501 0.0004997501 1 NULLS NULLS NULLS NULLS
## 6 0.0004997501 0.0004997501 0.0004997501 1 NULLS NULLS NULLS NULLS
## Seg05
## 1 NULLS
## 2 NoSeg
## 3 NULLS
## 4 NULLS
## 5 NULLS
## 6 NULLS
Let’s look at how easily NULLS markers are identified by assessing to how many segregations they would be assigned at an alpha value of 0.0005.
head(fam_seg_all %>%
mutate(Above = rowSums(
select(., AAxBB:ABxBB)>0.0005)) %>%
filter(Seg0005 == "NULLS" & Above > 1) %>%
arrange(One)
)
## [1] markerID Zero One Two AAxBB AAxAA BBxBB NULLS
## [9] ABxAB ABxAA ABxBB Above Seg0005 Seg001 Seg005 Seg01
## [17] Seg05
## <0 rows> (or 0-length row.names)
With an alpha value of 0.0005 no NULL markers are assigned to more than one segregation.
Let’s look at NULL alleles with the lowest p-values.
head(fam_seg_all %>%
mutate(Above = rowSums(
select(., AAxBB:ABxBB)>0.0005)) %>%
filter(Seg0005 == "NULLS") %>%
filter(Above > 0) %>%
arrange(NULLS) %>%
select(markerID:Two,NULLS)
)
## markerID Zero One Two NULLS
## 1 8401_50 126 15 117 0.0009995002
## 2 51449_58 131 1 70 0.0009995002
## 3 27575_56 141 0 90 0.0014992504
## 4 56458_19 143 12 105 0.0019990005
## 5 40616_56 121 2 74 0.0029985007
## 6 6575_8 149 7 98 0.0039980010
These include the two markers with more than 10 erroneous genotypes. Having an alpha value of 0.005 in this case would filter those two markers out.
nrow(fam_seg_all %>%
filter(Seg0005 == "NULLS") %>%
filter(Above > 0) %>%
filter(NULLS <= 0.005)
)
## [1] 7
Further, it would remove only 7 markers total, so a = 0.005 is probably a good threshold for NULLS markers.
Once again, let’s start by seeing if these markers are consistently assigned to one segregation only at a non-conservative alpha value of 0.0005.
head(fam_seg_all %>%
mutate(Above = rowSums
(select(., AAxBB:ABxBB)>0.0005)) %>%
filter(Seg0005 == "ABxAB" & Above > 1)
)
## [1] markerID Zero One Two AAxBB AAxAA BBxBB NULLS
## [9] ABxAB ABxAA ABxBB Above Seg0005 Seg001 Seg005 Seg01
## [17] Seg05
## <0 rows> (or 0-length row.names)
There are no markers assigned to more than 1 segregation. Good result!
In light of this, let’s check the distribution of segregation with p-value > 0.0005 for more than 1 segregation type.
temp2 <- fam_seg_all %>%
mutate(Above = rowSums(
select(., AAxBB:ABxBB)>0.0005)) %>%
filter(Above > 1)
table(temp2$Seg0005)
## < table of extent 0 >
There are no markers assigned to more than 1 segregation. From applying this workflow to other families with less than ~150 individuals, I know that this is a result of our large number of offspring (~300), as for smaller families I would get ~50 markers at least assigned to more than 1 segregration.
Let’s now try and plot the number of markers assigned to each segregation across different alpha values
alpha <- c(0.0005,0.001,0.005,0.01,0.05)
t1 <- table(fam_seg_all$Seg0005)
t2 <- table(fam_seg_all$Seg001)
t3 <- table(fam_seg_all$Seg005)
t4 <- table(fam_seg_all$Seg01)
t5 <- table(fam_seg_all$Seg05)
t6 <- rbind(t1,t2,t3,t4,t5)
## Warning in rbind(t1, t2, t3, t4, t5): number of columns of result is not a
## multiple of vector length (arg 4)
t6 <- cbind(t6,alpha)
t6[4,] <- c(0,843,8083,1958,5253,208,0.01)
t6[5,] <- c(0,792,3049,1795,10548,161,0.05)
fam_seg_count <- t6 %>% as_tibble %>%
pivot_longer(!alpha, names_to="Seg",values_to="Count")
rm(t1,t2,t3,t4,t5,t6)
And plot it
ggplot(fam_seg_count,
aes(x=alpha, y=Count, colour=Seg)) +
geom_point() + geom_line() +
scale_x_continuous(trans='log10') +
theme_classic()
From this graph it seems obvious that 0.05 is too strict, as > 10,000 markers end up not being assigned to a segregation. I will thus assign a segregation to markers with a threshold of 0.005, which removes most of the markers we found with more than 10 erroneous offsprings while retaining a good amount of markers.
#remove markers assigned to more than 1 segregation
#should be zero. This is a leftover from analysing a larger dataset of 4 families. Still thought I'd leave it
fam_seg <- fam_seg_all %>%
mutate(Above = rowSums(
select(., AAxBB:ABxBB)>0.0005)) %>%
filter(Above < 2)
#now let's assign a segregation only if alpha value is above 0.005
#IMPORTANT: THIS MEANS WE ASSIGN SEGREGATION ONLY IF OBSERVED GENOTYPE PROPORTIONS ARE NOT SIGNIFICANTLY DIFFERENT FROM EXPECTED PROPORTIONS FOR THAT SEGREGATION.
y <- fam_seg %>% select(AAxBB:ABxBB)
Segregation <- apply(y, 1, function(x) {
ifelse(any(x > 0.005), names(x)[x > 0.005], "NoSeg") })
fam_seg <- cbind(fam_seg[,1:11],Segregation)
Let’s look at number of markers per segregation now
table(fam_seg$Segregation)
##
## AAxAA AAxBB ABxAA ABxAB NoSeg NULLS
## 4 863 8607 2005 4652 214
Most markers are uniparental, which are the preferred markers for linkage mapping. 2002 markers come from a double heterozygote cross, 4652 could not be assigned to a segregation and 214 are NULL markers.
Let’s check the extremes for our informative segregation once more, to confirm we only retained markers for which we can confidently assign a segergation.
ABxAA The least AAs:
head(fam_seg %>%
filter(Segregation == "ABxAA") %>%
arrange(Zero) %>%
select(markerID:Two,ABxAA)
)
## markerID Zero One Two ABxAA
## 1 43206_68 73 106 8 0.006996502
## 2 6354_66 77 108 0 0.011994003
## 3 38636_16 77 112 0 0.008995502
## 4 20262_64 78 107 0 0.010994503
## 5 23675_31 78 113 1 0.011994003
## 6 23675_35 78 115 0 0.005497251
Skewed but few (<10 erroneous genotypes).
Now the least ABs
head(fam_seg %>%
filter(Segregation == "ABxAA") %>%
arrange(One) %>%
select(markerID:Two,ABxAA)
)
## markerID Zero One Two ABxAA
## 1 35205_52 114 74 2 0.010994503
## 2 64937_19 116 74 2 0.013493253
## 3 2713_25 113 75 1 0.009995002
## 4 18761_24 121 75 3 0.006996502
## 5 27228_46 116 75 5 0.015492254
## 6 38740_16 116 75 1 0.008495752
Same story, skewed but few (<6) erroneous genotypes.
Now the most BB, which we shold have 0 of.
head(fam_seg %>%
filter(Segregation == "ABxAA") %>%
arrange(desc(Two)) %>%
select(markerID:Two,ABxAA)
)
## markerID Zero One Two ABxAA
## 1 13007_16 148 131 13 0.010494753
## 2 13007_47 148 131 13 0.005497251
## 3 137542_36 156 134 13 0.009495252
## 4 6086_55 126 107 12 0.005497251
## 5 6377_17 138 142 12 0.039480260
## 6 6377_18 138 142 12 0.028985507
A maximum of 13, which is less than 5% error rate. Acceptable. We will filter those genotype calls out later.
ABx– Let’s repeat those stage for ABx– markers
head(fam_seg %>%
filter(Segregation == "NULLS") %>%
arrange(Zero) %>%
select(markerID:Two,NULLS)
)
## markerID Zero One Two NULLS
## 1 30571_49 94 3 95 0.9445277
## 2 30571_66 94 2 98 0.6246877
## 3 84861_58 99 0 101 0.1194403
## 4 33265_46 100 3 105 0.7906047
## 5 210906_31 100 2 115 0.3323338
## 6 210906_32 100 2 116 0.2983508
head(fam_seg %>%
filter(Segregation == "NULLS") %>%
arrange(desc(One)) %>%
select(markerID:Two,NULLS)
)
## markerID Zero One Two NULLS
## 1 7399_44 122 10 116 0.058470765
## 2 21073_27 144 10 106 0.006496752
## 3 5670_52 114 9 105 0.081459270
## 4 78356_61 124 9 104 0.065967016
## 5 13527_8 143 7 112 0.110444778
## 6 21772_25 115 7 99 0.257371314
head(fam_seg %>%
filter(Segregation == "NULLS") %>%
arrange(Two) %>%
select(markerID:Two,NULLS)
)
## markerID Zero One Two NULLS
## 1 33512_30 111 6 85 0.12093953
## 2 83568_57 115 0 89 0.02698651
## 3 33574_28 126 6 90 0.04697651
## 4 83568_25 116 0 90 0.02348826
## 5 3244_44 105 0 91 0.08245877
## 6 13823_51 122 2 91 0.04997501
Maximum of 10 erroneous individuals and genotype ratios very close to 1:0:1 in all 3 extreme cases.
ABxAB And for ABxAB markers.
head(fam_seg %>%
filter(Segregation == "ABxAB") %>%
arrange(Zero) %>%
select(markerID:Two,ABxAB)
)
## markerID Zero One Two ABxAB
## 1 176909_14 38 106 41 0.1459270
## 2 176909_32 38 106 41 0.1279360
## 3 1296_35 41 108 41 0.1854073
## 4 1296_20 42 106 42 0.2893553
## 5 1296_51 42 106 41 0.2538731
## 6 43457_39 44 104 40 0.3153423
head(fam_seg %>%
filter(Segregation == "ABxAB") %>%
arrange(One) %>%
select(markerID:Two,ABxAB)
)
## markerID Zero One Two ABxAB
## 1 90277_30 62 71 54 0.005497251
## 2 79093_58 57 73 56 0.013493253
## 3 4248_51 60 74 56 0.011994003
## 4 962_46 60 75 57 0.007496252
## 5 24044_13 61 75 57 0.009495252
## 6 55621_42 62 75 54 0.006996502
head(fam_seg %>%
filter(Segregation == "ABxAB") %>%
arrange(Two) %>%
select(markerID:Two,ABxAB)
)
## markerID Zero One Two ABxAB
## 1 1335_25 62 98 33 0.014992504
## 2 1335_68 62 95 34 0.013493253
## 3 121317_30 59 105 34 0.028485757
## 4 780_8 62 93 35 0.024487756
## 5 6690_10 66 90 35 0.005497251
## 6 46_28 49 109 37 0.120939530
This approach seems to work well. For now I feel confident that filtering by segregation distortion will deal with most of the dubious markers in these three categories (ABxAA, ABxAB and ABx–).
Let’s define the three categories
SegAss <- c("ABxAA","ABxAB","NULLS")
NoSeg <- c("NoSeg")
UnInf <- c("AAxBB","AAxAA","BBxBB")
Let’s re-display number of markers in each segregation, to check output’s accuracy
table(fam_seg$Segregation)
##
## AAxAA AAxBB ABxAA ABxAB NoSeg NULLS
## 4 863 8607 2005 4652 214
And now let’s save an R object for each category
fam_UnInf <- fam_seg %>% filter(Segregation %in% UnInf)
nrow(fam_UnInf)
## [1] 867
#save(fam_UnInf, file = #"../1_DataPrep/3_2_FilteredMarkerData_Uninformative.RData")
fam_SegAss <- fam_seg %>% filter(Segregation %in% SegAss)
nrow(fam_SegAss)
## [1] 10826
#saveRDS(fam_SegAss, file = #"../1_DataPrep/3_2_FilteredMarkerData_Informative.RData")
fam_NoSeg <- fam_seg %>% filter(Segregation %in% NoSeg)
nrow(fam_NoSeg)
## [1] 4652
#save(fam_NoSeg, file = #"../1_DataPrep/3_2_FilteredMarkerData_NoSegAss.RData")
Done! In this section we have identified an ideal alpha value to distinguish between markers being correctly assigned to a segregation or not, we have then procedeed to assign a segregation and saved three files: one with informative markers (ABxAA, ABxAB and ABx–); one with uninformative markers (AAxAA, BBxBB); and one with markers for which a segregation could not be identified.
In the next step we will silence individual genotypes and remove markers when the offspring and parent genotypes do not match.