In this section am going to filter and correct markers based on the expected mendelian segregations. I am going to do this by: - deleting markers where neither of the two parents matches the segregation observed in the offspring (e.g., the parents are AAxBB and the offspring have a 1AA:2AB:1BB propotion, thus expecting ABxAB parents) - correcting parents if correcting only one parent matches the expected segregation based on the offspring genotypes (e.g., parents are AAxAB and the offspring have a 1AA:2AB:1BB propotion, thus expecting ABxAB parents) - silence erroneour offspring genotypes (e.g., observed proportions match an AAxAB segregation, parents are indeed AAxAB but there are 2% of individuals with BB genotypes.)
First, I make sure I have a clear working environment
rm(list = ls())
And load the required packages
library(tidyverse)
Genotype data
fam_dfs <- readRDS("../1_DataPrep/2_3_FilteredMarkerData_withParents_HC.RData")
Segregation data for informative markers only
fam_SegAss <- readRDS("../1_DataPrep/3_2_FilteredMarkerData_Informative.RData")
First, let’s have a look at the numbers of markers in each
Number of markers in the genotype tibbles
## [1] 16345
Number of markers in the segregation tibbles
## [1] 10826
These are pretty good numbers of informative markers.
Now, let’s join the two dataframes, to retain only the markers in fam_SegAss. We can do this with the inner_join function, which returns all values from x where there are matching values in y.
fam_dfs <- inner_join(fam_SegAss, fam_dfs, by = "markerID")
Before we proceed, let’s check that the new genotype tibbles have the same markers as the segregation tibbles
sum(fam_dfs$markerID == fam_SegAss$markerID) == nrow(fam_SegAss)
## [1] TRUE
Finally, let’s remove the columns with p-values from the chi.square test as we don’t need those anymore. We will keep everything else.
fam_dfs <- fam_dfs %>% select(-c(AAxBB:ABxBB))
And now let’s correct parent genotypes if possible, silencing offsprings with erroneous genotypes, and delete markers where the parent segregation doesn’t match the assigned segregation.
To make my life easier in later if statements, I create a column containing the genotypes for both parents.
fam_dfs <- fam_dfs %>% tidyr::unite("parents",
ends_with("_FAT_M1"),
ends_with("_MOT_F1"),
remove = FALSE)
Now we can start correcting the parents according to the segregation assigned to that marker. There are three informative segregations: ABxAB, ABxAA and NULL alleles.
Let’s start with ABxAB. First, we break down what combinations of parent genotypes are correct, which ones can be corrected, and which ones can not.
Correct combinations
Combinations that CAN be corrected
Combinations that CAN NOT be corrected
Note that for the last group, we could correct them and change both parents to ABxAB, but for this pipeline I decided to be more conservative and retain markers only if I could correct parent pairs by changing one parent only. This is also true for markers when one parent is correct and the other was not genotyped (e.g., 2_-)
Now, if the parent genotypes are correct we won’t do anything, if they can be corrected we correct them, and if they can’t be corrected we filter them out.
First, we create a dataframe of only ABxAB markers
fam_ABxAB <- fam_dfs %>% filter(Segregation == "ABxAB")
Let’s check parent genotypes for such markers
table(fam_ABxAB$parents)
##
## -1_-1 -1_1 0_-1 0_2 1_-1 1_1 2_-1 2_0 2_1
## 268 118 1 34 293 1244 2 44 1
Most parent pairs are already correct (both heterozygous 1), which is a good sign (1244 out of 2005). These are the markers for which we have the most confidence of them being correct. Let’s make a column assigning them the CORR tag, and all other markers the ERR tag.
For the remaining markers, we can see there are 2 categories that contain most of the remaining markers, -1_1 and 1_-1. These represent markers where one parent is heterozygous and the other wasn’t successfully sequenced. We will correct only those ones in this instance, as the other ones that could be corrected represent very small proportions, and thus the risk/return factor is not very good (as in, we would be retaining possibly incorrect markers, but there’s only a few of those so not worth the risk).
Finally, we will leave all markers where neither parent is heterozygous and filter them out.
fam_ABxAB <- fam_ABxAB %>%
#assign tag "CORR" if parents already correct
mutate(tag = ifelse(parents == "1_1", "CORR", "ERR")) %>%
#correct parents from "-1_1" to "1_1"
mutate(parents = case_when(
parents == "-1_1" ~ "1_1",
parents == "1_-1" ~ "1_1",
TRUE ~ parents)) %>%
#filter if parents are not "1_1" at this stage
filter(parents == "1_1") %>%
#split genotype of corrected parents
separate(parents,c("FAT_M1","MOT_F1"),sep = "_") %>%
#remove old uncorrected genotypes
select(-ends_with("_FAT_M1"),-ends_with("_MOT_F1"))
Check how many markers are retained
nrow(fam_ABxAB)
## [1] 1655
Check how many were correct from the start and how many were corrected
table(fam_ABxAB$tag)
##
## CORR ERR
## 1244 411
Correct combinations
In this case, we will retain only markers where the parents already have the right genotype. But we will have to recode them. The -1 parent will become AA, AA offspring will become AB, and BB offspring will stay the same. AB offspring on the other hand will be silenced to -1.
fam_NULLS <- fam_dfs %>% filter(Segregation == "NULLS")
table(fam_NULLS$parents)
##
## -1_-1 -1_1 0_0 1_-1
## 78 64 1 71
Once again, most parents already have the correct genotype (135 out of 214).
Now we can modify parents genotypes from null alleles (AB and “-1” for missing data; “-1_1” and “1_-1”), to genotype calls that can be used for linkage mapping (AB and AA; "1_0). We are basically treating the null allele as an allele. It could be anything, (A, B or even a third allele C). But for simplicity, and because all other SNPs are biallelic, we keep null alleles biallelic too.
fam_NULLS <- fam_NULLS %>%
#retain only ABx-- parents
filter(parents == "1_-1" | parents == "-1_1") %>%
#change from ABx-- to ABxAA
mutate(parents = case_when(
parents == "1_-1" ~ "1_0",
parents == "-1_1" ~ "0_1")) %>%
#separate correct parent genotypes into two separate columns
separate(parents,c("FAT_M1","MOT_F1"),sep = "_") %>%
#remove non-corrected parent genotypes
select(-ends_with("_MOT_F1"),-ends_with("_FAT_M1"))
Then we have to modify the offspring genotypes.
fam_NULLS <- fam_NULLS %>%
#mutate only offspring (all start with LS)
mutate_at(vars(starts_with("LS")),
#change AB to missing
function(d) case_when(d == 1 ~ -1,
#change BB to AB (AA stays the same)
d == 2 ~ 1,
TRUE ~ d)
)
And finally we add a column called tag, to match the tibble for other informative segregations. For null alleles, we retained only markers were both parents were correct, so we give all of them the tag CORR
fam_NULLS <- fam_NULLS %>% mutate(tag = "CORR")
Let’s check how many null markers are left for each family:
nrow(fam_NULLS)
## [1] 135
Not many. This is interesting. When I did this analysis on the genotype data called by dart 1/3 of the markers were null alleles. Processing the raw fastq reads addressed this ‘issue’ in large part. Will need to investigate why there’s this massive difference. I suspect that it is because dart doesn’t use the 3 sequencing runs for the parents to build the ‘stacks’ (i.e., loci). They thus have way lower coverage for many of the markers in the parents and end up losing those markers.
Now onto the last class, ABxAA markers. To start, let’s see the distribution of parent genotypes
fam_ABxAA <- fam_dfs %>% filter(Segregation == "ABxAA")
table(fam_ABxAA$parents)
##
## -1_-1 -1_0 -1_1 -1_2 0_-1 0_0 0_1 0_2 1_-1 1_0 1_1 2_-1 2_0
## 1175 252 244 75 620 12 2715 138 530 2551 11 98 186
Fortunately, most parents are correct (2715 and 2551 out of 8607, or 5266/8607). These will be the markers we will use to build the framework maps most likely. But to try and increase the density of our maps, let’s see what other markers we might be able to correct.
Correct combinations
Combinations that CAN be corrected by modifying parent with missing data
Combinations that COULD be corrected by modifying one parent only
Combinations that can’t be corrected by modifying one parent only
Before we correct them, we will have a look at what proportion the genotypes are in for each of the 6 parent genotype pairs that could be corrected to see how confident we are in their assigned segregation.
head(fam_ABxAA %>%
select(markerID:Two,parents) %>%
filter(parents == "0_-1"))
## markerID Zero One Two parents
## 1 124_59 154 123 1 0_-1
## 2 183_33 133 150 2 0_-1
## 3 217_21 151 145 0 0_-1
## 4 221_65 149 151 0 0_-1
## 5 324_55 158 130 4 0_-1
## 6 324_66 155 131 4 0_-1
head(fam_ABxAA %>%
select(markerID:Two,parents) %>%
filter(parents == "0_-1")) %>%
arrange(desc(Two))
## markerID Zero One Two parents
## 1 324_55 158 130 4 0_-1
## 2 324_66 155 131 4 0_-1
## 3 183_33 133 150 2 0_-1
## 4 124_59 154 123 1 0_-1
## 5 217_21 151 145 0 0_-1
## 6 221_65 149 151 0 0_-1
A maximum of 4 erroneous genotypes out of 300. This is pretty good.
head(fam_ABxAA %>%
select(markerID:Two,parents) %>%
filter(parents == "-1_0"))
## markerID Zero One Two parents
## 1 227_12 156 148 0 -1_0
## 2 227_39 156 148 0 -1_0
## 3 227_68 150 154 0 -1_0
## 4 293_42 141 131 0 -1_0
## 5 334_8 148 153 0 -1_0
## 6 334_17 147 153 0 -1_0
head(fam_ABxAA %>%
select(markerID:Two,parents) %>%
filter(parents == "-1_0")) %>%
arrange(desc(Two))
## markerID Zero One Two parents
## 1 227_12 156 148 0 -1_0
## 2 227_39 156 148 0 -1_0
## 3 227_68 150 154 0 -1_0
## 4 293_42 141 131 0 -1_0
## 5 334_8 148 153 0 -1_0
## 6 334_17 147 153 0 -1_0
Offspring genotypes for -1_0 and 0_-1 markers seem to perfectly match an ABxAA segregation, one of their parents is correct, and the other has a missing genotype. Thus, we will correct those.
head(fam_ABxAA %>%
select(markerID:Two,parents) %>%
filter(parents == "-1_1"))
## markerID Zero One Two parents
## 1 171_49 144 159 0 -1_1
## 2 175_8 135 168 0 -1_1
## 3 210_25 162 144 0 -1_1
## 4 210_54 162 144 0 -1_1
## 5 325_64 156 134 0 -1_1
## 6 668_40 131 144 0 -1_1
head(fam_ABxAA %>%
select(markerID:Two,parents) %>%
filter(parents == "-1_1")) %>%
arrange(desc(Two))
## markerID Zero One Two parents
## 1 171_49 144 159 0 -1_1
## 2 175_8 135 168 0 -1_1
## 3 210_25 162 144 0 -1_1
## 4 210_54 162 144 0 -1_1
## 5 325_64 156 134 0 -1_1
## 6 668_40 131 144 0 -1_1
head(fam_ABxAA %>%
select(markerID:Two,parents) %>%
filter(parents == "1_-1"))
## markerID Zero One Two parents
## 1 42_37 145 157 0 1_-1
## 2 42_58 158 143 0 1_-1
## 3 42_63 158 143 0 1_-1
## 4 63_62 133 151 0 1_-1
## 5 79_31 141 157 0 1_-1
## 6 90_51 156 145 0 1_-1
head(fam_ABxAA %>%
select(markerID:Two,parents) %>%
filter(parents == "1_-1")) %>%
arrange(desc(Two))
## markerID Zero One Two parents
## 1 42_37 145 157 0 1_-1
## 2 42_58 158 143 0 1_-1
## 3 42_63 158 143 0 1_-1
## 4 63_62 133 151 0 1_-1
## 5 79_31 141 157 0 1_-1
## 6 90_51 156 145 0 1_-1
The same goes for 1_-1 and -1_1 markers, so we will correct those as well.
On the other hand, there are no 1_2 and 2_1 markers, and 0_2 and 2_0 markers are few, and we would have to modify a parent call, instead of just calling a missing genotype. For these reasons, we will not correct those markers.
So, first step, let’s correct the parents from 1_-1, -1_1, 0_-1 and -1_0 markers.
fam_ABxAA <- fam_ABxAA %>%
#assign "CORR" tag if parents already correct
mutate(tag = ifelse((parents == "1_0" | parents == "0_1"),
"CORR", "ERR")) %>%
#correct parents where one correct and the other is NA
mutate(parents = case_when(
parents == "-1_0" ~ "1_0",
parents == "0_-1" ~ "0_1",
parents == "-1_1" ~ "0_1",
parents == "1_-1" ~ "1_0",
TRUE ~ parents)) %>%
#retain only correct markers at this stage
filter(parents == "0_1" | parents == "1_0") %>%
#separate parent genotypes
separate(parents,c("FAT_M1","MOT_F1"),sep = "_") %>%
#remove old parent genotypes
select(-ends_with("_MOT_F1"),-ends_with("_FAT_M1"))
Then, let’s silence all erroneous offspring
fam_ABxAA <- fam_ABxAA %>%
#correct only offspring genotypes (they start with "LS")
mutate_at(vars(starts_with("LS")),
#silence BB offspring
function(d) case_when(d == 2 ~ -1,
TRUE ~ d)
)
Check how many markers are retained
nrow(fam_ABxAA)
## [1] 6912
Check how many were correct from the start and how many were corrected
table(fam_ABxAA$tag)
##
## CORR ERR
## 5266 1646
As a final step, let’s join all the informative markers into a single tibble per family and save it.
temp <- rbind(fam_ABxAB, fam_NULLS)
fam_SNP <- rbind(temp, fam_ABxAA)
rm(temp)
Let’s check the number of rows is correct between the individual tibbles and the final merged tibble
nrow(fam_ABxAB) + nrow(fam_NULLS) + nrow(fam_ABxAA)
## [1] 8702
nrow(fam_SNP)
## [1] 8702
Compare to total number of markers to which a segregation was assigned, to assess how many were lost per family
nrow(fam_dfs) - nrow(fam_SNP)
## [1] 2124
Clear the working environment from now unwanted objects
rm(fam_ABxAA, fam_NULLS, fam_ABxAB, fam_dfs, fam_SegAss)
And save the final dataset of informative SNP markers, to which a segregation was successfully assigned, and for which parents were either correct (CORR), or could be corrected (ERR).
#saveRDS(fam_SNP, file = #"../1_DataPrep/4_1_FilteredMarkerData_Informative_HC.RData")