So far in this project we have:
Now we will match the linkage maps we produced to the chromosome level assemblies of a few other Anurans. These will include: - Bufo Bufo (GCA_905171765) - Dendropsophus ebraccatus (GCA_027789725) - Hyla sarda (GCA_029499605)
We will achieve this in 5 steps:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.0.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(pivottabler)
We conducted this step in the file 15_SexLinkedLoci_GWAS to produce a summary tables of all markers included in the study. This is Table S5 in the Supplementary material. The intermediate file, 15_5_All_Loci.fa includes the consensus sequences from all loci in fasta format.
First, we will blast all markers in our catalogue to the fasta sequences for the ref and alternate allele for the sex markers identified in Litoria aurea in Sopniewski et al. 2019.
First, I produce a fasta file containing all sequences (reference and alternate alleles both included) and save it as LitoriaAurea_Markers.fasta.
Then, within the JCU HPRC I create a blast database, and then blast the sequences in 15_5_All_Loci.fa against it.
module load conda3
source $CONDA_PROF/conda.sh
conda activate blast-2.11.0
makeblastdb -in LitoriaAurea_Markers.fasta -title LitAur -dbtype nucl -out LitAur -parse_seqids
blastn -query 1_extractConsensus/15_5_All_Loci.fa -db 2_blast/2_1_Aurea/LitAur -out 2_blast/2_2_Aurea/16_2_LitAur.txt -evalue 1e-1 -outfmt 6
Now let’s import the resulting file
read_delim(
file = "../4_blast/2_blast/16_2_LitAur.txt",
delim = "\t",
col_names = c("queryID","searchID",
"Perc","Length",
"Diff","Gaps",
"qStart","qEnd",
"sstart","send",
"Evalue","score"))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## queryID = col_double(),
## searchID = col_character(),
## Perc = col_double(),
## Length = col_double(),
## Diff = col_double(),
## Gaps = col_double(),
## qStart = col_double(),
## qEnd = col_double(),
## sstart = col_double(),
## send = col_double(),
## Evalue = col_double(),
## score = col_double()
## )
## # A tibble: 1 × 12
## queryID searchID Perc Length Diff Gaps qStart qEnd sstart send Evalue
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3959 LiauCT14 94.2 69 4 0 1 69 1 69 9.24e-28
## # … with 1 more variable: score <dbl>
There’s only one match. Let’s see if this marker was mapped and whether it was identified as a sex marker in our study of L. serrata.
read_csv(file = "../3_SexLinked/15_5_SummaryTable_MapAndSex.csv") %>%
filter(Locus=="3959") %>% select(Locus, LG, SexDet, GWAS.associated,Zero_M:Two_M, Zero_F:Two_F)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## markerID = col_character(),
## fasta = col_character(),
## Segregation = col_character(),
## tag = col_character(),
## SexDet = col_character(),
## Method1 = col_logical(),
## Method2 = col_character(),
## Method5 = col_character(),
## GWAS.associated = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
## Warning: 5 parsing failures.
## row col expected actual file
## 2610 Method1 1/0/T/F/TRUE/FALSE Method1 '../3_SexLinked/15_5_SummaryTable_MapAndSex.csv'
## 4476 Method1 1/0/T/F/TRUE/FALSE Method1 '../3_SexLinked/15_5_SummaryTable_MapAndSex.csv'
## 5060 Method1 1/0/T/F/TRUE/FALSE Method1 '../3_SexLinked/15_5_SummaryTable_MapAndSex.csv'
## 7448 Method1 1/0/T/F/TRUE/FALSE Method1 '../3_SexLinked/15_5_SummaryTable_MapAndSex.csv'
## 8449 Method1 1/0/T/F/TRUE/FALSE Method1 '../3_SexLinked/15_5_SummaryTable_MapAndSex.csv'
## # A tibble: 2 × 10
## Locus LG SexDet GWAS.associated Zero_M One_M Two_M Zero_F One_F Two_F
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3959 NA <NA> non.significant 6 3 0 11 0 2
## 2 3959 NA <NA> non.significant 5 2 2 10 2 1
This marker was not placed on the linkage map (LG == NA), and it was not identified as either sex-linked nor sex-associated.
Now we will blast all fasta sequences from our stacks catalogue to the draft assemblies.
First we create a database for each of the assemblies. I downloaded the zipped fasta versions of both and placed it in the folder 4_blast/2_blast/. These were downloaded on the 8th of May 2023 from NCBI’s genome database.
Load conda/blast environment
module load conda3
source $CONDA_PROF/conda.sh
conda activate blast-2.11.0
Make one database per reference genome (run this command from folder with the 3 assemblies, all named aGenSpe.fasta.gz, where GenSpe is to be replaced with the first three letters of the Genus and species respectively).
for assembly in aBufBuf LepFal BufGar RhiMar PlaOrn
do
gunzip -c ${assembly}.fasta.gz | makeblastdb -in - -title ${assembly} -dbtype nucl -out ${assembly}
done
Then I can blast my consensus fasta sequences to each database
for assembly in aBufBuf LepFal BufGar RhiMar PlaOrn
do
blastn -query 1_extractConsensus/15_5_All_Loci.fa -db 2_blast/${assembly} -out 2_blast/16_2_${assembly}.txt -evalue 1e-15 -outfmt 6
done
I now have all the hits in tabular format. Let’s read in those files to assess the confidence in the hits, and remove any hits where more than 1 hit was found.
# i can't remember how to lapply through list of names at the moment, will change later
matches.buf <- read_delim(
"../4_blast/2_blast/16_2_aBufBuf.txt",
delim = "\t",
col_names = c("queryID","searchID",
"Perc","Length",
"Diff","Gaps",
"qStart","qEnd",
"sstart","send",
"Evalue","score"))
matches.lep <- read_delim(
"../4_blast/2_blast/16_2_LepFal.txt",
delim = "\t",
col_names = c("queryID","searchID",
"Perc","Length",
"Diff","Gaps",
"qStart","qEnd",
"sstart","send",
"Evalue","score"))
matches.gar <- read_delim(
"../4_blast/2_blast/16_2_BufGar.txt",
delim = "\t",
col_names = c("queryID","searchID",
"Perc","Length",
"Diff","Gaps",
"qStart","qEnd",
"sstart","send",
"Evalue","score"))
matches.pla <- read_delim(
"../4_blast/2_blast/16_2_PlaOrn.txt",
delim = "\t",
col_names = c("queryID","searchID",
"Perc","Length",
"Diff","Gaps",
"qStart","qEnd",
"sstart","send",
"Evalue","score"))
matches.rhi <- read_delim(
"../4_blast/2_blast/16_2_RhiMar.txt",
delim = "\t",
col_names = c("queryID","searchID",
"Perc","Length",
"Diff","Gaps",
"qStart","qEnd",
"sstart","send",
"Evalue","score"))
my.matches <- list(matches.buf, matches.lep, matches.gar, matches.rhi, matches.pla)
names(my.matches) <- c("Bufo","Leptodactylus","Gargarizans", "Rhinella", "Platyplectrum")
Now there are a few things we need to do:
We will then use the bed file to extract those regions from the Hourglass tree frog genome, to use them to query the Xenopus genome.
lapply(my.matches, function(x){
x %>%
group_by(queryID) %>%
summarise(firstE = min(Evalue),
scndE = nth(Evalue, 2, order_by = Evalue)) %>%
drop_na()
})
## $Bufo
## # A tibble: 87 × 3
## queryID firstE scndE
## <dbl> <dbl> <dbl>
## 1 438 1.55e-22 7.22e-21
## 2 556 3.36e-19 3.36e-19
## 3 638 4.29e-28 2.58e-25
## 4 1052 2.01e-21 7.22e-21
## 5 1347 2.07e-16 2.07e-16
## 6 1637 4.32e-23 4.32e-23
## 7 2060 4.35e-18 2.02e-16
## 8 3837 3.36e-19 1.21e-18
## 9 3993 4.35e-18 4.35e-18
## 10 4557 1.21e-18 1.21e-18
## # … with 77 more rows
##
## $Leptodactylus
## # A tibble: 39 × 3
## queryID firstE scndE
## <dbl> <dbl> <dbl>
## 1 438 2.2 e-23 2.2 e-23
## 2 556 1.71e-19 1.71e-19
## 3 1637 1.02e-21 1.32e-20
## 4 3993 4.76e-20 6.16e-19
## 5 5500 4.98e-20 1.79e-19
## 6 10288 2.86e-17 3.7 e-16
## 7 14278 1.32e-20 6.16e-19
## 8 15540 6.11e-24 6.11e-24
## 9 22154 1.71e-19 2.21e-18
## 10 22582 6.16e-19 2.21e-18
## # … with 29 more rows
##
## $Gargarizans
## # A tibble: 82 × 3
## queryID firstE scndE
## <dbl> <dbl> <dbl>
## 1 438 1.81e-21 1.81e-21
## 2 556 6.51e-21 3.92e-18
## 3 638 6.51e-21 6.51e-21
## 4 1052 6.51e-21 8.42e-20
## 5 1347 1.87e-16 1.87e-16
## 6 1637 1.81e-21 1.81e-21
## 7 2060 3.92e-18 3.92e-18
## 8 3837 3.03e-19 3.03e-19
## 9 3993 3.92e-18 6.55e-16
## 10 6728 5.15e-22 5.15e-22
## # … with 72 more rows
##
## $Rhinella
## # A tibble: 63 × 3
## queryID firstE scndE
## <dbl> <dbl> <dbl>
## 1 302 2.25e-18 1.05e-16
## 2 438 2.23e-23 8.04e-23
## 3 556 1.74e-19 1.74e-19
## 4 1637 2.23e-23 8.04e-23
## 5 3837 1.74e-19 6.26e-19
## 6 4557 3.74e-21 3.74e-21
## 7 5500 2.35e-18 2.35e-18
## 8 6371 3.77e-16 3.77e-16
## 9 7017 6.26e-19 1.05e-16
## 10 7627 2.22e-28 1.74e-19
## # … with 53 more rows
##
## $Platyplectrum
## # A tibble: 35 × 3
## queryID firstE scndE
## <dbl> <dbl> <dbl>
## 1 638 2.06e-20 9.58e-19
## 2 1052 7.41e-20 7.41e-20
## 3 7561 9.73e-24 1.26e-22
## 4 8236 4.46e-17 5.77e-16
## 5 13518 1.59e-21 7.41e-20
## 6 13763 5.77e-16 5.77e-16
## 7 15540 2.65e-24 2.65e-24
## 8 19075 1.22e-27 5.69e-26
## 9 20287 5.86e-21 7.58e-20
## 10 24462 9.8 e-19 9.8 e-19
## # … with 25 more rows
This part is a bit hard to assess visually within R, and to be brutally honest I tried writing a script ot retain the first hit only if the second is less than 5 order of magnitude, but I failed. So i opened the file in excel and went through them and identified matches that meet that condition.
keep.buf <- c(6371,9229,13763,18474,28399,30236,37034,81008,185510,205046)
keep.gar <- c(2060,6728,28399,34600,39637,78783,125009,160312)
keep.rhi <- c(7627)
keep.lep <- c(158095,176112,208728,219746)
#keep.pla <- c() no match matched these conditions for platyplectrum
Now we will retain those, and remove all other multiple matches
#extract matches to keep, then retain only top hit
keep.buf <- my.matches$Bufo %>%
filter(queryID %in% keep.buf) %>%
group_by(queryID) %>%
slice_min(order_by = Evalue, n=1,with_ties = FALSE) %>% ungroup()
#extract list of queryIDs to remove (those with multiple matches)
drop.buf <- my.matches$Bufo %>%
group_by(queryID) %>% tally() %>%
filter(n>1) %>% pull(queryID)
#drop if query ID in list
dropped.buf <- my.matches$Bufo %>%
filter(!queryID %in% drop.buf)
#merge the two
my.buf <- rbind(keep.buf,dropped.buf)
#replace in original list
my.matches$Bufo <- my.buf
#remove clutter
rm(my.buf,dropped.buf,drop.buf,keep.buf)
keep.gar <- my.matches$Gargarizans %>%
filter(queryID %in% keep.gar) %>%
group_by(queryID) %>%
slice_min(order_by = Evalue, n=1,with_ties = FALSE) %>% ungroup()
drop.gar <- my.matches$Gargarizans %>%
group_by(queryID) %>% tally() %>%
filter(n>1) %>% pull(queryID)
dropped.gar <- my.matches$Gargarizans %>%
filter(!queryID %in% drop.gar)
my.gar <- rbind(keep.gar,dropped.gar)
my.matches$Gargarizans <- my.gar
rm(my.gar,dropped.gar,drop.gar,keep.gar)
keep.lep <- my.matches$Leptodactylus %>%
filter(queryID %in% keep.lep) %>%
group_by(queryID) %>%
slice_min(order_by = Evalue, n=1,with_ties = FALSE) %>% ungroup()
drop.lep <- my.matches$Leptodactylus %>%
group_by(queryID) %>% tally() %>%
filter(n>1) %>% pull(queryID)
dropped.lep <- my.matches$Leptodactylus %>%
filter(!queryID %in% drop.lep)
my.lep <- rbind(keep.lep,dropped.lep)
my.matches$Leptodactylus <- my.lep
rm(my.lep,dropped.lep,drop.lep,keep.lep)
keep.rhi <- my.matches$Rhinella %>%
filter(queryID %in% keep.rhi) %>%
group_by(queryID) %>%
slice_min(order_by = Evalue, n=1,with_ties = FALSE) %>% ungroup()
drop.rhi <- my.matches$Rhinella %>%
group_by(queryID) %>% tally() %>%
filter(n>1) %>% pull(queryID)
dropped.rhi <- my.matches$Rhinella %>%
filter(!queryID %in% drop.rhi)
my.rhi <- rbind(keep.rhi,dropped.rhi)
my.matches$Rhinella <- my.rhi
rm(my.rhi,dropped.rhi,drop.rhi,keep.rhi)
For Platyplectrum we simply remove all those with more than 1 match
drop.pla <- my.matches$Platyplectrum %>%
group_by(queryID) %>% tally() %>%
filter(n>1) %>% pull(queryID)
dropped.pla <- my.matches$Platyplectrum %>%
filter(!queryID %in% drop.pla)
my.matches$Platyplectrum <- dropped.pla
rm(dropped.pla,drop.pla)
Now let’s import the linkage map data to see which one of these occur on the linkage map as well. We will retain only one entry per locus
my.map <- read_csv(file = "../3_SexLinked/15_5_SummaryTable_MapAndSex.csv") %>%
select(Locus,LG_locus,mean_Ave) %>% unique()
Finally we merge blast results for each assembly to the linkage map data
my.matches.lm <- lapply(my.matches, function(x){
y <- left_join(x,
my.map,
by=c("queryID"="Locus")) %>%
select(queryID, searchID, LG_locus, everything()) %>%
filter(LG_locus > 0)
return(y)
})
And plot as pivot tables
qpvt(my.matches.lm$Bufo, "searchID", "LG_locus", "n()")
## 1 2 3 4 5 6 7 9 10 11 12 Total
## SUPER_1 1 3 1 1 6
## SUPER_10 1 1
## SUPER_2 1 1 1 1 4
## SUPER_3 6 2 8
## SUPER_4 3 1 4
## SUPER_5 5 5
## SUPER_6 1 1 1 3 6
## SUPER_7 3 3
## SUPER_8 1 3 4
## SUPER_9 1 1 2 4
## Total 10 4 3 7 4 3 3 3 3 4 1 45
qpvt(my.matches.lm$Gargarizans, "searchID", "LG_locus", "n()")
## 1 2 3 4 5 6 7 9 10 11 Total
## CM026465.1 1 1 2 1 5
## CM026466.1 1 5 6
## CM026467.1 4 4
## CM026468.1 4 1 5
## CM026469.1 1 2 1 1 5
## CM026470.1 1 1 3 5
## CM026471.1 2 2
## CM026472.1 1 2 3
## CM026473.1 1 1 3 5
## CM026474.1 1 1 2
## CM026475.1 1 1 2
## JABFFT010000341.1 1 1
## Total 8 5 6 5 4 3 3 3 3 5 45
qpvt(my.matches.lm$Leptodactylus, "searchID", "LG_locus", "n()")
## 1 2 3 4 5 6 7 8 9 10 11 12 Total
## ENA|CAMRHE010000001|CAMRHE010000001.1 1 1
## ENA|CAMRHE010000004|CAMRHE010000004.1 1 1
## ENA|CAMRHE010000006|CAMRHE010000006.1 1 1
## ENA|CAMRHE010000017|CAMRHE010000017.1 1 1
## ENA|CAMRHE010000020|CAMRHE010000020.1 1 1
## ENA|CAMRHE010000021|CAMRHE010000021.1 1 1
## ENA|CAMRHE010000028|CAMRHE010000028.1 1 1
## ENA|CAMRHE010000029|CAMRHE010000029.1 1 1
## ENA|CAMRHE010000034|CAMRHE010000034.1 1 1
## ENA|CAMRHE010000036|CAMRHE010000036.1 1 1
## ENA|CAMRHE010000040|CAMRHE010000040.1 1 1
## ENA|CAMRHE010000043|CAMRHE010000043.1 1 1
## ENA|CAMRHE010000051|CAMRHE010000051.1 1 1
## ENA|CAMRHE010000062|CAMRHE010000062.1 1 1
## ENA|CAMRHE010000070|CAMRHE010000070.1 1 1
## ENA|CAMRHE010000071|CAMRHE010000071.1 1 1
## ENA|CAMRHE010000081|CAMRHE010000081.1 1 1
## ENA|CAMRHE010000082|CAMRHE010000082.1 1 1
## ENA|CAMRHE010000092|CAMRHE010000092.1 1 1
## ENA|CAMRHE010000102|CAMRHE010000102.1 1 1
## ENA|CAMRHE010000105|CAMRHE010000105.1 1 1
## ENA|CAMRHE010000106|CAMRHE010000106.1 1 1
## ENA|CAMRHE010000109|CAMRHE010000109.1 1 1
## ENA|CAMRHE010000112|CAMRHE010000112.1 1 1
## ENA|CAMRHE010000133|CAMRHE010000133.1 1 1
## ENA|CAMRHE010000137|CAMRHE010000137.1 1 1
## ENA|CAMRHE010000142|CAMRHE010000142.1 1 1
## ENA|CAMRHE010000155|CAMRHE010000155.1 1 1
## ENA|CAMRHE010000195|CAMRHE010000195.1 1 1
## ENA|CAMRHE010000209|CAMRHE010000209.1 1 1
## ENA|CAMRHE010000266|CAMRHE010000266.1 2 2
## ENA|CAMRHE010000302|CAMRHE010000302.1 1 1
## ENA|CAMRHE010000323|CAMRHE010000323.1 1 1
## ENA|CAMRHE010000341|CAMRHE010000341.1 1 1
## ENA|CAMRHE010000361|CAMRHE010000361.1 1 1
## ENA|CAMRHE010000381|CAMRHE010000381.1 1 1
## ENA|CAMRHE010000392|CAMRHE010000392.1 1 1
## ENA|CAMRHE010000448|CAMRHE010000448.1 1 1
## ENA|CAMRHE010000456|CAMRHE010000456.1 1 1
## ENA|CAMRHE010000459|CAMRHE010000459.1 1 1
## ENA|CAMRHE010000473|CAMRHE010000473.1 1 1
## ENA|CAMRHE010000518|CAMRHE010000518.1 1 1
## ENA|CAMRHE010000580|CAMRHE010000580.1 1 1
## ENA|CAMRHE010000591|CAMRHE010000591.1 1 1
## ENA|CAMRHE010000648|CAMRHE010000648.1 1 1
## ENA|CAMRHE010000692|CAMRHE010000692.1 1 1
## ENA|CAMRHE010000733|CAMRHE010000733.1 1 1
## ENA|CAMRHE010000802|CAMRHE010000802.1 1 1
## ENA|CAMRHE010000835|CAMRHE010000835.1 1 1
## ENA|CAMRHE010000844|CAMRHE010000844.1 1 1
## ENA|CAMRHE010001411|CAMRHE010001411.1 1 1
## Total 9 6 9 7 3 3 2 2 1 3 5 2 52
qpvt(my.matches.lm$Rhinella, "searchID", "LG_locus", "n()")
## 1 2 3 4 5 6 7 9 10 11 12 Total
## ONZH01002185.1 1 1
## ONZH01002998.1 1 1
## ONZH01004089.1 1 1
## ONZH01004433.1 1 1
## ONZH01004595.1 1 1
## ONZH01005235.1 1 1
## ONZH01007232.1 1 1
## ONZH01008736.1 1 1
## ONZH01009117.1 1 1
## ONZH01009633.1 1 1
## ONZH01010448.1 1 1
## ONZH01010689.1 1 1
## ONZH01010706.1 1 1
## ONZH01011743.1 1 1
## ONZH01011827.1 1 1
## ONZH01012283.1 1 1
## ONZH01012983.1 1 1
## ONZH01013238.1 1 1
## ONZH01013881.1 1 1
## ONZH01014197.1 1 1
## ONZH01014232.1 1 1
## ONZH01014800.1 1 1
## ONZH01014979.1 1 1
## ONZH01015224.1 1 1
## ONZH01015597.1 1 1
## ONZH01015972.1 1 1
## ONZH01017199.1 1 1
## ONZH01017616.1 1 1
## ONZH01018404.1 1 1
## ONZH01018720.1 1 1
## ONZH01020851.1 1 1
## ONZH01021045.1 1 1
## ONZH01022040.1 1 1
## ONZH01022772.1 1 1
## ONZH01023712.1 1 1
## ONZH01023775.1 1 1
## ONZH01024890.1 1 1
## ONZH01027000.1 1 1
## ONZH01028366.1 1 1
## ONZH01028847.1 1 1
## ONZH01029597.1 1 1
## ONZH01029888.1 1 1
## ONZH01030037.1 1 1
## Total 5 8 5 5 3 2 4 2 4 4 1 43
qpvt(my.matches.lm$Platyplectrum, "searchID", "LG_locus", "n()")
## 2 3 6 8 11 Total
## JABWIB010009571.1 1 1
## JABWIB010013348.1 1 1
## JABWIB010017146.1 1 1
## JABWIB010017960.1 1 1
## JABWIB010021123.1 1 1
## JABWIB010065037.1 1 1
## Total 2 1 1 1 1 6
Now we create a new sstart and send column where sstart is always < than send, and add
my.matches.lm <- lapply(my.matches.lm, function(x){
y <- x %>% mutate(sstart2 = case_when(sstart < send ~ sstart,
sstart > send ~ send),
send2 = case_when(sstart < send ~ send,
sstart > send ~ sstart),
length2 = send2 - sstart2)
return(y)
})
Add 2kb per side to the matching region
my.matches.lm <- lapply(my.matches.lm, function(x){
y <- x %>%
mutate(sstart3 = sstart2 - 2000,
send3 = send2 + 2000)
return(y)
})
Now let’s export those in bed format (i.e., scaffoldID<tab>startPos<tab>endPos
)
write_delim(my.matches.lm$Bufo %>%
select(searchID,sstart3,send3),
file = "../4_blast/2_blast/16_3_BufBuf_matches.bam",
delim = "\t",
col_names = FALSE)
write_delim(my.matches.lm$Leptodactylus %>%
select(searchID,sstart3,send3),
file = "../4_blast/2_blast/16_3_LepFal_matches.bam",
delim = "\t",
col_names = FALSE)
write_delim(my.matches.lm$Gargarizans %>%
select(searchID,sstart3,send3),
file = "../4_blast/2_blast/16_3_BufGar_matches.bam",
delim = "\t",
col_names = FALSE)
write_delim(my.matches.lm$Rhinella %>%
select(searchID,sstart3,send3),
file = "../4_blast/2_blast/16_3_RhiMar_matches.bam",
delim = "\t",
col_names = FALSE)
write_delim(my.matches.lm$Platyplectrum %>%
select(searchID,sstart3,send3),
file = "../4_blast/2_blast/16_3_PlaOrn_matches.bam",
delim = "\t",
col_names = FALSE)
And as tables for later use
write_delim(my.matches.lm$Bufo,
file = "../4_blast/2_blast/16_3_BufBuf_matches.txt",
delim = "\t")
write_delim(my.matches.lm$Leptodactylus,
file = "../4_blast/2_blast/16_3_LepFal_matches.txt",
delim = "\t")
write_delim(my.matches.lm$Gargarizans,
file = "../4_blast/2_blast/16_3_BufGar_matches.txt",
delim = "\t")
write_delim(my.matches.lm$Rhinella,
file = "../4_blast/2_blast/16_3_RhiMar_matches.txt",
delim = "\t")
write_delim(my.matches.lm$Platyplectrum,
file = "../4_blast/2_blast/16_3_PlaOrn_matches.txt",
delim = "\t")
Now let’s extract the corresponding fasta sequences with the following command. Note, you’ll have to replace assembly.fa and matches.fa with the names of the respective matches and assembly files.
module load bedtools
bedtools getfasta -fi assembly.fa -bed matches.bam -fo matches.fa
First, I download the X. tropicalis genome with the following command:
rsync --copy-links --recursive --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/004/195/GCA_000004195.4_UCB_Xtro_10.0/GCA_000004195.4_UCB_Xtro_10.0_genomic.fna.gz ./3_blast/3_1_xtrop/
I then create a database
conda activate blast-2.11.0
makeblastdb -in GCA_000004195.4_UCB_Xtro_10.0_genomic.fna -title Xtro10 -dbtype nucl -out Xtro10 -parse_seqids
And finally query the 4kb windows I extracted from the hourglass tree frog genome onto the Xenopus genome.
for assembly in BufBuf BufGar LepFal PlaOrn RhiMar
do
blastn -query 2_blast/2_2_matches/16_4_${assembly}_matches.fa -db 3_blast_XenLaevis/3_1_xtrop/Xtro10 -out 3_blast_XenLaevis/16_5_${assembly}2XenTrop.txt -evalue 1e-15 -outfmt 6
done
Now, before we merge the matches between the intermediate genomes and Xenopus, let’s retain only the best match if the difference is 5 orders of magnitude
Buf2Xen <- read_delim("../4_blast/2_blast/16_5_BufBuf2XenTrop.txt",
delim = "\t",
col_names = c("queryID","searchID",
"Perc","Length",
"Diff","Gaps",
"qStart","qEnd",
"sstart","send",
"Evalue","score"))
Gar2Xen <- read_delim("../4_blast/2_blast/16_5_BufGar2XenTrop.txt",
delim = "\t",
col_names = c("queryID","searchID",
"Perc","Length",
"Diff","Gaps",
"qStart","qEnd",
"sstart","send",
"Evalue","score"))
Lep2Xen <- read_delim("../4_blast/2_blast/16_5_LepFal2XenTrop.txt",
delim = "\t",
col_names = c("queryID","searchID",
"Perc","Length",
"Diff","Gaps",
"qStart","qEnd",
"sstart","send",
"Evalue","score"))
Pla2Xen <- read_delim("../4_blast/2_blast/16_5_PlaOrn2XenTrop.txt",
delim = "\t",
col_names = c("queryID","searchID",
"Perc","Length",
"Diff","Gaps",
"qStart","qEnd",
"sstart","send",
"Evalue","score"))
Rhi2Xen <- read_delim("../4_blast/2_blast/16_5_RhiMar2XenTrop.txt",
delim = "\t",
col_names = c("queryID","searchID",
"Perc","Length",
"Diff","Gaps",
"qStart","qEnd",
"sstart","send",
"Evalue","score"))
my.xen.list <- list(Buf2Xen, Gar2Xen, Lep2Xen, Pla2Xen, Rhi2Xen)
names(my.xen.list) <- c("Bufo","Gargarizans","Leptodactylus","Platyplectrum","Rhinella")
rm(Buf2Xen,Gar2Xen,Lep2Xen,Pla2Xen,Rhi2Xen)
lapply(my.xen.list, function(x){
x %>% group_by(queryID) %>%
summarise(firstE = min(Evalue),
scndE = nth(Evalue, 2, order_by = Evalue)) %>%
drop_na()
})
## $Bufo
## # A tibble: 8 × 3
## queryID firstE scndE
## <chr> <dbl> <dbl>
## 1 SUPER_1:709086661-709090729 0 0
## 2 SUPER_2:200820038-200824092 0 0
## 3 SUPER_2:645951670-645955731 0 0
## 4 SUPER_3:147300479-147304544 3.11e-67 4.02e- 66
## 5 SUPER_3:366739305-366743374 0 1.63e-164
## 6 SUPER_3:41116371-41120437 0 0
## 7 SUPER_4:536096509-536100577 0 7.03e- 34
## 8 SUPER_5:378670322-378674388 0 0
##
## $Gargarizans
## # A tibble: 6 × 3
## queryID firstE scndE
## <chr> <dbl> <dbl>
## 1 CM026465.1:186712537-186716598 0 0
## 2 CM026465.1:316786201-316790247 0 0
## 3 CM026467.1:518606935-518611000 6.41e-99 1.39e-95
## 4 CM026468.1:462842942-462847010 0 1.41e-85
## 5 CM026470.1:387945366-387949426 0 0
## 6 CM026473.1:186541833-186545901 0 0
##
## $Leptodactylus
## # A tibble: 8 × 3
## queryID firstE scndE
## <chr> <dbl> <dbl>
## 1 ENA|CAMRHE010000021|CAMRHE010000021.1:10133088-10137157 0 0
## 2 ENA|CAMRHE010000062|CAMRHE010000062.1:7623445-7627507 2.42e-63 9.28e- 18
## 3 ENA|CAMRHE010000209|CAMRHE010000209.1:2848462-2852526 2.53e-33 9.15e- 28
## 4 ENA|CAMRHE010000302|CAMRHE010000302.1:700758-704826 2 e-19 2 e- 19
## 5 ENA|CAMRHE010000381|CAMRHE010000381.1:852292-856359 1.07e-96 8.59e- 73
## 6 ENA|CAMRHE010000591|CAMRHE010000591.1:277239-281292 0 0
## 7 ENA|CAMRHE010000648|CAMRHE010000648.1:397223-401288 0 1.23e-180
## 8 ENA|CAMRHE010000802|CAMRHE010000802.1:280969-285021 3.3 e-22 1.54e- 20
##
## $Platyplectrum
## # A tibble: 0 × 3
## # … with 3 variables: queryID <chr>, firstE <dbl>, scndE <dbl>
##
## $Rhinella
## # A tibble: 1 × 3
## queryID firstE scndE
## <chr> <dbl> <dbl>
## 1 ONZH01010706.1:18505-22573 2.55e-28 1.98e-24
Once again, I inspected them visuall in excel to assess which matches were 4 order than magnitude different from the second best match. Below is a list of all queryID that DID NOT match the criteria (i.e., first and second match less than 4 order of magnitude different).
buf2remove <- c("SUPER_1:709086661-709090729","SUPER_2:200820038-200824092",
"SUPER_2:645951670-645955731","SUPER_3:147300479-147304544",
"SUPER_3:41116371-41120437","SUPER_5:378670322-378674388")
gar2remove <- c("CM026465.1:186712537-186716598","CM026465.1:316786201-316790247",
"CM026470.1:387945366-387949426","CM026473.1:186541833-186545901")
lep2remove <- c("ENA|CAMRHE010000021|CAMRHE010000021.1:10133088-10137157",
"ENA|CAMRHE010000302|CAMRHE010000302.1:700758-704826",
"ENA|CAMRHE010000591|CAMRHE010000591.1:277239-281292",
"ENA|CAMRHE010000802|CAMRHE010000802.1:280969-285021")
rhi2remove <- c("ONZH01010706.1:18505-22573")
So let’s remove those, and for the remanining matches retain the top one only
my.xen.list.2 <- my.xen.list
#we filter to retain only those we selected as confident matches, and then retain the first hit by e-value
my.xen.list.2$Bufo <- my.xen.list$Bufo %>%
filter(!queryID %in% buf2remove) %>%
group_by(queryID) %>%
slice_min(order_by = Evalue, n=1,with_ties = FALSE)
my.xen.list.2$Gargarizans <- my.xen.list$Gargarizans %>%
filter(!queryID %in% gar2remove) %>%
group_by(queryID) %>%
slice_min(order_by = Evalue, n=1,with_ties = FALSE)
my.xen.list.2$Leptodactylus <- my.xen.list$Leptodactylus %>%
filter(!queryID %in% lep2remove) %>%
group_by(queryID) %>%
slice_min(order_by = Evalue, n=1,with_ties = FALSE)
my.xen.list.2$Rhinella <- my.xen.list$Rhinella %>%
filter(!queryID %in% rhi2remove) %>%
group_by(queryID) %>%
slice_min(order_by = Evalue, n=1,with_ties = FALSE)
Let’s save as text files
write_delim(my.xen.list.2$Bufo,
file = "../4_blast/3_blast_XenLaevis/16_6_BufBuf2XenTrop.txt")
write_delim(my.xen.list.2$Gargarizans,
file = "../4_blast/3_blast_XenLaevis/16_6_BufGar2XenTrop.txt")
write_delim(my.xen.list.2$Leptodactylus,
file = "../4_blast/3_blast_XenLaevis/16_6_LepFal2XenTrop.txt")
write_delim(my.xen.list.2$Platyplectrum,
file = "../4_blast/3_blast_XenLaevis/16_6_PlaOrn2XenTrop.txt")
write_delim(my.xen.list.2$Rhinella,
file = "../4_blast/3_blast_XenLaevis/16_6_RhiMar2XenTrop.txt")
Now we have all the info we need. Let’s clear the R environment, and then assess to what xenopus chromosomes our linkage groups match to.
rm(list = ls())
Read in matches between L. serrata rad tags and intermediate assemblies assembly
my.Buf <- read_delim("../4_blast/2_blast/16_3_BufBuf_matches.txt", delim = "\t") %>%
add_column(Comparison="Bufo")
my.Gar <- read_delim("../4_blast/2_blast/16_3_BufGar_matches.txt", delim = "\t") %>%
add_column(Comparison="Gargarizans")
my.Lep <- read_delim("../4_blast/2_blast/16_3_LepFal_matches.txt", delim = "\t") %>%
add_column(Comparison="Leptodactylus")
my.Pla <- read_delim("../4_blast/2_blast/16_3_PlaOrn_matches.txt", delim = "\t") %>%
add_column(Comparison="Platyplectrum")
my.Rhi <- read_delim("../4_blast/2_blast/16_3_RhiMar_matches.txt", delim = "\t") %>%
add_column(Comparison="Rhinella")
my.df <- rbind(my.Buf,my.Gar,my.Lep,my.Pla,my.Rhi)
rm(my.Buf,my.Gar,my.Lep,my.Pla,my.Rhi)
Read in matches between intermediate assemblies and Xenopus assembly
Buf2Xen <- read_delim("../4_blast/3_blast_XenLaevis/16_6_BufBuf2XenTrop.txt",
delim = " ") %>% add_column(Comparison="Bufo") %>%
rename(queryID2=queryID)
Gar2Xen <- read_delim("../4_blast/3_blast_XenLaevis/16_6_BufGar2XenTrop.txt",
delim = " ") %>% add_column(Comparison="Gargarizans") %>%
rename(queryID2=queryID)
Lep2Xen <- read_delim("../4_blast/3_blast_XenLaevis/16_6_LepFal2XenTrop.txt",
delim = " ") %>% add_column(Comparison="Leptodactylus") %>%
rename(queryID2=queryID)
Pla2Xen <- read_delim("../4_blast/3_blast_XenLaevis/16_6_PlaOrn2XenTrop.txt",
delim = " ") %>% add_column(Comparison="Platyplectrum") %>%
rename(queryID2=queryID)
Rhi2Xen <- read_delim("../4_blast/3_blast_XenLaevis/16_6_RhiMar2XenTrop.txt",
delim = " ") %>% add_column(Comparison="Rhinella") %>%
rename(queryID2=queryID)
my.df2 <- rbind(Buf2Xen,Gar2Xen,Lep2Xen,Pla2Xen,Rhi2Xen)
rm(Buf2Xen,Gar2Xen,Lep2Xen,Pla2Xen,Rhi2Xen)
Now add a column that will match the search sequence id in the Intermediate assembly to Xenopus comparison:
my.df <- my.df %>%
mutate(queryID2 = paste0(searchID,":",sstart3,"-",send3))
Now we can merge with the matches to the Xenopus genome by queryID
my.df3 <- left_join(my.df2, my.df, by="queryID2",
suffix=c(".Xen",".Int"))
my.df3 <- my.df3 %>%
select(queryID2,searchID.Xen,
LG_locus, queryID, searchID.Int,
Comparison.Xen, Comparison.Int)
Let’s visualise matches between our linkage map and the Xenopus tropicalis chromosome level assembly.
qpvt(my.df3, "searchID.Xen", "LG_locus", "n()")
## 1 2 3 4 5 6 9 10 11 Total
## AAMC04000030.1 1 1
## CM004443.2 1 4 5
## CM004444.2 4 1 5
## CM004445.2 2 1 3
## CM004446.2 1 5 6
## CM004447.2 8 1 9
## CM004448.2 5 5
## CM004449.2 1 1
## CM004451.2 2 2
## Total 6 9 3 5 5 5 1 2 1 37
There seems to be some synteny. But careful! Some of these values are inflated as the same locus matched to multiple intermediate assemblies and thus now shows as having more matches to the Xenopus assembly.
Let’s filter to retain one hit per locus (multiple hits per locus are concordant). Note, I checked that when having multiple matches per locus, they all mapped to the same X. tropicalis chromosome.
my.df4 <- my.df3 %>% distinct(queryID, .keep_all = TRUE)
let’s re-run it now to check the real unique matches
qpvt(my.df4, "searchID.Xen", "LG_locus", "n()")
## 1 2 3 4 5 6 9 10 11 Total
## AAMC04000030.1 1 1
## CM004443.2 1 1 2
## CM004444.2 3 1 4
## CM004445.2 2 1 3
## CM004446.2 1 2 3
## CM004447.2 3 1 4
## CM004448.2 2 2
## CM004449.2 1 1
## CM004451.2 1 1
## Total 5 4 3 2 2 2 1 1 1 21
It does appear that there is some level of synteny between the linkage maps produced here and the 6 genome assemblies we investigated (Bufo bufo, Bufo gargarizans, Leptodactylus fallax, Platyplectrum ornatum, Rhinella marina and Xenopus tropicalis).
Unfortunately against the Xenopus chromosome level assembly we only have 21 unique hits, which, spread across 13 linkage groups and with possible re-arrangements and/or errors, make it hard to confirm whether the observed synteny represents a true signal (e.g., LG2 in our map is homologous to Chromosome CM004447 in Xenopus) or whether we obtained three hits for that linkage group by chance.