16. Matching linkage maps to X. tropicalis genome

0. Load packages
1. Extract consensus sequences for linkage map markers
2. Blast map consensus sequences to other resources
- 2.1. Blast to Litoria aurea sex markers
- 2.2. Blast to anuran assemblies
3. Extract 4kb windows from reference assemblies
4. Blast extracted regions to X. tropicalis reference genome
5. Match Linkage groups to chromosomes
6. Summary

So far in this project we have:

Called genotypes using stacks
Produced linkage maps using lepmap3
Identified sex-linked loci and a likely sex-chromosome

Now we will match the linkage maps we produced to the chromosome level assemblies of a few other Anurans. These will include: - Bufo Bufo (GCA_905171765) - Dendropsophus ebraccatus (GCA_027789725) - Hyla sarda (GCA_029499605)

We will achieve this in 5 steps:

Extract consensus sequences for linkage map markers from stacks catalogue
Align these sequences to the reference genomes
Extract 4kb regions around high confidence matches
Align extracted regions to X. tropicalis assembly
Match Litoria serrata linkage groups to X.tropicalis chromosomes

0. Load packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## Warning: package 'tidyr' was built under R version 4.0.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(pivottabler)

1. Extract consensus sequences for linkage map markers

We conducted this step in the file 15_SexLinkedLoci_GWAS to produce a summary tables of all markers included in the study. This is Table S5 in the Supplementary material. The intermediate file, 15_5_All_Loci.fa includes the consensus sequences from all loci in fasta format.

2. Blast map consensus sequences to other resources

2.1. Blast to Litoria aurea sex markers

First, we will blast all markers in our catalogue to the fasta sequences for the ref and alternate allele for the sex markers identified in Litoria aurea in Sopniewski et al. 2019.

First, I produce a fasta file containing all sequences (reference and alternate alleles both included) and save it as LitoriaAurea_Markers.fasta.

Then, within the JCU HPRC I create a blast database, and then blast the sequences in 15_5_All_Loci.fa against it.

module load conda3
source $CONDA_PROF/conda.sh
conda activate blast-2.11.0

makeblastdb -in LitoriaAurea_Markers.fasta -title LitAur -dbtype nucl -out LitAur -parse_seqids

blastn -query 1_extractConsensus/15_5_All_Loci.fa -db 2_blast/2_1_Aurea/LitAur -out 2_blast/2_2_Aurea/16_2_LitAur.txt -evalue 1e-1 -outfmt 6

Now let’s import the resulting file

read_delim(
  file = "../4_blast/2_blast/16_2_LitAur.txt",
  delim = "\t",
  col_names = c("queryID","searchID",
                "Perc","Length",
                "Diff","Gaps",
                "qStart","qEnd",
                "sstart","send",
                "Evalue","score"))

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   queryID = col_double(),
##   searchID = col_character(),
##   Perc = col_double(),
##   Length = col_double(),
##   Diff = col_double(),
##   Gaps = col_double(),
##   qStart = col_double(),
##   qEnd = col_double(),
##   sstart = col_double(),
##   send = col_double(),
##   Evalue = col_double(),
##   score = col_double()
## )

## # A tibble: 1 × 12
##   queryID searchID  Perc Length  Diff  Gaps qStart  qEnd sstart  send   Evalue
##     <dbl> <chr>    <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>    <dbl>
## 1    3959 LiauCT14  94.2     69     4     0      1    69      1    69 9.24e-28
## # … with 1 more variable: score <dbl>

There’s only one match. Let’s see if this marker was mapped and whether it was identified as a sex marker in our study of L. serrata.

read_csv(file = "../3_SexLinked/15_5_SummaryTable_MapAndSex.csv") %>%
  filter(Locus=="3959") %>% select(Locus, LG, SexDet, GWAS.associated,Zero_M:Two_M, Zero_F:Two_F)

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   markerID = col_character(),
##   fasta = col_character(),
##   Segregation = col_character(),
##   tag = col_character(),
##   SexDet = col_character(),
##   Method1 = col_logical(),
##   Method2 = col_character(),
##   Method5 = col_character(),
##   GWAS.associated = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

## Warning: 5 parsing failures.
##  row     col           expected  actual                                             file
## 2610 Method1 1/0/T/F/TRUE/FALSE Method1 '../3_SexLinked/15_5_SummaryTable_MapAndSex.csv'
## 4476 Method1 1/0/T/F/TRUE/FALSE Method1 '../3_SexLinked/15_5_SummaryTable_MapAndSex.csv'
## 5060 Method1 1/0/T/F/TRUE/FALSE Method1 '../3_SexLinked/15_5_SummaryTable_MapAndSex.csv'
## 7448 Method1 1/0/T/F/TRUE/FALSE Method1 '../3_SexLinked/15_5_SummaryTable_MapAndSex.csv'
## 8449 Method1 1/0/T/F/TRUE/FALSE Method1 '../3_SexLinked/15_5_SummaryTable_MapAndSex.csv'

## # A tibble: 2 × 10
##   Locus    LG SexDet GWAS.associated Zero_M One_M Two_M Zero_F One_F Two_F
##   <dbl> <dbl> <chr>  <chr>            <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1  3959    NA <NA>   non.significant      6     3     0     11     0     2
## 2  3959    NA <NA>   non.significant      5     2     2     10     2     1

This marker was not placed on the linkage map (LG == NA), and it was not identified as either sex-linked nor sex-associated.

2.2. Blast to anuran assemblies

Now we will blast all fasta sequences from our stacks catalogue to the draft assemblies.

First we create a database for each of the assemblies. I downloaded the zipped fasta versions of both and placed it in the folder 4_blast/2_blast/. These were downloaded on the 8th of May 2023 from NCBI’s genome database.

Load conda/blast environment

module load conda3
source $CONDA_PROF/conda.sh
conda activate blast-2.11.0

Make one database per reference genome (run this command from folder with the 3 assemblies, all named aGenSpe.fasta.gz, where GenSpe is to be replaced with the first three letters of the Genus and species respectively).

for assembly in aBufBuf LepFal BufGar RhiMar PlaOrn
do
 gunzip -c ${assembly}.fasta.gz | makeblastdb -in - -title ${assembly} -dbtype nucl -out ${assembly}
done

Then I can blast my consensus fasta sequences to each database

for assembly in aBufBuf LepFal BufGar RhiMar PlaOrn
do
  blastn -query 1_extractConsensus/15_5_All_Loci.fa -db 2_blast/${assembly} -out 2_blast/16_2_${assembly}.txt -evalue 1e-15 -outfmt 6
done

I now have all the hits in tabular format. Let’s read in those files to assess the confidence in the hits, and remove any hits where more than 1 hit was found.

# i can't remember how to lapply through list of names at the moment, will change later
matches.buf <- read_delim(
  "../4_blast/2_blast/16_2_aBufBuf.txt",
  delim = "\t",
  col_names = c("queryID","searchID",
                "Perc","Length",
                "Diff","Gaps",
                "qStart","qEnd",
                "sstart","send",
                "Evalue","score"))

matches.lep <- read_delim(
  "../4_blast/2_blast/16_2_LepFal.txt",
  delim = "\t",
  col_names = c("queryID","searchID",
                "Perc","Length",
                "Diff","Gaps",
                "qStart","qEnd",
                "sstart","send",
                "Evalue","score"))

matches.gar <- read_delim(
  "../4_blast/2_blast/16_2_BufGar.txt",
  delim = "\t",
  col_names = c("queryID","searchID",
                "Perc","Length",
                "Diff","Gaps",
                "qStart","qEnd",
                "sstart","send",
                "Evalue","score"))

matches.pla <- read_delim(
  "../4_blast/2_blast/16_2_PlaOrn.txt",
  delim = "\t",
  col_names = c("queryID","searchID",
                "Perc","Length",
                "Diff","Gaps",
                "qStart","qEnd",
                "sstart","send",
                "Evalue","score"))

matches.rhi <- read_delim(
  "../4_blast/2_blast/16_2_RhiMar.txt",
  delim = "\t",
  col_names = c("queryID","searchID",
                "Perc","Length",
                "Diff","Gaps",
                "qStart","qEnd",
                "sstart","send",
                "Evalue","score"))

my.matches <- list(matches.buf, matches.lep, matches.gar, matches.rhi, matches.pla)
names(my.matches) <- c("Bufo","Leptodactylus","Gargarizans", "Rhinella", "Platyplectrum")

Now there are a few things we need to do:

Retain only matches where 2nd match, if present, is 5 orders of magnitude smaller
Order sstart and send so that sstart < send
Expand the matching search region by 2kb per side
Export scaffold ID (i.e., searchID), sstart and send as bed file

We will then use the bed file to extract those regions from the Hourglass tree frog genome, to use them to query the Xenopus genome.

lapply(my.matches, function(x){
  x %>%
  group_by(queryID) %>%
  summarise(firstE = min(Evalue),
            scndE = nth(Evalue, 2, order_by = Evalue)) %>%
  drop_na()
})

## $Bufo
## # A tibble: 87 × 3
##    queryID   firstE    scndE
##      <dbl>    <dbl>    <dbl>
##  1     438 1.55e-22 7.22e-21
##  2     556 3.36e-19 3.36e-19
##  3     638 4.29e-28 2.58e-25
##  4    1052 2.01e-21 7.22e-21
##  5    1347 2.07e-16 2.07e-16
##  6    1637 4.32e-23 4.32e-23
##  7    2060 4.35e-18 2.02e-16
##  8    3837 3.36e-19 1.21e-18
##  9    3993 4.35e-18 4.35e-18
## 10    4557 1.21e-18 1.21e-18
## # … with 77 more rows
## 
## $Leptodactylus
## # A tibble: 39 × 3
##    queryID   firstE    scndE
##      <dbl>    <dbl>    <dbl>
##  1     438 2.2 e-23 2.2 e-23
##  2     556 1.71e-19 1.71e-19
##  3    1637 1.02e-21 1.32e-20
##  4    3993 4.76e-20 6.16e-19
##  5    5500 4.98e-20 1.79e-19
##  6   10288 2.86e-17 3.7 e-16
##  7   14278 1.32e-20 6.16e-19
##  8   15540 6.11e-24 6.11e-24
##  9   22154 1.71e-19 2.21e-18
## 10   22582 6.16e-19 2.21e-18
## # … with 29 more rows
## 
## $Gargarizans
## # A tibble: 82 × 3
##    queryID   firstE    scndE
##      <dbl>    <dbl>    <dbl>
##  1     438 1.81e-21 1.81e-21
##  2     556 6.51e-21 3.92e-18
##  3     638 6.51e-21 6.51e-21
##  4    1052 6.51e-21 8.42e-20
##  5    1347 1.87e-16 1.87e-16
##  6    1637 1.81e-21 1.81e-21
##  7    2060 3.92e-18 3.92e-18
##  8    3837 3.03e-19 3.03e-19
##  9    3993 3.92e-18 6.55e-16
## 10    6728 5.15e-22 5.15e-22
## # … with 72 more rows
## 
## $Rhinella
## # A tibble: 63 × 3
##    queryID   firstE    scndE
##      <dbl>    <dbl>    <dbl>
##  1     302 2.25e-18 1.05e-16
##  2     438 2.23e-23 8.04e-23
##  3     556 1.74e-19 1.74e-19
##  4    1637 2.23e-23 8.04e-23
##  5    3837 1.74e-19 6.26e-19
##  6    4557 3.74e-21 3.74e-21
##  7    5500 2.35e-18 2.35e-18
##  8    6371 3.77e-16 3.77e-16
##  9    7017 6.26e-19 1.05e-16
## 10    7627 2.22e-28 1.74e-19
## # … with 53 more rows
## 
## $Platyplectrum
## # A tibble: 35 × 3
##    queryID   firstE    scndE
##      <dbl>    <dbl>    <dbl>
##  1     638 2.06e-20 9.58e-19
##  2    1052 7.41e-20 7.41e-20
##  3    7561 9.73e-24 1.26e-22
##  4    8236 4.46e-17 5.77e-16
##  5   13518 1.59e-21 7.41e-20
##  6   13763 5.77e-16 5.77e-16
##  7   15540 2.65e-24 2.65e-24
##  8   19075 1.22e-27 5.69e-26
##  9   20287 5.86e-21 7.58e-20
## 10   24462 9.8 e-19 9.8 e-19
## # … with 25 more rows

This part is a bit hard to assess visually within R, and to be brutally honest I tried writing a script ot retain the first hit only if the second is less than 5 order of magnitude, but I failed. So i opened the file in excel and went through them and identified matches that meet that condition.

keep.buf <- c(6371,9229,13763,18474,28399,30236,37034,81008,185510,205046)
keep.gar <- c(2060,6728,28399,34600,39637,78783,125009,160312)
keep.rhi <- c(7627)
keep.lep <- c(158095,176112,208728,219746)
#keep.pla <- c() no match matched these conditions for platyplectrum

Now we will retain those, and remove all other multiple matches

#extract matches to keep, then retain only top hit
keep.buf <- my.matches$Bufo %>%
  filter(queryID %in% keep.buf) %>%
  group_by(queryID) %>%
  slice_min(order_by = Evalue, n=1,with_ties = FALSE) %>% ungroup()

#extract list of queryIDs to remove (those with multiple matches)
drop.buf <- my.matches$Bufo %>%
  group_by(queryID) %>% tally() %>%
  filter(n>1) %>% pull(queryID)

#drop if query ID in list
dropped.buf <- my.matches$Bufo %>%
  filter(!queryID %in% drop.buf)

#merge the two
my.buf <- rbind(keep.buf,dropped.buf)

#replace in original list
my.matches$Bufo <- my.buf

#remove clutter
rm(my.buf,dropped.buf,drop.buf,keep.buf)

keep.gar <- my.matches$Gargarizans %>%
  filter(queryID %in% keep.gar) %>%
  group_by(queryID) %>%
  slice_min(order_by = Evalue, n=1,with_ties = FALSE) %>% ungroup()

drop.gar <- my.matches$Gargarizans %>%
  group_by(queryID) %>% tally() %>%
  filter(n>1) %>% pull(queryID)

dropped.gar <- my.matches$Gargarizans %>%
  filter(!queryID %in% drop.gar)

my.gar <- rbind(keep.gar,dropped.gar)

my.matches$Gargarizans <- my.gar

rm(my.gar,dropped.gar,drop.gar,keep.gar)

keep.lep <- my.matches$Leptodactylus %>%
  filter(queryID %in% keep.lep) %>%
  group_by(queryID) %>%
  slice_min(order_by = Evalue, n=1,with_ties = FALSE) %>% ungroup()

drop.lep <- my.matches$Leptodactylus %>%
  group_by(queryID) %>% tally() %>%
  filter(n>1) %>% pull(queryID)

dropped.lep <- my.matches$Leptodactylus %>%
  filter(!queryID %in% drop.lep)

my.lep <- rbind(keep.lep,dropped.lep)

my.matches$Leptodactylus <- my.lep

rm(my.lep,dropped.lep,drop.lep,keep.lep)

keep.rhi <- my.matches$Rhinella %>%
  filter(queryID %in% keep.rhi) %>%
  group_by(queryID) %>%
  slice_min(order_by = Evalue, n=1,with_ties = FALSE) %>% ungroup()

drop.rhi <- my.matches$Rhinella %>%
  group_by(queryID) %>% tally() %>%
  filter(n>1) %>% pull(queryID)

dropped.rhi <- my.matches$Rhinella %>%
  filter(!queryID %in% drop.rhi)

my.rhi <- rbind(keep.rhi,dropped.rhi)

my.matches$Rhinella <- my.rhi

rm(my.rhi,dropped.rhi,drop.rhi,keep.rhi)

For Platyplectrum we simply remove all those with more than 1 match

drop.pla <- my.matches$Platyplectrum %>%
  group_by(queryID) %>% tally() %>%
  filter(n>1) %>% pull(queryID)

dropped.pla <- my.matches$Platyplectrum %>%
  filter(!queryID %in% drop.pla)

my.matches$Platyplectrum <- dropped.pla

rm(dropped.pla,drop.pla)

Now let’s import the linkage map data to see which one of these occur on the linkage map as well. We will retain only one entry per locus

my.map <- read_csv(file = "../3_SexLinked/15_5_SummaryTable_MapAndSex.csv") %>%
  select(Locus,LG_locus,mean_Ave) %>% unique()

Finally we merge blast results for each assembly to the linkage map data

my.matches.lm <- lapply(my.matches, function(x){
  y <-  left_join(x,
                  my.map,
                  by=c("queryID"="Locus")) %>%
    select(queryID, searchID, LG_locus, everything()) %>%
    filter(LG_locus > 0)
  
  return(y)
})

And plot as pivot tables

qpvt(my.matches.lm$Bufo, "searchID", "LG_locus", "n()")

##           1   2  3  4  5  6  7  9  10  11  12  Total  
## SUPER_1    1     3                      1   1      6  
## SUPER_10               1                           1  
## SUPER_2    1           1  1  1                     4  
## SUPER_3    6              2                        8  
## SUPER_4       3              1                     4  
## SUPER_5             5                              5  
## SUPER_6    1  1     1                   3          6  
## SUPER_7                             3              3  
## SUPER_8                      1  3                  4  
## SUPER_9    1        1  2                           4  
## Total     10  4  3  7  4  3  3  3   3   4   1     45

qpvt(my.matches.lm$Gargarizans, "searchID", "LG_locus", "n()")

##                    1  2  3  4  5  6  7  9  10  11  Total  
## CM026465.1         1        1     2  1                 5  
## CM026466.1         1     5                             6  
## CM026467.1         4                                   4  
## CM026468.1            4  1                             5  
## CM026469.1         1        2     1  1                 5  
## CM026470.1         1  1                         3      5  
## CM026471.1                     2                       2  
## CM026472.1                  1               2          3  
## CM026473.1                     1     1  3              5  
## CM026474.1                     1                1      2  
## CM026475.1                                  1   1      2  
## JABFFT010000341.1           1                          1  
## Total              8  5  6  5  4  3  3  3   3   5     45

qpvt(my.matches.lm$Leptodactylus, "searchID", "LG_locus", "n()")

##                                        1  2  3  4  5  6  7  8  9  10  11  12  Total  
## ENA|CAMRHE010000001|CAMRHE010000001.1        1                                    1  
## ENA|CAMRHE010000004|CAMRHE010000004.1                                  1          1  
## ENA|CAMRHE010000006|CAMRHE010000006.1     1                                       1  
## ENA|CAMRHE010000017|CAMRHE010000017.1              1                              1  
## ENA|CAMRHE010000020|CAMRHE010000020.1           1                                 1  
## ENA|CAMRHE010000021|CAMRHE010000021.1                 1                           1  
## ENA|CAMRHE010000028|CAMRHE010000028.1                              1              1  
## ENA|CAMRHE010000029|CAMRHE010000029.1  1                                          1  
## ENA|CAMRHE010000034|CAMRHE010000034.1  1                                          1  
## ENA|CAMRHE010000036|CAMRHE010000036.1        1                                    1  
## ENA|CAMRHE010000040|CAMRHE010000040.1           1                                 1  
## ENA|CAMRHE010000043|CAMRHE010000043.1                                  1          1  
## ENA|CAMRHE010000051|CAMRHE010000051.1           1                                 1  
## ENA|CAMRHE010000062|CAMRHE010000062.1                 1                           1  
## ENA|CAMRHE010000070|CAMRHE010000070.1                              1              1  
## ENA|CAMRHE010000071|CAMRHE010000071.1                 1                           1  
## ENA|CAMRHE010000081|CAMRHE010000081.1        1                                    1  
## ENA|CAMRHE010000082|CAMRHE010000082.1           1                                 1  
## ENA|CAMRHE010000092|CAMRHE010000092.1     1                                       1  
## ENA|CAMRHE010000102|CAMRHE010000102.1  1                                          1  
## ENA|CAMRHE010000105|CAMRHE010000105.1           1                                 1  
## ENA|CAMRHE010000106|CAMRHE010000106.1                              1              1  
## ENA|CAMRHE010000109|CAMRHE010000109.1                                  1          1  
## ENA|CAMRHE010000112|CAMRHE010000112.1  1                                          1  
## ENA|CAMRHE010000133|CAMRHE010000133.1                    1                        1  
## ENA|CAMRHE010000137|CAMRHE010000137.1        1                                    1  
## ENA|CAMRHE010000142|CAMRHE010000142.1              1                              1  
## ENA|CAMRHE010000155|CAMRHE010000155.1              1                              1  
## ENA|CAMRHE010000195|CAMRHE010000195.1     1                                       1  
## ENA|CAMRHE010000209|CAMRHE010000209.1     1                                       1  
## ENA|CAMRHE010000266|CAMRHE010000266.1                                      2      2  
## ENA|CAMRHE010000302|CAMRHE010000302.1                                  1          1  
## ENA|CAMRHE010000323|CAMRHE010000323.1           1                                 1  
## ENA|CAMRHE010000341|CAMRHE010000341.1        1                                    1  
## ENA|CAMRHE010000361|CAMRHE010000361.1                       1                     1  
## ENA|CAMRHE010000381|CAMRHE010000381.1  1                                          1  
## ENA|CAMRHE010000392|CAMRHE010000392.1        1                                    1  
## ENA|CAMRHE010000448|CAMRHE010000448.1  1                                          1  
## ENA|CAMRHE010000456|CAMRHE010000456.1                       1                     1  
## ENA|CAMRHE010000459|CAMRHE010000459.1        1                                    1  
## ENA|CAMRHE010000473|CAMRHE010000473.1     1                                       1  
## ENA|CAMRHE010000518|CAMRHE010000518.1  1                                          1  
## ENA|CAMRHE010000580|CAMRHE010000580.1        1                                    1  
## ENA|CAMRHE010000591|CAMRHE010000591.1                          1                  1  
## ENA|CAMRHE010000648|CAMRHE010000648.1                                  1          1  
## ENA|CAMRHE010000692|CAMRHE010000692.1        1                                    1  
## ENA|CAMRHE010000733|CAMRHE010000733.1  1                                          1  
## ENA|CAMRHE010000802|CAMRHE010000802.1     1                                       1  
## ENA|CAMRHE010000835|CAMRHE010000835.1           1                                 1  
## ENA|CAMRHE010000844|CAMRHE010000844.1                    1                        1  
## ENA|CAMRHE010001411|CAMRHE010001411.1  1                                          1  
## Total                                  9  6  9  7  3  3  2  2  1   3   5   2     52

qpvt(my.matches.lm$Rhinella, "searchID", "LG_locus", "n()")

##                 1  2  3  4  5  6  7  9  10  11  12  Total  
## ONZH01002185.1     1                                    1  
## ONZH01002998.1           1                              1  
## ONZH01004089.1              1                           1  
## ONZH01004433.1                           1              1  
## ONZH01004595.1           1                              1  
## ONZH01005235.1           1                              1  
## ONZH01007232.1                    1                     1  
## ONZH01008736.1  1                                       1  
## ONZH01009117.1                    1                     1  
## ONZH01009633.1                    1                     1  
## ONZH01010448.1        1                                 1  
## ONZH01010689.1  1                                       1  
## ONZH01010706.1                    1                     1  
## ONZH01011743.1     1                                    1  
## ONZH01011827.1  1                                       1  
## ONZH01012283.1              1                           1  
## ONZH01012983.1     1                                    1  
## ONZH01013238.1                                   1      1  
## ONZH01013881.1           1                              1  
## ONZH01014197.1        1                                 1  
## ONZH01014232.1  1                                       1  
## ONZH01014800.1     1                                    1  
## ONZH01014979.1        1                                 1  
## ONZH01015224.1        1                                 1  
## ONZH01015597.1                               1          1  
## ONZH01015972.1                           1              1  
## ONZH01017199.1  1                                       1  
## ONZH01017616.1                               1          1  
## ONZH01018404.1                           1              1  
## ONZH01018720.1                 1                        1  
## ONZH01020851.1              1                           1  
## ONZH01021045.1                           1              1  
## ONZH01022040.1     1                                    1  
## ONZH01022772.1                       1                  1  
## ONZH01023712.1     1                                    1  
## ONZH01023775.1        1                                 1  
## ONZH01024890.1                               1          1  
## ONZH01027000.1     1                                    1  
## ONZH01028366.1                 1                        1  
## ONZH01028847.1                               1          1  
## ONZH01029597.1           1                              1  
## ONZH01029888.1     1                                    1  
## ONZH01030037.1                       1                  1  
## Total           5  8  5  5  3  2  4  2   4   4   1     43

qpvt(my.matches.lm$Platyplectrum, "searchID", "LG_locus", "n()")

##                    2  3  6  8  11  Total  
## JABWIB010009571.1        1             1  
## JABWIB010013348.1  1                   1  
## JABWIB010017146.1     1                1  
## JABWIB010017960.1           1          1  
## JABWIB010021123.1               1      1  
## JABWIB010065037.1  1                   1  
## Total              2  1  1  1   1      6

Now we create a new sstart and send column where sstart is always < than send, and add

my.matches.lm <- lapply(my.matches.lm, function(x){
  y <- x %>% mutate(sstart2 = case_when(sstart < send ~ sstart,
                                        sstart > send ~ send),
                    send2 = case_when(sstart < send ~ send,
                                      sstart > send ~ sstart),
                    length2 = send2 - sstart2)
  
  return(y)
})

Add 2kb per side to the matching region

my.matches.lm <- lapply(my.matches.lm, function(x){
  y <- x %>%
    mutate(sstart3 = sstart2 - 2000,
         send3 = send2 + 2000)
  
  return(y)
})

Now let’s export those in bed format (i.e., scaffoldID<tab>startPos<tab>endPos)

write_delim(my.matches.lm$Bufo %>%
              select(searchID,sstart3,send3),
            file = "../4_blast/2_blast/16_3_BufBuf_matches.bam",
            delim = "\t",
            col_names = FALSE)

write_delim(my.matches.lm$Leptodactylus %>%
              select(searchID,sstart3,send3),
            file = "../4_blast/2_blast/16_3_LepFal_matches.bam",
            delim = "\t",
            col_names = FALSE)

write_delim(my.matches.lm$Gargarizans %>%
              select(searchID,sstart3,send3),
            file = "../4_blast/2_blast/16_3_BufGar_matches.bam",
            delim = "\t",
            col_names = FALSE)

write_delim(my.matches.lm$Rhinella %>%
              select(searchID,sstart3,send3),
            file = "../4_blast/2_blast/16_3_RhiMar_matches.bam",
            delim = "\t",
            col_names = FALSE)

write_delim(my.matches.lm$Platyplectrum %>%
              select(searchID,sstart3,send3),
            file = "../4_blast/2_blast/16_3_PlaOrn_matches.bam",
            delim = "\t",
            col_names = FALSE)

And as tables for later use

write_delim(my.matches.lm$Bufo,
            file = "../4_blast/2_blast/16_3_BufBuf_matches.txt",
            delim = "\t")

write_delim(my.matches.lm$Leptodactylus,
            file = "../4_blast/2_blast/16_3_LepFal_matches.txt",
            delim = "\t")

write_delim(my.matches.lm$Gargarizans,
            file = "../4_blast/2_blast/16_3_BufGar_matches.txt",
            delim = "\t")

write_delim(my.matches.lm$Rhinella,
            file = "../4_blast/2_blast/16_3_RhiMar_matches.txt",
            delim = "\t")

write_delim(my.matches.lm$Platyplectrum,
            file = "../4_blast/2_blast/16_3_PlaOrn_matches.txt",
            delim = "\t")

3. Extract 4kb windows from reference assemblies

Now let’s extract the corresponding fasta sequences with the following command. Note, you’ll have to replace assembly.fa and matches.fa with the names of the respective matches and assembly files.

module load bedtools

bedtools getfasta -fi assembly.fa -bed matches.bam -fo matches.fa

4. Blast extracted regions to X. tropicalis reference genome

First, I download the X. tropicalis genome with the following command:

rsync --copy-links --recursive --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/004/195/GCA_000004195.4_UCB_Xtro_10.0/GCA_000004195.4_UCB_Xtro_10.0_genomic.fna.gz  ./3_blast/3_1_xtrop/

I then create a database

conda activate blast-2.11.0

makeblastdb -in GCA_000004195.4_UCB_Xtro_10.0_genomic.fna -title Xtro10 -dbtype nucl -out Xtro10 -parse_seqids

And finally query the 4kb windows I extracted from the hourglass tree frog genome onto the Xenopus genome.

for assembly in BufBuf BufGar LepFal PlaOrn RhiMar
do
  blastn -query 2_blast/2_2_matches/16_4_${assembly}_matches.fa -db 3_blast_XenLaevis/3_1_xtrop/Xtro10 -out 3_blast_XenLaevis/16_5_${assembly}2XenTrop.txt -evalue 1e-15 -outfmt 6
done

Now, before we merge the matches between the intermediate genomes and Xenopus, let’s retain only the best match if the difference is 5 orders of magnitude

Buf2Xen <- read_delim("../4_blast/2_blast/16_5_BufBuf2XenTrop.txt",
                          delim = "\t",
                          col_names = c("queryID","searchID",
                                        "Perc","Length",
                                        "Diff","Gaps",
                                        "qStart","qEnd",
                                        "sstart","send",
                                        "Evalue","score"))

Gar2Xen <- read_delim("../4_blast/2_blast/16_5_BufGar2XenTrop.txt",
                          delim = "\t",
                          col_names = c("queryID","searchID",
                                        "Perc","Length",
                                        "Diff","Gaps",
                                        "qStart","qEnd",
                                        "sstart","send",
                                        "Evalue","score"))

Lep2Xen <- read_delim("../4_blast/2_blast/16_5_LepFal2XenTrop.txt",
                          delim = "\t",
                          col_names = c("queryID","searchID",
                                        "Perc","Length",
                                        "Diff","Gaps",
                                        "qStart","qEnd",
                                        "sstart","send",
                                        "Evalue","score"))

Pla2Xen <- read_delim("../4_blast/2_blast/16_5_PlaOrn2XenTrop.txt",
                          delim = "\t",
                          col_names = c("queryID","searchID",
                                        "Perc","Length",
                                        "Diff","Gaps",
                                        "qStart","qEnd",
                                        "sstart","send",
                                        "Evalue","score"))

Rhi2Xen <- read_delim("../4_blast/2_blast/16_5_RhiMar2XenTrop.txt",
                          delim = "\t",
                          col_names = c("queryID","searchID",
                                        "Perc","Length",
                                        "Diff","Gaps",
                                        "qStart","qEnd",
                                        "sstart","send",
                                        "Evalue","score"))



my.xen.list <- list(Buf2Xen, Gar2Xen, Lep2Xen, Pla2Xen, Rhi2Xen)
names(my.xen.list) <- c("Bufo","Gargarizans","Leptodactylus","Platyplectrum","Rhinella")
rm(Buf2Xen,Gar2Xen,Lep2Xen,Pla2Xen,Rhi2Xen)

lapply(my.xen.list, function(x){
  x %>% group_by(queryID) %>%
    summarise(firstE = min(Evalue),
              scndE = nth(Evalue, 2, order_by = Evalue)) %>%
    drop_na()
})

## $Bufo
## # A tibble: 8 × 3
##   queryID                       firstE     scndE
##   <chr>                          <dbl>     <dbl>
## 1 SUPER_1:709086661-709090729 0        0        
## 2 SUPER_2:200820038-200824092 0        0        
## 3 SUPER_2:645951670-645955731 0        0        
## 4 SUPER_3:147300479-147304544 3.11e-67 4.02e- 66
## 5 SUPER_3:366739305-366743374 0        1.63e-164
## 6 SUPER_3:41116371-41120437   0        0        
## 7 SUPER_4:536096509-536100577 0        7.03e- 34
## 8 SUPER_5:378670322-378674388 0        0        
## 
## $Gargarizans
## # A tibble: 6 × 3
##   queryID                          firstE    scndE
##   <chr>                             <dbl>    <dbl>
## 1 CM026465.1:186712537-186716598 0        0       
## 2 CM026465.1:316786201-316790247 0        0       
## 3 CM026467.1:518606935-518611000 6.41e-99 1.39e-95
## 4 CM026468.1:462842942-462847010 0        1.41e-85
## 5 CM026470.1:387945366-387949426 0        0       
## 6 CM026473.1:186541833-186545901 0        0       
## 
## $Leptodactylus
## # A tibble: 8 × 3
##   queryID                                                   firstE     scndE
##   <chr>                                                      <dbl>     <dbl>
## 1 ENA|CAMRHE010000021|CAMRHE010000021.1:10133088-10137157 0        0        
## 2 ENA|CAMRHE010000062|CAMRHE010000062.1:7623445-7627507   2.42e-63 9.28e- 18
## 3 ENA|CAMRHE010000209|CAMRHE010000209.1:2848462-2852526   2.53e-33 9.15e- 28
## 4 ENA|CAMRHE010000302|CAMRHE010000302.1:700758-704826     2   e-19 2   e- 19
## 5 ENA|CAMRHE010000381|CAMRHE010000381.1:852292-856359     1.07e-96 8.59e- 73
## 6 ENA|CAMRHE010000591|CAMRHE010000591.1:277239-281292     0        0        
## 7 ENA|CAMRHE010000648|CAMRHE010000648.1:397223-401288     0        1.23e-180
## 8 ENA|CAMRHE010000802|CAMRHE010000802.1:280969-285021     3.3 e-22 1.54e- 20
## 
## $Platyplectrum
## # A tibble: 0 × 3
## # … with 3 variables: queryID <chr>, firstE <dbl>, scndE <dbl>
## 
## $Rhinella
## # A tibble: 1 × 3
##   queryID                      firstE    scndE
##   <chr>                         <dbl>    <dbl>
## 1 ONZH01010706.1:18505-22573 2.55e-28 1.98e-24

Once again, I inspected them visuall in excel to assess which matches were 4 order than magnitude different from the second best match. Below is a list of all queryID that DID NOT match the criteria (i.e., first and second match less than 4 order of magnitude different).

buf2remove <- c("SUPER_1:709086661-709090729","SUPER_2:200820038-200824092",
                "SUPER_2:645951670-645955731","SUPER_3:147300479-147304544",
                "SUPER_3:41116371-41120437","SUPER_5:378670322-378674388")
gar2remove <- c("CM026465.1:186712537-186716598","CM026465.1:316786201-316790247",
                "CM026470.1:387945366-387949426","CM026473.1:186541833-186545901")
lep2remove <- c("ENA|CAMRHE010000021|CAMRHE010000021.1:10133088-10137157",
                "ENA|CAMRHE010000302|CAMRHE010000302.1:700758-704826",
                "ENA|CAMRHE010000591|CAMRHE010000591.1:277239-281292",
                "ENA|CAMRHE010000802|CAMRHE010000802.1:280969-285021")
rhi2remove <- c("ONZH01010706.1:18505-22573")

So let’s remove those, and for the remanining matches retain the top one only

my.xen.list.2 <- my.xen.list

#we filter to retain only those we selected as confident matches, and then retain the first hit by e-value
my.xen.list.2$Bufo <- my.xen.list$Bufo %>%
  filter(!queryID %in% buf2remove) %>%
  group_by(queryID) %>%
  slice_min(order_by = Evalue, n=1,with_ties = FALSE)

my.xen.list.2$Gargarizans <- my.xen.list$Gargarizans %>%
  filter(!queryID %in% gar2remove) %>%
  group_by(queryID) %>%
  slice_min(order_by = Evalue, n=1,with_ties = FALSE)

my.xen.list.2$Leptodactylus <- my.xen.list$Leptodactylus %>%
  filter(!queryID %in% lep2remove) %>%
  group_by(queryID) %>%
  slice_min(order_by = Evalue, n=1,with_ties = FALSE)

my.xen.list.2$Rhinella <- my.xen.list$Rhinella %>%
  filter(!queryID %in% rhi2remove) %>%
  group_by(queryID) %>%
  slice_min(order_by = Evalue, n=1,with_ties = FALSE)

Let’s save as text files

write_delim(my.xen.list.2$Bufo,
            file = "../4_blast/3_blast_XenLaevis/16_6_BufBuf2XenTrop.txt")
write_delim(my.xen.list.2$Gargarizans,
            file = "../4_blast/3_blast_XenLaevis/16_6_BufGar2XenTrop.txt")
write_delim(my.xen.list.2$Leptodactylus,
            file = "../4_blast/3_blast_XenLaevis/16_6_LepFal2XenTrop.txt")
write_delim(my.xen.list.2$Platyplectrum,
            file = "../4_blast/3_blast_XenLaevis/16_6_PlaOrn2XenTrop.txt")
write_delim(my.xen.list.2$Rhinella,
            file = "../4_blast/3_blast_XenLaevis/16_6_RhiMar2XenTrop.txt")

Now we have all the info we need. Let’s clear the R environment, and then assess to what xenopus chromosomes our linkage groups match to.

5. Match Linkage groups to chromosomes

rm(list = ls())

Read in matches between L. serrata rad tags and intermediate assemblies assembly

my.Buf <- read_delim("../4_blast/2_blast/16_3_BufBuf_matches.txt", delim = "\t") %>%
  add_column(Comparison="Bufo")
my.Gar <- read_delim("../4_blast/2_blast/16_3_BufGar_matches.txt", delim = "\t") %>%
  add_column(Comparison="Gargarizans")
my.Lep <- read_delim("../4_blast/2_blast/16_3_LepFal_matches.txt", delim = "\t") %>%
  add_column(Comparison="Leptodactylus")
my.Pla <- read_delim("../4_blast/2_blast/16_3_PlaOrn_matches.txt", delim = "\t") %>%
  add_column(Comparison="Platyplectrum")
my.Rhi <- read_delim("../4_blast/2_blast/16_3_RhiMar_matches.txt", delim = "\t") %>%
  add_column(Comparison="Rhinella")

my.df <- rbind(my.Buf,my.Gar,my.Lep,my.Pla,my.Rhi)
rm(my.Buf,my.Gar,my.Lep,my.Pla,my.Rhi)

Read in matches between intermediate assemblies and Xenopus assembly

Buf2Xen <- read_delim("../4_blast/3_blast_XenLaevis/16_6_BufBuf2XenTrop.txt",
                          delim = " ") %>% add_column(Comparison="Bufo") %>%
  rename(queryID2=queryID)

Gar2Xen <- read_delim("../4_blast/3_blast_XenLaevis/16_6_BufGar2XenTrop.txt",
                          delim = " ") %>% add_column(Comparison="Gargarizans") %>%
  rename(queryID2=queryID)

Lep2Xen <- read_delim("../4_blast/3_blast_XenLaevis/16_6_LepFal2XenTrop.txt",
                          delim = " ") %>% add_column(Comparison="Leptodactylus") %>%
  rename(queryID2=queryID)

Pla2Xen <- read_delim("../4_blast/3_blast_XenLaevis/16_6_PlaOrn2XenTrop.txt",
                          delim = " ") %>% add_column(Comparison="Platyplectrum") %>%
  rename(queryID2=queryID)

Rhi2Xen <- read_delim("../4_blast/3_blast_XenLaevis/16_6_RhiMar2XenTrop.txt",
                          delim = " ") %>% add_column(Comparison="Rhinella") %>%
  rename(queryID2=queryID)



my.df2 <- rbind(Buf2Xen,Gar2Xen,Lep2Xen,Pla2Xen,Rhi2Xen)
rm(Buf2Xen,Gar2Xen,Lep2Xen,Pla2Xen,Rhi2Xen)

Now add a column that will match the search sequence id in the Intermediate assembly to Xenopus comparison:

my.df <- my.df %>%
  mutate(queryID2 = paste0(searchID,":",sstart3,"-",send3))

Now we can merge with the matches to the Xenopus genome by queryID

my.df3 <- left_join(my.df2, my.df, by="queryID2",
                  suffix=c(".Xen",".Int"))

my.df3 <- my.df3 %>%
  select(queryID2,searchID.Xen,
         LG_locus, queryID, searchID.Int,
         Comparison.Xen, Comparison.Int)

Let’s visualise matches between our linkage map and the Xenopus tropicalis chromosome level assembly.

qpvt(my.df3, "searchID.Xen", "LG_locus", "n()")

##                 1  2  3  4  5  6  9  10  11  Total  
## AAMC04000030.1                    1              1  
## CM004443.2         1           4                 5  
## CM004444.2      4     1                          5  
## CM004445.2            2                   1      3  
## CM004446.2      1           5                    6  
## CM004447.2         8           1                 9  
## CM004448.2               5                       5  
## CM004449.2      1                                1  
## CM004451.2                            2          2  
## Total           6  9  3  5  5  5  1   2   1     37

There seems to be some synteny. But careful! Some of these values are inflated as the same locus matched to multiple intermediate assemblies and thus now shows as having more matches to the Xenopus assembly.

Let’s filter to retain one hit per locus (multiple hits per locus are concordant). Note, I checked that when having multiple matches per locus, they all mapped to the same X. tropicalis chromosome.

my.df4 <- my.df3 %>% distinct(queryID, .keep_all = TRUE)

let’s re-run it now to check the real unique matches

qpvt(my.df4, "searchID.Xen", "LG_locus", "n()")

##                 1  2  3  4  5  6  9  10  11  Total  
## AAMC04000030.1                    1              1  
## CM004443.2         1           1                 2  
## CM004444.2      3     1                          4  
## CM004445.2            2                   1      3  
## CM004446.2      1           2                    3  
## CM004447.2         3           1                 4  
## CM004448.2               2                       2  
## CM004449.2      1                                1  
## CM004451.2                            1          1  
## Total           5  4  3  2  2  2  1   1   1     21

6. Summary

It does appear that there is some level of synteny between the linkage maps produced here and the 6 genome assemblies we investigated (Bufo bufo, Bufo gargarizans, Leptodactylus fallax, Platyplectrum ornatum, Rhinella marina and Xenopus tropicalis).

Unfortunately against the Xenopus chromosome level assembly we only have 21 unique hits, which, spread across 13 linkage groups and with possible re-arrangements and/or errors, make it hard to confirm whether the observed synteny represents a true signal (e.g., LG2 in our map is homologous to Chromosome CM004447 in Xenopus) or whether we obtained three hits for that linkage group by chance.