Phylogenetic reconciliation supports a methanogenic ancestor of the Archaea and a derived origin for host-associated lineages

Contributors
  • Alexandros Stamatakis
  • Gergely Szöllősi
  • Tom Williams
  • Anja Spang
Publication date 15-10-2025
Description
The phylogeny of the Archaea continues to be revisited and revised as new groups are discovered and phylogenetic methods improve, but key questions about their early evolution remain. It has been suggested that the root of the Archaea may lie on, or potentially within, any of three major groups: the Euryarchaeota, the TACK+Asgard clade, and DPANN, the last of which includes many host-associated and genome-reduced lineages. These root hypotheses make starkly different predictions about the nature of early archaeal evolution: for example, a root on or within DPANN might suggest a small-genome and host-associated ancestor, with methanogenesis, the hallmark metabolism of the Archaea, evolving later. Here, we investigate the position of the archaeal root and the nature of the last archaeal common ancestor using a range of phylogenetic approaches, including the best available site- and branch-heterogeneous substitution models, and new gene tree-species tree reconciliation models that capture changes in rates of gene duplication, loss and transfer across the phylogeny. Our analyses converge on a narrow archaeal root region at or near the base of the Euryarchaeota, supporting hypotheses in which the Last Archaeal Common Ancestor (LACA) was a complex, free-living (hyper-)thermophilic methanogen. We recover DPANN as the sister group to TACK and Asgard archaea, and suggest that their genome evolution has been characterised by episodes of genome streamlining and expansion, driven by gene loss and transfer.

Repository Contents

0_scripts: custom Python/bash/Perl scripts used in this project.

1_taxa_set: predicted protein sequences and hmmsearch results necessary for inferring the single-gene trees. An annotation workflow is available from GitHub (https://github.com/ndombrowski/Genome_annotations).
  • faa: predicted protein sequences of our focal set of 257 taxa, and of the initial set with an additional 256 DPANN genomes.
  • arcogs_hmmer_raw: hmmsearch results for the initial set against the arCOG database, used to generate single-gene trees for reconciliation.
  • arCOG_DB: the arCOG HMM database.

2_species_tree: all the inferred species trees and the results of the rooting analyses in this project.
  • 1_eLifeMarkers_outgroup
      • 0_faa: alignments and trimmed alignments for Archaea and Bacteria used in the outgroup rooting.
      • 1_constraints: constraint tree topologies used for searching a maximum-likelihood constrained topology.
      • 2_sitelhs: site-wise likelihoods of unconstrained and constrained tree searches using the LG+C60+F+R model.
      • 3_treefile: the best-known maximum-likelihood tree under the LG+C60+F+R model. This tree is visualised in Fig. S1.
      • 4_qmd: the commands used in inferring the species tree, performing constrained species tree searching, and the AU test.
  • 2_UndinMarkers
      • 0_faa: alignments and trimmed alignments for Archaea used to infer an unrooted archaeal species tree.
      • 1_species_tree_LG_C60_F_R: the best-known maximum-likelihood tree under the LG+C60+F+R model using 45 Undin markers. This tree is visualised in Fig. 1. The chi2 folder contains treated alignments, which were generated by removing the top 10%-80% compositionally biased sites using the chi-square test. These trees are visualised in Fig. S2.
      • 2_non-reversible: the inferred best-known maximum-likelihood species tree under the NONREV+G model, with rootstrap support, and the approximately unbiased (AU) test results for every branch of the ML tree. These results are associated with Fig. S4 and Table S6.
      • 3_GFmix: the re-rooted species trees corresponding to the 15 test root positions (selected from the literature and between major groups of archaea), site-wise likelihoods under the GFmix model, inputs for the GFmix model, and the puzzle file for the AU test.
      • 4_qmd: the commands used in inferring the species tree, performing constrained species tree searching, and the AU test.
  • 3_Baker_et_al_2025: the files re-analysed from Baker et al. 2025 with regard to the placement of Altiarchaeota and the rooting analyses.
      • 0_Annotation
          • NM126_org: the original file downloaded from 10.6084/m9.figshare.26798065.v1.
          • NM126_renamed: faa files with the marker name added to the sequence header.
      • 1_Bacteria_Inspection
          • 0_faa_Bacteria_Annotation: the faa files and the annotation for the bacterial homologs of the NM126 marker genes.
          • 1_add_bacterial_homologs_to_NM126: the alignments, trimmed alignments, tree files and PDFs generated for assessing archaeal domain monophyly.
      • 2_NM126_ranking: alignments, trimmed alignments, and single-gene trees inferred under the best-fitting models using data from NM126_org. best_fitting_trees_renamed_for_ranking includes the renamed single-gene trees for marker ranking based on the GTDB class level. These results are reported in Table S5.
      • 3_Concatenation: concatenations and inferred phylogenies of Markers_that_fit_archaeal_monophyly, Markers_that_violate_archaeal_monophyly, 50%_top_ranked markers and 50%_bottom_ranked markers. These are reported in Fig. S3. GFmix_based_on_original_Data includes rooted species trees, the '.iqtree' and '.phy' files necessary for the GFmix model, and the site-wise likelihood outputs from the GFmix model. These are reported in Table S8.
      • 4_qmd: code used for checking redundancy, generating a single-gene tree for visualisation, ranking markers, and example commands to run GFmix. FileList includes blast output for checking redundancy in the markers.

3. Reconciliation: the species tree, workflow for inferring single-gene trees, predicted protein sequences for inferring single-gene trees, UFBOOT trees/CCP files, model parameters, and per-family likelihoods for reconciliations under different reconciliation models, for investigating root placements.
  • This_study: data generated in this study based on the focal 257-taxon dataset.
      • 0_protein_seqs_initial: predicted protein sequences after separating fused arCOG domains, gappy sequences and long-branching sequences from the initial 513-taxon set.
      • 1_protein_seqs: predicted protein sequences downsampled from the initial 513-taxon set to the focal 257-taxon dataset. Alignments and trimmed alignments are also included.
      • 2_phylogeny: guide trees inferred under the LG+G model, the best-fitting models found for these families, and the best-known maximum-likelihood trees inferred with the PMSF approximation.
      • 3_UFBOOTS: the ultrafast bootstrap single-gene tree distributions inferred under the best-fitting models. These are the inputs for the reconciliation models.
      • 4_species_tree: 15 differently rooted species trees.
      • 5_reconciliation models: AleRax output (species tree, per-family likelihoods, and optimised model parameters) under different models, i.e., global, DPANN_L, DPANN_DL, DPANN_TL, DPANN_DTL, DTL_br1, DTL_br1_O, DTL_br2 and family-wise. The fraction_missing file is provided. Model parameters used to parameterise different branches are provided in the folder model_parameters. These data are reported in Tables S9, S10, S11 and S12 as well as Fig. S5.
      • 6_qmd: workflows for generating ultrafast bootstrap single-gene trees and reconciliation models.
  • Williams_et_al_2017: conditional clade probability (CCP) files and differently rooted phylogenies, which are reanalysed here using various reconciliation models.
      • reconciliation models: global, branch-wise and family-wise reconciliation models fitted to the 60-genome taxon set. Model parameter files are provided in model_parameters. These data are reported in Tables S13 and S14.

4. Ancestral reconstruction: reconciliation results, model parameters, and event counts for ancestral reconstructions. These data are reported in Tables S15-S22, S24-S27, and S29-S30.
  • Euryarchaeota_root
      • model_parameters: optimised model parameters for ancestral reconstruction under the Euryarchaeota root.
      • reconciliations
          • perspecies_eventcount_per_family_summary.tsv.gz: summary of reconciliation events for all families.
          • perspecies_eventcount.txt: sum of different reconciliation events by branches.
          • transfers_per_family_summary.tsv.gz: summary of transfer frequency from donor to recipient for each family.
      • species_trees
          • starting_species_tree.newick: species tree in Newick format. The nodes of this tree are labelled and can be used to link the node names provided in Tables S19, S23, S28, S29, and S30. This tree is also used in Fig. 2, Fig. S7 and Fig. S19A.
  • MHH_root
      • model_parameters: optimised model parameters for ancestral reconstruction under the MHH root.
      • reconciliations
          • perspecies_eventcount_per_family_summary.tsv.gz: summary of reconciliation events for all families.
          • perspecies_eventcount.txt: sum of different reconciliation events by branches.
          • transfers_per_family_summary.tsv.gz: summary of transfer frequency from donor to recipient for each family.
      • species_trees
          • starting_species_tree.newick: species tree in Newick format. The nodes of this tree are labelled and can be used to link the node names provided in Tables S19, S23, S28, S29, and S30. This tree is also used in Fig. 2, Fig. S8, and Fig. S19B.
  • DPANN_root
      • model_parameters: optimised model parameters for ancestral reconstruction under the DPANN root.
      • reconciliations
          • perspecies_eventcount_per_family_summary.tsv.gz: summary of reconciliation events for all families.
          • perspecies_eventcount.txt: sum of different reconciliation events by branches.
          • transfers_per_family_summary.tsv.gz: summary of transfer frequency from donor to recipient for each family.
      • species_trees
          • starting_species_tree.newick: species tree in Newick format. The nodes of this tree are labelled and can be used to link the data in perspecies_eventcount_per_family_summary.tsv.gz, which is used for comparison with the last archaeal common ancestor and the Euryarchaeota root, as well as MHH_root. A table is provided here that contains the presence probabilities of nodes of interest used in Figs. S15 and S16.
  • TACKA+Asgard_root
      • model_parameters: optimised model parameters for ancestral reconstruction under the TACK+Asgard root.
      • reconciliations
          • perspecies_eventcount_per_family_summary.tsv.gz: summary of reconciliation events for all families.
          • perspecies_eventcount.txt: sum of different reconciliation events by branches.
          • transfers_per_family_summary.tsv.gz: summary of transfer frequency from donor to recipient for each family.
      • species_trees
          • starting_species_tree.newick: species tree in Newick format. The nodes of this tree are labelled and can be used to link the data in perspecies_eventcount_per_family_summary.tsv.gz, which is used for comparison with the last archaeal common ancestor and the Euryarchaeota root, as well as MHH_root. A table is provided here that contains the presence probabilities of nodes of interest used in Figs. S15 and S16.
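
The ancestral reconstruction folders pair gzip-compressed, tab-separated event summaries with a labelled Newick species tree. The following is a minimal Python sketch of how those two file types might be loaded so that node labels can be matched against the tables. It is not part of the deposited scripts. It assumes that the first row of each .tsv.gz file is a header and that node labels in the Newick file are unquoted; the column names of the deposited files are not documented here, so none are hard-coded. The helper names read_tsv_gz and newick_labels are hypothetical.

```python
import csv
import gzip
import re


def read_tsv_gz(path):
    """Return (header, rows) from a gzip-compressed, tab-separated file.

    Assumption: the first row is a header row.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = list(reader)
    return header, rows


def newick_labels(newick):
    """Split the labels of a Newick string into (leaves, internal_nodes).

    Assumption: labels are unquoted and contain no parentheses, commas,
    colons, semicolons or whitespace.
    """
    # A leaf label directly follows "(" or ",".
    leaves = re.findall(r"[(,]\s*([^():,;\s]+)", newick)
    # An internal-node label directly follows ")".
    internals = re.findall(r"\)\s*([^():,;\s]+)", newick)
    return leaves, internals


# Toy tree for illustration only; it is not taken from the dataset.
toy_tree = "((A:1,B:1)n1:0.5,C:1)root;"
leaves, internals = newick_labels(toy_tree)
print(leaves, internals)  # -> ['A', 'B', 'C'] ['n1', 'root']
```

With the real files, read_tsv_gz could be pointed at perspecies_eventcount_per_family_summary.tsv.gz and the returned header inspected first, before rows are filtered by the node labels extracted from starting_species_tree.newick.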
Publisher Zenodo
Organisations
  • Faculty of Science (FNWI) - Institute for Biodiversity and Ecosystem Dynamics (IBED)
Document type Dataset
DOI https://doi.org/10.5281/zenodo.17360805