Adding Your Databases to DRAM2
DBKits are how DRAM2 interacts with databases and with the data generated by searching genome collections against them. Here you will learn how to create one.
In DRAM2 you can add DBKits like plugins if you know a small amount of Python and the basic principles of object-oriented programming. Even if you don’t meet those criteria, you may still succeed in following this protocol to implement a new database. All you have to do is go through the following steps and examples, making a few simple choices, and you should get a workable new database.
In this example we are going to use the CANT-HYD database. CANT-HYD is fully integrated into DRAM2, but let’s pretend it is not and implement it as a plugin. This plugin will work almost exactly like the built-in version; the only difference is that it can be installed separately.
The full code for this example lives in the DRAM2 repository here. You can reference its structure and may want to copy it and edit it in order to create your own DBKit.
Follow these steps
1. Look at your database and decide how it will be searched and what you need from it.
In DRAM2 we mostly use two tools to search genes against databases: HMMER and MMseqs2. HMMER is usually used for HMM files, and MMseqs2 is usually used for BLAST-style searches of FASTA files. Depending on the data that you want to use, you will need slightly different tools.
In our example, CANT-HYD has both HMMs and FASTAs. We are using both, but we will make two separate DBKits, one for each. We could make a single one, but this way is cleaner and helps to optimize DRAM2 both internally and externally.
Having selected your data types, and thus which search you will use, you will need to make a list of the files you need and where they can be obtained.
If you are planning to use HMMER and HMMs you need:
- All the HMMs in question
- Optionally, cutoffs or other metadata
If you are planning to use MMseqs2 and amino acid FASTAs you need:
- All the FAA files in question
- Optionally, cutoffs or other metadata
So it is not complicated. Keep all this data in mind for the next step, and keep a document with links to the data. If you are making your own datasets, you probably want to host them on GitHub or Zenodo so you can version them.
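The choice in this step can be written down as a small lookup. The following is only a sketch: the suffix table, the dictionary names, and the `plan_search` helper are hypothetical and are not part of DRAM2. It maps a data file to the search style described above and lists what you would need to collect for it.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the two search styles used here.
SUFFIX_TO_SEARCH_TYPE = {
    ".hmm": "hmm_style",    # profile HMMs, usually searched with HMMER
    ".faa": "blast_style",  # amino acid FASTA, usually searched with MMseqs2
    ".fasta": "blast_style",
    ".fa": "blast_style",
}

# What each search style needs, following the two lists above.
NEEDED_INPUTS = {
    "hmm_style": {
        "required": ["all the HMMs in question"],
        "optional": ["cutoffs or other metadata"],
    },
    "blast_style": {
        "required": ["all the FAA files in question"],
        "optional": ["cutoffs or other metadata"],
    },
}


def plan_search(data_file: str) -> dict:
    """Return the search style and the inputs to collect for one data file."""
    suffix = Path(data_file).suffix.lower()
    if suffix not in SUFFIX_TO_SEARCH_TYPE:
        raise ValueError(f"No search style known for suffix {suffix!r}")
    search_type = SUFFIX_TO_SEARCH_TYPE[suffix]
    return {"file": data_file, "search_type": search_type, **NEEDED_INPUTS[search_type]}
```

With this sketch, a file named `CANT-HYD.hmm` (an assumed file name) is planned as `hmm_style`, and `CANT-HYD.faa` as `blast_style`. Compressed files such as `.hmm.gz` would need their own entries in the table.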
2. Make, or copy, a template of the plugin structure.
The DBKits in DRAM2 are implemented as namespace packages in Python. You don’t need to know what that actually means; functionally, it means that your DBKit is a combination of a file and a folder with very specific requirements for naming and formatting. Your DBKit directory must have the structure below or it will not work with DRAM2. Places where some customization is allowed are marked with “<>”.
The layout is:
<some dir where you are working>/
    setup.py
    dram2/
        db_kits/
            <unique_name_only_letters_and_underscores>/
                __init__.py
            <another_db_kit>/
                __init__.py
It is key that you don’t put any files in the dram2 folder other than what is shown. You can have as many DBKits in one directory as you want, but they all need a unique name. Names should be relevant to the databases and contain only letters and “_”. Ideally names should end in “_kit”, for example “cool_db_kit”.
We are adding two DBKits, so our structure is:
example_dbkit_plugin/
    setup.py
    dram2/
        db_kits/
            cant_hyd_hmm/
                __init__.py
            cant_hyd_blast/
                __init__.py
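If you prefer to generate this skeleton rather than create it by hand, a short script can do it. This is a minimal sketch using only the standard library; the `make_dbkit_skeleton` helper is hypothetical and not a DRAM2 tool. It leaves `setup.py` and the `__init__.py` files empty and deliberately puts no `__init__.py` in `dram2/` or `db_kits/`, in line with the rule that nothing else goes in the dram2 folder.

```python
import tempfile
from pathlib import Path


def make_dbkit_skeleton(root: Path, kit_names: list[str]) -> list[Path]:
    """Create the plugin layout under root and return the __init__.py paths."""
    root.mkdir(parents=True, exist_ok=True)
    (root / "setup.py").touch()  # left empty here; fill it in later
    created = []
    for kit in kit_names:
        # Kit names may contain only letters and underscores.
        if not (kit.isascii() and kit.replace("_", "").isalpha()):
            raise ValueError(f"Invalid DBKit name: {kit!r}")
        kit_dir = root / "dram2" / "db_kits" / kit
        kit_dir.mkdir(parents=True, exist_ok=True)
        init_file = kit_dir / "__init__.py"
        init_file.touch()
        created.append(init_file)
    return created


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    made = make_dbkit_skeleton(
        Path(tmp) / "example_dbkit_plugin", ["cant_hyd_hmm", "cant_hyd_blast"]
    )
    relative = sorted(p.relative_to(tmp).as_posix() for p in made)
```

To build the real skeleton, call the function with `Path("example_dbkit_plugin")` in your working directory instead of a temporary one.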
Now we can start placing code in the __init__.py files.
2. Make a class that inherits from the abstract DBKit class with all the defaults.
The code should look like this:
from os import path, stat
from shutil import move, rmtree
from pathlib import Path
import logging

import pandas as pd

from dram2.db_kits.utils import DBKit
from dram2.utils import download_file, run_process, Fasta


class MyKit(DBKit):
    """A tool to implement a database in DRAM"""

    name: str = ""  # the name as used by DRAM, all lowercase letters and _
    formal_name: str = ""  # the actual name, any ASCII character you want
    version: str = ""  # the version as a string
    citation: str = ""  # formatted citation in ASCII
    search_type: str = ""  # describe the type: hmm_style or blast_style
    has_genome_summary: bool = True  # whether a genome summary is associated
    location_dictionary: dict = {}

    def download(self):
        """Download your raw data, return a full location dictionary."""
        pass

    def setup(self) -> dict:
        """Do whatever you need to process the data at the locations provided."""
        pass

    def get_genome_summary(self) -> Path:
        """Get the ids from a complete annotations pandas DataFrame."""
        pass

    def search(self, fasta: Fasta) -> pd.DataFrame | pd.Series:
        """Perform a search, be that HMM, Blast or something else."""
        pass

    def load_dram_config(self):
        """Extract data from the larger DRAM config yaml"""
        pass
In the code above we lay out the bare bones of a new DBKit, and there are a few things to note. First, we import tools from the dram2 package, so this package depends on DRAM2 being installed; we will see more of that later. Second, the class inherits from the DBKit class, so it will have that class’s tools. If you remove any of the methods, this class will not work, because these methods are enforced by the abstract class.
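To see what “enforced by the abstract class” means in practice, here is a self-contained sketch built on Python’s standard `abc` module. `DBKitSketch` is a stand-in written for this illustration, with only three of the methods; it is not the real DBKit class, and it is an assumption that DRAM2 enforces its methods in the same way.

```python
from abc import ABC, abstractmethod


class DBKitSketch(ABC):
    """Stand-in for DRAM2's DBKit, for illustration only."""

    @abstractmethod
    def download(self): ...

    @abstractmethod
    def setup(self): ...

    @abstractmethod
    def search(self, fasta): ...


class IncompleteKit(DBKitSketch):
    """Defines download and setup but forgets search."""

    def download(self):
        return {}

    def setup(self):
        return {}


class CompleteKit(IncompleteKit):
    """Adds the missing method, so it can be instantiated."""

    def search(self, fasta):
        return []


# Python refuses to create an object while an abstract method is missing.
try:
    IncompleteKit()
    error_message = ""
except TypeError as error:
    error_message = str(error)

complete = CompleteKit()  # works: every abstract method is defined
```

The `TypeError` message names the missing method, which is usually the fastest way to find out what your own kit still lacks.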
3. Fill in the class variables.
You can see at the top of the class that there are some class variables that need values. If you understand your database, filling them in should be easy.
In the case of the CANT-HYD HMM it looks like:
In the case of the CANT-HYD BLAST it looks like:
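The DRAM2 repository holds the actual values. As a minimal sketch only, the two sets of class variables might be filled in along these lines. `DBKitSketch` is a stand-in so the example runs without DRAM2, the formal names are guesses, and the version and citation strings are placeholders to be replaced with the real ones. `has_genome_summary` is left at the template default here; set it to match your database.

```python
class DBKitSketch:
    """Stand-in for DRAM2's DBKit holding only the template's class variables."""

    name: str = ""
    formal_name: str = ""
    version: str = ""
    citation: str = ""
    search_type: str = ""
    has_genome_summary: bool = True


class CantHydHmmKit(DBKitSketch):
    name = "cant_hyd_hmm"  # matches the directory name used earlier
    formal_name = "CANT-HYD HMM"  # assumed display name
    version = "<version of the data you downloaded>"  # placeholder
    citation = "<formatted CANT-HYD citation>"  # placeholder
    search_type = "hmm_style"


class CantHydBlastKit(DBKitSketch):
    name = "cant_hyd_blast"  # matches the directory name used earlier
    formal_name = "CANT-HYD BLAST"  # assumed display name
    version = "<version of the data you downloaded>"  # placeholder
    citation = "<formatted CANT-HYD citation>"  # placeholder
    search_type = "blast_style"
```

The two kits differ only in their names and in `search_type`, which reflects the earlier decision to build one kit per search style.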
4. Downloading and setting up the data.
The first step the user will take is probably to set up your database, so let’s set up this database. We need to download the data and prepare it for DRAM2’s tools to use, and we have the tools to do so here with the download and setup methods. Both methods are called one after the other when the user executes the dram2 build db command.
location_collumns: list[str] = ["hmms", "faa", "hmm_scores", "faa_scores"]
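The download-then-setup sequence can be sketched without DRAM2 or a network connection. Everything below is hypothetical: `BuildSketchKit` is not the DRAM2 API, the “download” just writes empty placeholder files, and the “setup” just renames them. The point is only the order of the two calls and the location dictionary passed between them, using the keys listed in location_collumns above.

```python
import tempfile
from pathlib import Path

LOCATION_KEYS = ["hmms", "faa", "hmm_scores", "faa_scores"]


class BuildSketchKit:
    """Hypothetical kit that mimics the download and setup steps."""

    def __init__(self, work_dir: Path):
        self.work_dir = work_dir
        self.location_dictionary: dict[str, Path] = {}

    def download(self) -> dict[str, Path]:
        """Pretend to download: create one empty raw file per key."""
        locations = {}
        for key in LOCATION_KEYS:
            target = self.work_dir / f"{key}.raw"
            target.write_text("")  # a real kit would fetch data here
            locations[key] = target
        self.location_dictionary = locations
        return locations

    def setup(self) -> dict[str, Path]:
        """Pretend to process: rename each raw file to its final name."""
        processed = {}
        for key, raw in self.location_dictionary.items():
            processed[key] = raw.rename(raw.with_suffix(".ready"))
        self.location_dictionary = processed
        return processed


# The two methods run one after the other, as during a database build.
with tempfile.TemporaryDirectory() as tmp:
    kit = BuildSketchKit(Path(tmp))
    kit.download()
    final_locations = kit.setup()
    ready_keys = sorted(final_locations)
    all_exist = all(p.exists() for p in final_locations.values())
```

In a real kit, download would fetch the files you listed in step 1, and setup would do whatever processing your search tool needs before the first search.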