VEuPathDB User stories

This topic is a place to collect “user stories” for analyses that used to be done at VEuPathDB, so that we can construct similar tools/workflows at BRC Analytics. Please reply to this post to add yours.

Here’s a workflow that came from an email conversation with a user:

For me, using PlasmoDB or PiroplasmaDB, when I look at a gene of interest I would go about it in the following way:

  1. Look at the genomic location (subtelomeric, which other genes are neighbours)
  2. Look at orthologues within the genus / across the piroplasms and whether these genes are syntenic or not.
  3. Look at a subset of the transcriptomic datasets (especially for Pf, where there were a lot of different datasets) in order to figure out whether the transcription profile in blood stages is what I would expect for an invasion-related gene.
  4. Look for SNPs, synonymous vs non-synonymous
  5. Look at phenotype data (PlasmoGEM or piggyBac data) to see whether this gene could be knocked out directly or not, as I am looking for essential invasion-related genes.
  6. Look at the AlphaFold link-out to see whether the structure is predicted to a high degree of confidence, or only the core of the protein.
  7. Look at what domains were identified in the protein (Pfam, InterPro); does it have a predicted SP or TM? Also look at the hydrophilicity/hydrophobicity plot to see whether I think a TM/SP was missed, as these can be atypical in the Apicomplexan phylum.
  8. Look at proteomic data (is there mass spec evidence for expression in schizonts and merozoites, and is this gene only expressed there or also in other infectious stages such as sporozoites?)
  9. Download the sequences (ORF ± 1000 bp; protein sequence)
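
For step 9, here is a minimal Python sketch of what “ORF ± 1000 bp” could mean when working from a downloaded chromosome sequence and gene coordinates. The function names are hypothetical, and the sketch assumes 1-based inclusive coordinates (as in GFF files) and that the flanks should simply be clipped at the chromosome ends. It is not how any particular database implements its sequence download.

```python
# Hypothetical helper names; coordinates are assumed 1-based and inclusive (GFF style).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orf_with_flanks(chrom_seq: str, start: int, end: int, strand: str, flank: int = 1000) -> str:
    """Return the ORF plus `flank` bp on each side, oriented 5'->3' relative to the gene.

    Flanks are clipped at the chromosome ends, which matters for subtelomeric genes.
    """
    if not (1 <= start <= end <= len(chrom_seq)):
        raise ValueError("coordinates fall outside the chromosome")
    left = max(start - 1 - flank, 0)          # convert to a 0-based slice start
    right = min(end + flank, len(chrom_seq))  # slice end is exclusive
    region = chrom_seq[left:right]
    return reverse_complement(region) if strand == "-" else region


if __name__ == "__main__":
    toy_chromosome = "TTTTTATGCCCTAAGGGGG"  # toy sequence, ORF at positions 6..14
    print(orf_with_flanks(toy_chromosome, 6, 14, "+", flank=2))
    print(orf_with_flanks(toy_chromosome, 6, 14, "-", flank=2))
```

For a gene on the minus strand the region is reverse-complemented, so the upstream flank always ends up at the start of the returned sequence.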

My lab works on Plasmodium and Babesia erythrocyte invasion, hence the previous interest in PlasmoDB and PiroplasmaDB. I also used the wider VEuPathDB website to BLAST protein sequences across all pathogens listed on the website.

I hope we can get more examples like this. In any case, this is already great starting material for writing a “how to do this in a new system” section for the brc-analytics website. This will be coming soon.

I am now retired but still sometimes access the database when I get papers to review, normally to find out whether the authors have checked thoroughly what is already known about the genes/proteins they are writing about. For that the user comments are especially useful, since they point out previous publications, some of which wouldn’t necessarily turn up in a PubMed search, including preprints or mentions of a protein in the context of a paper about something else (e.g. X associates with, or is regulated by, Y). It is not unusual to discover that the authors of submitted papers have indeed not done their homework.

Before I retired (30 months ago) we were routinely analysing large transcriptomic and proteomic datasets and needed to find functional relationships in our “hits”, usually proteins detected, differentially regulated mRNAs, and mRNAs bound to proteins. I used the reference T. brucei genome on TriTrypDB as the basis, with 427 or EATRO VSGs added in if appropriate (since we were actually working with those two strains). I started with TriTrypDB annotations and made a large Excel Table so we could all do rapid lookups of everything we wanted. The Table also included numerous raw published datasets from the relevant supplementary Tables (especially TrypTag, various transcriptomes, and the RNAi screen). It was also essential to check the user comments to make sure annotations were suitable. Since normal GO terms are still of limited use for kinetoplastids, I used info from TriTrypDB to assign my own categories for functions. I would regularly update my Table from TriTrypDB but would also upload user comments as appropriate… I also routinely looked at orthologues in other tritryps, at protein domains (especially for proteins of unknown function), and at the transcriptome alignments and histone variants and modifications to understand likely mRNA structures and detect proximity to promoter or termination regions. For BLAST analyses to look for a trypanosome orthologue/homologue of a gene from another species I’d use TriTrypDB, but for sequence retrieval and alignments across evolution we obviously had to look elsewhere.
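
As a rough illustration of the lookup-table idea described above, the Python sketch below joins a list of “hits” to an annotation table and a set of hand-assigned functional categories. Everything in it is an assumption made for illustration: the column names (`gene_id`, `product`), the gene IDs, and the category labels are invented, and the real Table was kept in Excel rather than in code.

```python
import csv
import io


def load_lookup(tsv_text: str, key: str = "gene_id") -> dict:
    """Read a tab-separated annotation export into a dict keyed by gene ID."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[key]: row for row in reader}


def annotate_hits(hit_ids, annotations, custom_categories):
    """Attach the product description and a hand-curated category to each hit.

    Genes missing from either table are kept and flagged rather than dropped,
    so gaps in the lookup table stay visible.
    """
    rows = []
    for gene_id in hit_ids:
        info = annotations.get(gene_id, {})
        rows.append({
            "gene_id": gene_id,
            "product": info.get("product", "not in table"),
            "category": custom_categories.get(gene_id, "unassigned"),
        })
    return rows


if __name__ == "__main__":
    # Invented example data; real exports would have many more columns.
    annotation_tsv = "gene_id\tproduct\nGENE_A\tRNA-binding protein\nGENE_B\thypothetical protein\n"
    categories = {"GENE_A": "mRNA metabolism"}
    for row in annotate_hits(["GENE_A", "GENE_B", "GENE_C"], load_lookup(annotation_tsv), categories):
        print(row)
```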

We used our own pipelines for sequence alignment and analysis because when we started out, we had to. Galaxy and other standard pipelines for eukaryotes were unsuitable since it was impossible to get around Cufflinks (or similar), which eliminated all multicopy genes and so took out most abundant trypanosome housekeeping mRNAs. Galaxy was subsequently adapted, but by that time we had a lot of datasets, which we wanted to be able to compare without having to worry about artefacts caused by varying methodology.
If we wanted to visualise our data we’d learn how to do it, usually in RStudio. We didn’t do network analyses.

All the datasets mentioned above by brc_help are extremely important for our research, and we were heavily dependent on PlasmoDB for them. In addition, tools such as “protein motif identifier” and “protein localization” (MitoProt/ApicoAP) were also very useful. Therefore, I strongly encourage adding these features to the new database.

My workflow is similar to the emailed response, which is repeated below. But I also often search for sets of genes and then download the information below as a CSV file, to be able to compare transcription patterns and other descriptions and to get the previous identifiers.
I also link to the genome browser to compare transcription, chromatin modifications/accessibility, predicted AP2 sites as well as SNP info.

  1. Look at the genomic location (subtelomeric, which other genes are neighbours)
  2. Look at orthologues within the genus / across Plasmodium and whether these genes are syntenic or not.
  3. Look at a subset of the transcriptomic datasets (especially for Pf, where there were a lot of different datasets) in order to figure out stage specificity
  4. Look for SNPs, synonymous vs non-synonymous
  5. Look at phenotype data (PlasmoGEM or piggyBac data) to see whether this gene could be knocked out directly or not.
  6. Link out to metabolic pathways.
  7. Look at the AlphaFold link-out to see whether the structure is predicted to a high degree of confidence, or only the core of the protein.
  8. Look at what domains were identified in the protein (Pfam, InterPro); does it have a predicted SP or TM? Also look at the hydrophilicity/hydrophobicity plot to see whether I think a TM/SP was missed, as these can be atypical in the Apicomplexan phylum.
  9. Look at proteomic data (is there mass spec evidence for expression in schizonts and merozoites, and is this gene only expressed there or also in other infectious stages such as sporozoites?)
  10. Download the sequences (ORF ± 1000 bp; protein sequence)
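
For the hydrophobicity check in step 8, a minimal Python sketch of a sliding-window hydropathy profile using the published Kyte-Doolittle scale is given below. The function names are hypothetical. The 19-residue window and 1.6 cutoff are the values commonly quoted for spotting membrane-spanning segments and are only a starting assumption here; for atypical apicomplexan TM/SP regions they would probably need to be relaxed and the profile inspected by eye.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19) -> list:
    """Mean hydropathy of every full-length window along the protein.

    Non-standard residues (X, *, ...) are scored as 0.0, which is an
    arbitrary choice made for this sketch.
    """
    scores = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in protein.upper()]
    if window < 1 or window > len(scores):
        return []
    return [sum(scores[i:i + window]) / window for i in range(len(scores) - window + 1)]


def candidate_tm_windows(protein: str, window: int = 19, threshold: float = 1.6) -> list:
    """0-based start positions of windows whose mean hydropathy reaches the threshold."""
    profile = hydropathy_profile(protein, window)
    return [start for start, mean in enumerate(profile) if mean >= threshold]


if __name__ == "__main__":
    toy_protein = "DDDDD" + "L" * 19 + "KKKKK"  # artificial hydrophobic stretch
    print(candidate_tm_windows(toy_protein))
```

Plotting the values from `hydropathy_profile` against position gives the kind of plot referred to in the list; `candidate_tm_windows` only flags where it crosses the chosen cutoff.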

Here’s another email sent to me. They indicated, at a minimum, they would need:

  • full access to the “Genome Browser” interactive tools is essential: this, together with the sequences of the gene of interest, is crucial for designing molecular constructs for all of our genes/proteins of interest (auxin-induced iKD, TetON… all of the current tools that make biological characterization possible)

  • full information on sequences (gene, with and without introns, mRNA, proteins sequences) are ALL mandatory

  • full information on “phenotype”: we NEED all phenotypic scores as provided by Sidik et al. 2016 (and more)

  • full information on protein features: domains, transmembrane…all

  • full information on epigenetic marks, transcripts, phosphoproteomes… all interactive data from repositories need to be accessible

  • predicted localisation (LOPIT data) is crucial too

  • BLAST tools are crucial as well

  • prediction via PlasmoAP and PATS are very much needed

We use the databases for a variety of things. Most commonly (on a daily basis) we look up a gene ID and all associated information: protein features (TMDs and number of TMDs, signal peptide), sequences, phenotype, transcription profiles in various strains, orthologues, paralogues, etc. Having all this information ready on one website, instead of needing to gather it from different web services, already saves a huge amount of researchers’ time (= money). Quite often we then use searches to look at a subset of genes with a certain feature (5-12 TMDs to identify transporters, signal-peptide-containing proteins, etc.).
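
The feature-based searches mentioned above (for example 5-12 TMDs to find candidate transporters) amount to simple filters over per-gene annotations. Here is a minimal Python sketch, assuming the TMD count and signal-peptide prediction have already been exported into a table. The `GeneRecord` fields, function names, and gene IDs are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class GeneRecord:
    """Hypothetical per-gene annotation row."""
    gene_id: str
    tmd_count: int
    has_signal_peptide: bool


def genes_by_tmd_count(genes, min_tmd: int = 5, max_tmd: int = 12) -> list:
    """Gene IDs whose predicted TMD count lies in [min_tmd, max_tmd]."""
    return [g.gene_id for g in genes if min_tmd <= g.tmd_count <= max_tmd]


def genes_with_signal_peptide(genes) -> list:
    """Gene IDs with a predicted signal peptide."""
    return [g.gene_id for g in genes if g.has_signal_peptide]


if __name__ == "__main__":
    # Invented records standing in for a downloaded annotation table.
    records = [
        GeneRecord("GENE_0001", 0, True),
        GeneRecord("GENE_0002", 7, False),
        GeneRecord("GENE_0003", 12, True),
        GeneRecord("GENE_0004", 14, False),
    ]
    print(genes_by_tmd_count(records))         # candidate transporters
    print(genes_with_signal_peptide(records))  # candidate secreted proteins
```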

Easy access to cDNA/genome/mRNA sequences in one place. Since the websites went down I have tried to gather this information “quickly” from other websites, and it was not straightforward or user friendly. It may take a while to learn where best to look, but compared to the EuPathDB sites it was neither intuitive nor a one-stop-shop solution.

We download the latest and older releases of genome and protein sequences for sequencing and proteomics searches. Having access to older versions is sometimes critical, as new annotations can make some things better but miss older, correct variants.

User comments are REALLY helpful and can save huge amounts of work and make use of unpublished data, and probably some data that will never be published.

Access to datasets from published manuscripts in one place, including from paywalled journals:

This is a HUGE advantage of EuPathDB, as it allows people in resource-poor countries or institutions to access such information. Since most of the diseases we work on are dominant in resource-poor countries, this is really, really important!

In summary, the cumulative amount of time researchers would need to spend gathering information from various sources will massively cut into the time we spend on actual research. Consolidating information and making it searchable is probably the key value of the website.
