Computational Methods for Genome Assembly, Transcript Assembly, and Gene Discovery


Ontology type: schema:MonetaryGrant     


Grant Info

YEARS

1999-2025

FUNDING AMOUNT

6,427,394 USD

ABSTRACT

Project Summary: Improvements in sequencing technology have spurred a tremendous increase in the use of sequencing to answer a wide range of questions in biology and medicine. Thousands of new human genomes are being sequenced each year in efforts to track down the genetic causes of human diseases. In parallel with this increase in whole-genome sequencing, RNA sequencing has also exploded in popularity, due to its power to characterize gene expression in a multitude of cell types and conditions, and to its potential to discover new genes and new splice variants. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also present opportunities for discovery. Furthermore, to properly analyze the many diverse humans being sequenced, we can no longer afford to rely on a single reference genome that is missing much of the variation found in the human population, and that makes it very difficult to analyze sequences that do not match the reference. We propose to address these challenges in four specific ways. First, we will develop new and improved assembly algorithms that take advantage of the latest long-read technology to create genomes of unprecedented contiguity and completeness. This effort will include a method for creating haplotype-resolved assemblies when sequences from both parents are available, and a method to use an existing reference genome to create a highly contiguous assembly at minimal cost. Second, we will apply these methods to build new human reference genomes, assembled and annotated as thoroughly as the current human reference. These genomes, each representing a single individual, can then serve as the basis for many future studies of the relevant populations. Third, in the area of RNA-seq analysis, our lab has previously developed two widely used spliced aligners, TopHat and HISAT, and two equally popular transcriptome assemblers, Cufflinks and StringTie, which now have many thousands of users. We will extend and improve the StringTie algorithm, augmenting its novel network flow algorithm with de novo assembly plus new alignment methods to handle long reads and to improve its construction and quantification of transcripts. Fourth, we propose to systematically assemble thousands of RNA-seq experiments to discover new genes and to re-build the human gene catalog, an effort that could have a major impact on a broad array of human genetic and genomic studies. We have recently released our first version of this effort as CHESS, a human gene catalog built from a massive RNA-seq database that represents a comprehensive, reproducible, and open method for annotating the human genome. The CHESS database already agrees more closely with the two most widely used human gene databases than either of them agrees with the other, and we will improve it further so that it can provide a basis for biomedical research for many years to come.

URL

http://projectreporter.nih.gov/project_info_description.cfm?aid=10343810

Related SciGraph Publications

  • 2022-09-28. Metagenome analysis using the Kraken software suite in NATURE PROTOCOLS
  • 2022-04-19. High-quality genome and methylomes illustrate features underlying evolutionary success of oaks in NATURE COMMUNICATIONS
  • 2022-03-31. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies in NATURE METHODS
  • 2020-08-28. Ultrafast and accurate 16S rRNA microbial community analysis using Kraken 2 in MICROBIOME
  • 2020-06-02. Assembly and annotation of an Ashkenazi human reference genome in GENOME BIOLOGY
  • 2020-05-12. Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank in GENOME BIOLOGY
  • 2020-03-18. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes in NATURE COMMUNICATIONS
  • 2020-02-07. Pan-genomics in the human genome era in NATURE REVIEWS GENETICS
  • 2020-01-02. Evolutionary superscaffolding and chromosome anchoring to improve Anopheles genome assemblies in BMC BIOLOGY
  • 2019-12-17. Recovering rearranged cancer chromosomes from karyotype graphs in BMC BIOINFORMATICS
  • 2019-12-16. Transcriptome assembly from long-read RNA-seq alignments with StringTie2 in GENOME BIOLOGY
  • 2019-10-28. RaGOO: fast and accurate reference-guided scaffolding of draft genomes in GENOME BIOLOGY
  • 2019-09-30. The bracteatus pineapple genome and domestication of clonally propagated crops in NATURE GENETICS
  • 2019-08-12. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome in NATURE BIOTECHNOLOGY
  • 2019-08-02. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype in NATURE BIOTECHNOLOGY
  • 2019-05-16. Next-generation genome annotation: we still struggle to get it right in GENOME BIOLOGY
  • 2019-05-16. Addressing confounding artifacts in reconstruction of gene co-expression networks in GENOME BIOLOGY
  • 2019-05-08. Hypo-osmotic-like stress underlies general cellular defects of aneuploidy in NATURE
  • 2019-03-01. A multi-task convolutional deep neural network for variant calling in single molecule sequencing in NATURE COMMUNICATIONS
  • 2018-11-28. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise in GENOME BIOLOGY
  • 2018-11-19. Assembly of a pan-genome from deep sequencing of 910 humans of African descent in NATURE GENETICS
  • 2018-11-16. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts in GENOME BIOLOGY
  • 2018-08-20. Open questions: How many genes do we have? in BMC BIOLOGY
  • 2018-04-30. Accurate detection of complex structural variations using single-molecule sequencing in NATURE METHODS
  • 2018-03-29. Piercing the dark matter: bioinformatics of long-range sequencing and mapping in NATURE REVIEWS GENETICS
  • 2017-05-08. Recurrent noncoding regulatory mutations in pancreatic ductal adenocarcinoma in NATURE GENETICS
  • 2017-05-08. Horizontal gene transfer is not a hallmark of the human genome in GENOME BIOLOGY
  • 2017-01-24. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast in NATURE COMMUNICATIONS
  • 2016-11-17. Indel variant analysis of short-read sequencing data with Scalpel in NATURE PROTOCOLS
  • 2016-10-17. Phased diploid genome assembly with single-molecule real-time sequencing in NATURE METHODS
  • 2016-08-11. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown in NATURE PROTOCOLS
  • 2016-01-30. CIDANE: comprehensive isoform discovery and abundance estimation in GENOME BIOLOGY
  • 2015-11-03. Use and mis-use of supplementary material in science publications in BMC BIOINFORMATICS
  • 2015-10-22. Teaser: Individualized benchmarking and optimization of read mapping results for NGS data in GENOME BIOLOGY
  • 2015-09-24. Metassembler: merging and optimizing de novo genome assemblies in GENOME BIOLOGY
  • 2015-09-07. Interactive analysis and assessment of single-cell copy-number variations in NATURE METHODS
  • 2015-03-09. HISAT: a fast spliced aligner with low memory requirements in NATURE METHODS
  • 2015-03-06. Ballgown bridges the gap between transcriptome assembly and expression analysis in NATURE BIOTECHNOLOGY
  • 2015-02-18. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads in NATURE BIOTECHNOLOGY
  • 2014-11. Whole genome de novo assemblies of three divergent strains of rice, Oryza sativa, document novel gene space of aus and indica in GENOME BIOLOGY
  • 2014-10-28. Reducing INDEL calling errors in whole genome and exome sequencing data in GENOME MEDICINE
  • 2014-08-17. Accurate de novo and transmitted indel detection in exome-capture data using microassembly in NATURE METHODS
  • 2014-01-01. Kraken: ultrafast metagenomic sequence classification using exact alignments in GENOME BIOLOGY
  • 2014. Whole genome de novo assemblies of three divergent strains of rice, Oryza sativa , document novel gene space of aus and indica in GENOME BIOLOGY
  • 2013-07-22. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species in GIGASCIENCE
  • 2013-06. The advantages of SMRT sequencing in GENOME BIOLOGY
  • 2013-05-10. Genome of the long-living sacred lotus (Nelumbo nucifera Gaertn.) in GENOME BIOLOGY
  • 2013-04-25. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions in GENOME BIOLOGY
  • 2013-01-01. The advantages of SMRT sequencing in GENOME BIOLOGY
  • 2012-10-22. Gene expression anti-profiles as a basis for accurate universal cancer signatures in BMC BIOINFORMATICS

JSON-LD is the canonical representation for SciGraph data.

    TIP: You can open this SciGraph record using an external JSON-LD service, such as JSON-LD Playground or Google SDTT.

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/31", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "type": "DefinedTerm"
          }
        ], 
        "amount": {
          "currency": "USD", 
          "type": "MonetaryAmount", 
          "value": 6427394.0
        }, 
        "description": "Project Summary Improvements in sequencing technology have spurred a tremendous increase in the use of sequencing to answer a wide range of questions in biology and medicine. Thousands of new human genomes are being sequenced each year in efforts to track down the genetic causes of human diseases. In parallel with this increase in whole-genome sequencing, RNA sequencing has also exploded in popularity, due to its power to characterize gene expression in a multitude of cell types and conditions, and to its potential to discover new genes and new splice variants. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also presents opportunities for discovery. Furthermore, to properly analyze the many diverse humans being sequenced, we can no longer afford to rely on a single reference genome that is missing much of the variation found in the human population, and that makes it very difficult to analyze sequences that do not match the reference. We propose to address these challenges in four specific ways: first, we will develop new and improved assembly algorithms that take advantage of the latest long-read technology to create genomes of unprecedented contiguity and completeness. This effort will include a method for creating haplotype-resolved assemblies when sequences from both parents are available, and a method to use an existing reference genome to create a highly contiguous assembly at minimal cost. Second, we will apply these methods to build new human reference genomes, assembled and annotated as thoroughly as the current human reference. These genomes, each representing a single individual, can then serve as the basis for many future studies of the relevant populations. Third, in the area of RNA-seq analysis our lab has previously developed two widely-used spliced aligners, TopHat and HISAT, and two equally popular transcriptome assemblers, Cufflinks and StringTie, which now have many thousands of users. We will extend and improve the StringTie algorithm, augmenting its novel network flow algorithm with de novo assembly plus new alignment methods to handle long reads and to improve its construction and quantification of transcripts. Fourth, we propose to systematically assemble thousands of RNA-seq experiments to discover new genes and to re-build the human gene catalog, an effort that could have a major impact on a broad array of human genetic and genomic studies. We have recently released our first version of this effort as CHESS, a human gene catalog built from a massive RNA-seq database that represents a comprehensive, reproducible, and open method for annotating the human genome. The CHESS database already agrees more closely with the two most widely-used human gene databases than either of them agree with one another, and we will improve it further so that it can provide a basis for biomedical research for many years to come.", 
        "endDate": "2025-02-28", 
        "funder": {
          "id": "http://www.grid.ac/institutes/grid.280128.1", 
          "type": "Organization"
        }, 
        "id": "sg:grant.2529453", 
        "identifier": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "grant.2529453"
            ]
          }, 
          {
            "name": "nih_id", 
            "type": "PropertyValue", 
            "value": [
              "R01HG006677"
            ]
          }
        ], 
        "keywords": [
          "human gene catalog", 
          "reference genome", 
          "gene catalog", 
          "new genes", 
          "human genome", 
          "single reference genome", 
          "de novo assembly", 
          "haplotype-resolved assemblies", 
          "human gene database", 
          "RNA-seq analysis", 
          "RNA-Seq database", 
          "RNA-seq experiments", 
          "human reference genome", 
          "new splice variant", 
          "quantification of transcripts", 
          "whole-genome sequencing", 
          "genome assembly", 
          "gene discovery", 
          "novo assembly", 
          "genomic studies", 
          "contiguous assemblies", 
          "transcript assembly", 
          "gene database", 
          "RNA sequencing", 
          "sequencing technologies", 
          "genome", 
          "use of sequencing", 
          "gene expression", 
          "long reads", 
          "human diseases", 
          "splice variants", 
          "cell types", 
          "transcriptome assemblers", 
          "diverse human", 
          "sequencing", 
          "human population", 
          "genetic cause", 
          "human reference", 
          "computational methods", 
          "genes", 
          "assembly", 
          "enormous data sets", 
          "assembly algorithm", 
          "single individual", 
          "biomedical research", 
          "thousands of users", 
          "sequence", 
          "accurate computational method", 
          "broad array", 
          "network flow algorithm", 
          "discovery", 
          "Cufflinks", 
          "transcripts", 
          "biology", 
          "TopHat", 
          "HISAT", 
          "StringTie", 
          "thousands", 
          "chess database", 
          "reads", 
          "alignment method", 
          "population", 
          "expression", 
          "flow algorithm", 
          "major impact", 
          "algorithm", 
          "data sets", 
          "future studies", 
          "new alignment method", 
          "first version", 
          "minimal cost", 
          "variants", 
          "tremendous increase", 
          "wide range", 
          "humans", 
          "database", 
          "basis", 
          "technology", 
          "relevant population", 
          "users", 
          "multitude", 
          "catalogue", 
          "assemblers", 
          "variation", 
          "analysis", 
          "efforts", 
          "chess", 
          "contiguity", 
          "increase", 
          "popularity", 
          "study", 
          "method", 
          "individuals", 
          "aligners", 
          "specific ways", 
          "potential", 
          "set", 
          "disease", 
          "quantification", 
          "array", 
          "types", 
          "Summary Improvements", 
          "completeness", 
          "experiments", 
          "cost", 
          "challenges", 
          "advantages", 
          "conditions", 
          "version", 
          "parents", 
          "parallel", 
          "lab", 
          "way", 
          "area", 
          "range", 
          "medicine", 
          "impact", 
          "opportunities", 
          "years", 
          "construction", 
          "research", 
          "questions", 
          "reference", 
          "cause", 
          "improvement", 
          "power", 
          "use", 
          "open method"
        ], 
        "name": "Computational Methods for Genome Assembly, Transcript Assembly, and Gene Discovery", 
        "recipient": [
          {
            "id": "http://www.grid.ac/institutes/grid.21107.35", 
            "type": "Organization"
          }, 
          {
            "affiliation": {
              "id": "http://www.grid.ac/institutes/None", 
              "name": "JOHNS HOPKINS UNIVERSITY", 
              "type": "Organization"
            }, 
            "familyName": "SALZBERG", 
            "givenName": "STEVEN L.", 
            "id": "sg:person.01223441713.02", 
            "type": "Person"
          }, 
          {
            "member": "sg:person.01223441713.02", 
            "roleName": "PI", 
            "type": "Role"
          }
        ], 
        "sameAs": [
          "https://app.dimensions.ai/details/grant/grant.2529453"
        ], 
        "sdDataset": "grants", 
        "sdDatePublished": "2022-11-24T21:22", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-springernature-scigraph/baseset/20221124/entities/gbq_results/grant/grant_83.jsonl", 
        "startDate": "1999-09-01", 
        "type": "MonetaryGrant", 
        "url": "http://projectreporter.nih.gov/project_info_description.cfm?aid=10343810"
      }
    ]
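
    The record above links the principal investigator indirectly: a "Role" entry points at a "Person" entry through its "member" field. As a minimal sketch of reading those fields, the Python below parses a trimmed copy of the record (keywords and description left out for brevity) and pulls out the identifiers, dates, amount, and PI name. The function name summarize_grant and the keys of the returned dictionary are illustrative choices, not part of SciGraph.

```python
import json

# Trimmed copy of the JSON-LD record shown above; keywords, description and
# publisher metadata are omitted here only to keep the sketch short.
RECORD_JSON = """
[
  {
    "amount": {"currency": "USD", "type": "MonetaryAmount", "value": 6427394.0},
    "startDate": "1999-09-01",
    "endDate": "2025-02-28",
    "id": "sg:grant.2529453",
    "identifier": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["grant.2529453"]},
      {"name": "nih_id", "type": "PropertyValue", "value": ["R01HG006677"]}
    ],
    "name": "Computational Methods for Genome Assembly, Transcript Assembly, and Gene Discovery",
    "recipient": [
      {"id": "http://www.grid.ac/institutes/grid.21107.35", "type": "Organization"},
      {
        "affiliation": {"id": "http://www.grid.ac/institutes/None",
                        "name": "JOHNS HOPKINS UNIVERSITY", "type": "Organization"},
        "familyName": "SALZBERG",
        "givenName": "STEVEN L.",
        "id": "sg:person.01223441713.02",
        "type": "Person"
      },
      {"member": "sg:person.01223441713.02", "roleName": "PI", "type": "Role"}
    ],
    "type": "MonetaryGrant"
  }
]
"""


def summarize_grant(record):
    """Collect a few headline fields from one grant record (hypothetical helper)."""
    # Each identifier entry carries a name and a one-element list of values.
    identifiers = {item["name"]: item["value"][0] for item in record.get("identifier", [])}

    recipients = record.get("recipient", [])
    # Index Person entries by id so Role entries can be resolved to names.
    people = {r["id"]: r for r in recipients if r.get("type") == "Person"}
    pi_ids = [r["member"] for r in recipients
              if r.get("type") == "Role" and r.get("roleName") == "PI"]
    investigators = [f'{people[i]["givenName"]} {people[i]["familyName"]}'
                     for i in pi_ids if i in people]

    return {
        "name": record["name"],
        "nih_id": identifiers.get("nih_id"),
        "dimensions_id": identifiers.get("dimensions_id"),
        "period": (record["startDate"], record["endDate"]),
        "amount": (record["amount"]["value"], record["amount"]["currency"]),
        "principal_investigators": investigators,
    }


grant = json.loads(RECORD_JSON)[0]  # the record is a one-element JSON array
summary = summarize_grant(grant)
for key, value in summary.items():
    print(f"{key}: {value}")
```

    Resolving the PI through the Role entry, rather than taking the first Person in the list, keeps the sketch correct for records that list several people with different roles.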
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/grant.2529453'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/grant.2529453'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/grant.2529453'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/grant.2529453'
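
    The four curl commands differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard library. The names ACCEPT_TYPES, build_request and fetch are illustrative, and the sketch assumes the endpoint still answers as the commands above describe; only build_request is exercised here, so running the block makes no network call.

```python
import urllib.request

BASE_URL = "https://scigraph.springernature.com/"

# Media types taken from the curl commands above; the short keys are
# illustrative labels chosen for this sketch.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(record_id, fmt="json-ld"):
    """Prepare a GET request that asks for one RDF serialization."""
    return urllib.request.Request(
        BASE_URL + record_id,
        headers={"Accept": ACCEPT_TYPES[fmt]},
    )


def fetch(record_id, fmt="json-ld", timeout=30):
    """Send the request and return the response body as text.

    Assumes the service is reachable and replies in UTF-8.
    """
    with urllib.request.urlopen(build_request(record_id, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network.
request = build_request("grant.2529453", "turtle")
print(request.full_url, request.get_header("Accept"))
```

    Calling fetch("grant.2529453", "nt") would then be the counterpart of the N-Triples curl command.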


     

    This table displays all metadata directly associated to this object as RDF triples.

    172 TRIPLES      18 PREDICATES      149 URIs      141 LITERALS      5 BLANK NODES

    Subject Predicate Object
    1 sg:grant.2529453 schema:about anzsrc-for:31
    2 schema:amount Ndc3a6780a1e14f76ad77d5965b77e58a
    3 schema:description Project Summary Improvements in sequencing technology have spurred a tremendous increase in the use of sequencing to answer a wide range of questions in biology and medicine. Thousands of new human genomes are being sequenced each year in efforts to track down the genetic causes of human diseases. In parallel with this increase in whole-genome sequencing, RNA sequencing has also exploded in popularity, due to its power to characterize gene expression in a multitude of cell types and conditions, and to its potential to discover new genes and new splice variants. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also presents opportunities for discovery. Furthermore, to properly analyze the many diverse humans being sequenced, we can no longer afford to rely on a single reference genome that is missing much of the variation found in the human population, and that makes it very difficult to analyze sequences that do not match the reference. We propose to address these challenges in four specific ways: first, we will develop new and improved assembly algorithms that take advantage of the latest long-read technology to create genomes of unprecedented contiguity and completeness. This effort will include a method for creating haplotype-resolved assemblies when sequences from both parents are available, and a method to use an existing reference genome to create a highly contiguous assembly at minimal cost. Second, we will apply these methods to build new human reference genomes, assembled and annotated as thoroughly as the current human reference. These genomes, each representing a single individual, can then serve as the basis for many future studies of the relevant populations. Third, in the area of RNA-seq analysis our lab has previously developed two widely-used spliced aligners, TopHat and HISAT, and two equally popular transcriptome assemblers, Cufflinks and StringTie, which now have many thousands of users. We will extend and improve the StringTie algorithm, augmenting its novel network flow algorithm with de novo assembly plus new alignment methods to handle long reads and to improve its construction and quantification of transcripts. Fourth, we propose to systematically assemble thousands of RNA-seq experiments to discover new genes and to re-build the human gene catalog, an effort that could have a major impact on a broad array of human genetic and genomic studies. We have recently released our first version of this effort as CHESS, a human gene catalog built from a massive RNA-seq database that represents a comprehensive, reproducible, and open method for annotating the human genome. The CHESS database already agrees more closely with the two most widely-used human gene databases than either of them agree with one another, and we will improve it further so that it can provide a basis for biomedical research for many years to come.
    4 schema:endDate 2025-02-28
    5 schema:funder grid-institutes:grid.280128.1
    6 schema:identifier N5c96e3ba8dc6480c98b1e8f6b80169ff
    7 N827fa7ca0db041ff8c5c0340bd484f41
    8 schema:keywords Cufflinks
    9 HISAT
    10 RNA sequencing
    11 RNA-Seq database
    12 RNA-seq analysis
    13 RNA-seq experiments
    14 StringTie
    15 Summary Improvements
    16 TopHat
    17 accurate computational method
    18 advantages
    19 algorithm
    20 aligners
    21 alignment method
    22 analysis
    23 area
    24 array
    25 assemblers
    26 assembly
    27 assembly algorithm
    28 basis
    29 biology
    30 biomedical research
    31 broad array
    32 catalogue
    33 cause
    34 cell types
    35 challenges
    36 chess
    37 chess database
    38 completeness
    39 computational methods
    40 conditions
    41 construction
    42 contiguity
    43 contiguous assemblies
    44 cost
    45 data sets
    46 database
    47 de novo assembly
    48 discovery
    49 disease
    50 diverse human
    51 efforts
    52 enormous data sets
    53 experiments
    54 expression
    55 first version
    56 flow algorithm
    57 future studies
    58 gene catalog
    59 gene database
    60 gene discovery
    61 gene expression
    62 genes
    63 genetic cause
    64 genome
    65 genome assembly
    66 genomic studies
    67 haplotype-resolved assemblies
    68 human diseases
    69 human gene catalog
    70 human gene database
    71 human genome
    72 human population
    73 human reference
    74 human reference genome
    75 humans
    76 impact
    77 improvement
    78 increase
    79 individuals
    80 lab
    81 long reads
    82 major impact
    83 medicine
    84 method
    85 minimal cost
    86 multitude
    87 network flow algorithm
    88 new alignment method
    89 new genes
    90 new splice variant
    91 novo assembly
    92 open method
    93 opportunities
    94 parallel
    95 parents
    96 popularity
    97 population
    98 potential
    99 power
    100 quantification
    101 quantification of transcripts
    102 questions
    103 range
    104 reads
    105 reference
    106 reference genome
    107 relevant population
    108 research
    109 sequence
    110 sequencing
    111 sequencing technologies
    112 set
    113 single individual
    114 single reference genome
    115 specific ways
    116 splice variants
    117 study
    118 technology
    119 thousands
    120 thousands of users
    121 transcript assembly
    122 transcriptome assemblers
    123 transcripts
    124 tremendous increase
    125 types
    126 use
    127 use of sequencing
    128 users
    129 variants
    130 variation
    131 version
    132 way
    133 whole-genome sequencing
    134 wide range
    135 years
    136 schema:name Computational Methods for Genome Assembly, Transcript Assembly, and Gene Discovery
    137 schema:recipient N3449ee6913c94b49b38431b569759b40
    138 sg:person.01223441713.02
    139 grid-institutes:grid.21107.35
    140 schema:sameAs https://app.dimensions.ai/details/grant/grant.2529453
    141 schema:sdDatePublished 2022-11-24T21:22
    142 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    143 schema:sdPublisher Nddf90fa0021b44a48a5dba72876db1a7
    144 schema:startDate 1999-09-01
    145 schema:url http://projectreporter.nih.gov/project_info_description.cfm?aid=10343810
    146 sgo:license sg:explorer/license/
    147 sgo:sdDataset grants
    148 rdf:type schema:MonetaryGrant
    149 N3449ee6913c94b49b38431b569759b40 schema:member sg:person.01223441713.02
    150 schema:roleName PI
    151 rdf:type schema:Role
    152 N5c96e3ba8dc6480c98b1e8f6b80169ff schema:name nih_id
    153 schema:value R01HG006677
    154 rdf:type schema:PropertyValue
    155 N827fa7ca0db041ff8c5c0340bd484f41 schema:name dimensions_id
    156 schema:value grant.2529453
    157 rdf:type schema:PropertyValue
    158 Ndc3a6780a1e14f76ad77d5965b77e58a schema:currency USD
    159 schema:value 6427394.0
    160 rdf:type schema:MonetaryAmount
    161 Nddf90fa0021b44a48a5dba72876db1a7 schema:name Springer Nature - SN SciGraph project
    162 rdf:type schema:Organization
    163 anzsrc-for:31 schema:inDefinedTermSet anzsrc-for:
    164 rdf:type schema:DefinedTerm
    165 sg:person.01223441713.02 schema:affiliation grid-institutes:None
    166 schema:familyName SALZBERG
    167 schema:givenName STEVEN L.
    168 rdf:type schema:Person
    169 grid-institutes:None schema:name JOHNS HOPKINS UNIVERSITY
    170 rdf:type schema:Organization
    171 grid-institutes:grid.21107.35 rdf:type schema:Organization
    172 grid-institutes:grid.280128.1 rdf:type schema:Organization
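
    Summary counts like those in the table header (triples, predicates, URIs, literals, blank nodes) can be recomputed from an N-Triples export. The sketch below is a simplified line parser, not a full N-Triples implementation, and it runs on a five-line sample. The values in the sample come from this record, but the full IRIs and the blank-node label _:b0 are assumptions made for illustration, and the counting convention (distinct terms in any position) may differ from the one the page uses.

```python
# Hypothetical five-triple sample; the IRIs and the blank-node label are
# assumptions, only the literal values are taken from the record above.
SAMPLE_NTRIPLES = """\
<https://scigraph.springernature.com/grant.2529453> <http://schema.org/name> "Computational Methods for Genome Assembly, Transcript Assembly, and Gene Discovery" .
<https://scigraph.springernature.com/grant.2529453> <http://schema.org/startDate> "1999-09-01" .
<https://scigraph.springernature.com/grant.2529453> <http://schema.org/identifier> _:b0 .
_:b0 <http://schema.org/name> "nih_id" .
_:b0 <http://schema.org/value> "R01HG006677" .
"""


def parse_ntriples_line(line):
    """Split one N-Triples statement into (subject, predicate, object).

    Simplified: relies on subject and predicate containing no whitespace,
    which holds for IRIs and blank-node labels.
    """
    subject, predicate, rest = line.strip().split(None, 2)
    rest = rest.rstrip()
    if not rest.endswith("."):
        raise ValueError(f"statement does not end with '.': {line!r}")
    return subject, predicate, rest[:-1].rstrip()


def term_kind(term):
    """Classify a term by its leading characters."""
    if term.startswith("<"):
        return "uri"
    if term.startswith("_:"):
        return "blank"
    if term.startswith('"'):
        return "literal"
    raise ValueError(f"unrecognized term: {term!r}")


def count_terms(ntriples_text):
    """Count triples and distinct predicates, URIs, literals and blank nodes."""
    triples = [
        parse_ntriples_line(line)
        for line in ntriples_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")  # skip blanks, comments
    ]
    distinct = {"uri": set(), "blank": set(), "literal": set()}
    for triple in triples:
        for term in triple:
            distinct[term_kind(term)].add(term)
    return {
        "triples": len(triples),
        "predicates": len({predicate for _, predicate, _ in triples}),
        "uris": len(distinct["uri"]),
        "literals": len(distinct["literal"]),
        "blank_nodes": len(distinct["blank"]),
    }


counts = count_terms(SAMPLE_NTRIPLES)
print(counts)
```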
     



