About ExpasyGPT
Since it was founded in 1998, the SIB Swiss Institute of Bioinformatics has been developing high-quality databases to serve the scientific community. These databases cover a wide range of data, including protein information in UniProt, gene expression in Bgee, enzymatic reactions in Rhea, protein interactions in STRING, cell lines in Cellosaurus, and orthologs in OMA and OrthoDB. All these resources are listed on Expasy, the Swiss Bioinformatics Resource Portal.
Life science data is inherently diverse, differing in format, storage, and access methods, which makes data integration and reuse particularly challenging. To overcome this, SIB has contributed to building a "semantic web" of biological data, featuring public SPARQL endpoints (see table below).
SPARQL Endpoint | Number of triples | Interconnections with SIB Knowledge Graph |
Bgee | 6,757,455,826 | OMA, UniProt |
Cellosaurus | 16,340,172,623 | None |
GlyConnect | Not defined. | Not defined. |
HAMAP | Not defined. | Not defined. |
MetaNetX | 29,852,270 | OMA, SwissLipids, UniProt |
neXtProt | 2,130,123,305 | Bgee, Cellosaurus, GlyConnect, UniProt |
OMA | 584,255,821 | UniProt |
OrthoDB | 7,627,540,081 | UniProt |
Rhea | 4,968,188 | UniProt |
SIBiLS | Not defined. | Not defined. |
STRING | 443,541,930 | UniProt |
SwissLipids | 15,271,369 | Rhea, UniProt |
UniProt | 190,000,000,000 | Bgee, GlyConnect, HAMAP, MetaNetX, neXtProt, OMA, OrthoDB, Rhea, STRING, SwissLipids |
SPARQL endpoints (see “Some definitions” below) are specialized websites that process so-called SPARQL queries. These queries let users run searches that go beyond simple text-based lookups, for example: “Find UniProt entries with a transmembrane region and an alanine in the 15-amino-acid region preceding it”.
Additionally, SPARQL allows users to execute federated queries, which retrieve data from multiple distributed sources at once, for example: “Identify mouse homologs in the OMA Browser for human enzymes that are involved in sterol-related reactions, as described in the Rhea database”. However, constructing SPARQL queries is often complex and requires expertise that many users do not have.
How Large Language Models Help
Advances in Large Language Models (LLMs), like ChatGPT, have opened new possibilities in natural language processing. ExpasyGPT harnesses the power of LLMs to translate user questions directly into SPARQL queries, enabling users to interact with data in plain language without needing to understand the complex SPARQL syntax or underlying knowledge graph structure.
The SIB SPARQL endpoints are developed and maintained by various research groups. Over the years, these groups have worked to enrich and harmonize endpoint documentation and metadata. ExpasyGPT’s models are trained on the harmonized metadata, along with 1000 example questions, including 65 federated queries, all available on GitHub and accessible through their respective SPARQL endpoints (e.g., UniProt SPARQL examples). These examples also help non-expert users more easily construct their own queries.
How ExpasyGPT works
ExpasyGPT is part of a larger effort to enhance search capabilities across SIB’s interoperable databases, making life science data more accessible to everyone.
It has been trained using the SPARQL endpoints’ metadata and example queries from:
- UniProt, an expertly curated database of proteins
- OMA, the orthologous matrix
- Bgee, an expertly curated gene expression database
- Rhea, an expertly curated database of biochemical reactions
- SwissLipids, an expertly curated database of lipids.
As the service evolves, it will integrate additional endpoints from the SIB Knowledge Graph.
The user interface is simple and lightweight, featuring a text area where users can enter their questions. To assist with this, a set of example questions is displayed for guidance.
After the user clicks the "Submit" button, a SPARQL query based on the user's question is automatically generated. On the results page, users can run the query or modify it. To support the ongoing improvement of the tool, users can provide feedback by liking or disliking the generated SPARQL query. This feedback helps developers refine the system further.
Disclaimer: ExpasyGPT was released as a beta version, and its results are provided as is. User questions are stored for research purposes and to refine the system. See the SIB privacy policy for more information.
ExpasyGPT is a joint project between the following groups at the SIB Swiss Institute of Bioinformatics:
- Knowledge Representation: Conceptualization and implementation
- Biodata Resources: Project management
- Information Technology: Expasy maintenance, hardware infrastructure, system administration
- Semantic Web of data focus group: SPARQL endpoint operation and documentation.
If you intend to use this service for your research, please cite:
- Vincent Emonet, Jerven Bolleman, Severine Duvaud, Tarcisio Mendes de Farias, and Ana Claudia Sima. LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs. Proceedings of the International Semantic Web Conference 2024. doi: 10.48550/arXiv.2410.06062
If you want to go further, the source code for the Expasy chat system and the reusable modules are openly available:
- Chat system and components source code: github.com/sib-swiss/sparql-llm,
- SPARQL examples repository: github.com/sib-swiss/sparql-examples,
- SPARQL example validator: github.com/sib-swiss/sparql-examples-utils,
- VoID description generator: github.com/JervenBolleman/void-generator,
- Endpoint metadata checker: sib-swiss.github.io/sparql-editor/check.
Some definitions
What is the Semantic Web?
Semantic Web is a term first used by web pioneer and World Wide Web Consortium founder Tim Berners-Lee in 1999. The Semantic Web is a web of linked data that allows both humans and machines to navigate between databases that store information about the same entity (e.g., a gene or a protein). Several SIB databases are part of the global Semantic Web: they provide their data in RDF, accessible through SPARQL endpoints (1).
Resource Description Framework (RDF) is a standard semantic graph technology suited to sharing and linking data worldwide. An RDF triple consists of a subject, a predicate and an object. The predicate specifies the relationship between the subject and the object, and each element can be identified by a globally unique identifier. Triples can thus be represented as a graph, where the subject and object correspond to nodes and the predicate corresponds to the edge joining them. By connecting all the information about entities found in RDF triples, it is possible to construct a Knowledge Graph that stores information about entities (e.g. proteins, genes, organs) and their relationships to one another (e.g. ‘is expressed in’, ‘codes for’).
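The triple-to-graph mapping described above can be sketched in a few lines of Python. This is a minimal illustration only: the `http://example.org/` identifiers and the helper names `build_graph` and `objects` are hypothetical and do not come from any SIB resource.

```python
from collections import defaultdict

# Hypothetical namespace for illustration; real RDF data uses the IRIs
# published by each resource.
EX = "http://example.org/"

# Each triple is (subject, predicate, object).
triples = [
    (EX + "geneA", EX + "codesFor", EX + "proteinA"),
    (EX + "geneA", EX + "isExpressedIn", EX + "liver"),
    (EX + "proteinA", EX + "interactsWith", EX + "proteinB"),
]


def build_graph(triples):
    """Index triples by subject: each subject node maps to its outgoing
    (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


def objects(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for p, o in graph.get(subject, []) if p == predicate]


graph = build_graph(triples)
print(objects(graph, EX + "geneA", EX + "isExpressedIn"))
# prints ['http://example.org/liver']
```

Because every node is named by a full identifier, triples from two different sources that mention the same identifier attach to the same node, which is what makes linking across databases possible.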
Querying the Knowledge Graph with SPARQL
A SPARQL endpoint enables users (human or machines) to query the RDF data using SPARQL. The SIB databases which provide a SPARQL endpoint are listed at: https://www.expasy.org/search/sparql
SPARQL (SPARQL Protocol and RDF Query Language) is a query language for retrieving and manipulating data stored in Resource Description Framework (RDF) format. SPARQL lets users combine search criteria for specific content and so run queries that text-based search cannot answer. It thus provides a means to mine the information stored in databases. Furthermore, SPARQL enables data distributed across multiple sources to be queried by executing federated queries.
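As a rough sketch of what querying an endpoint involves, the Python below builds a SPARQL Protocol GET request and flattens a result document in the standard SPARQL JSON results format. The endpoint address, the toy query, and the helper names (`build_request`, `parse_bindings`, `run_query`) are assumptions made for illustration; the sample response is hand-written, not real data.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint address; check the Expasy SPARQL listing for current URLs.
ENDPOINT = "https://sparql.uniprot.org/sparql"

# Toy query, assuming the UniProt core vocabulary prefix.
QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein WHERE { ?protein a up:Protein } LIMIT 3
"""


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_bindings(payload):
    """Flatten a SPARQL JSON result document into {variable: value} rows."""
    doc = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


def run_query(endpoint, query, timeout=30):
    """Send the query over the network and return the parsed rows."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_bindings(response.read().decode("utf-8"))


# Offline demonstration with a hand-written response document.
sample = (
    '{"head": {"vars": ["protein"]}, "results": {"bindings": '
    '[{"protein": {"type": "uri", "value": "http://example.org/P1"}}]}}'
)
print(parse_bindings(sample))
```

Calling `run_query(ENDPOINT, QUERY)` would perform the actual network request; it is left uncalled here so the sketch runs without network access.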
A federated query is a special query that runs on more than one SPARQL endpoint, enabling cross-database querying and information retrieval. Although federated queries require knowledge of the data models of the databases to be queried, and of which entities are equivalent, they are nevertheless extremely powerful, allowing users to explore the data in databases worldwide, provided they have SPARQL endpoints. A set of SPARQL examples that use the different SIB resources is found at: https://sib-swiss.github.io/sparql-examples/
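In SPARQL, the part of a federated query that should run on another endpoint is wrapped in a `SERVICE` clause. The Python sketch below only assembles such a query as text. The endpoint address, the prefixes, and the graph patterns are assumptions loosely modeled on public UniProt and Rhea examples and should be checked against the respective documentation; `federated_query` is a hypothetical helper, not part of any SIB tool.

```python
# Assumed address of the remote endpoint.
RHEA_ENDPOINT = "https://sparql.rhea-db.org/sparql"


def federated_query(prefixes, select_vars, local_pattern,
                    remote_endpoint, remote_pattern, limit=10):
    """Compose a SELECT query whose second pattern is delegated to a
    remote endpoint through a SERVICE clause."""
    prefix_lines = "\n".join(
        f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()
    )
    variables = " ".join("?" + v for v in select_vars)
    return (
        f"{prefix_lines}\n"
        f"SELECT {variables} WHERE {{\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        f"  }}\n"
        f"}} LIMIT {limit}"
    )


query = federated_query(
    prefixes={
        "up": "http://purl.uniprot.org/core/",
        "rh": "http://rdf.rhea-db.org/",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    },
    select_vars=["protein", "reaction"],
    # Assumed pattern: proteins annotated with a catalyzed reaction.
    local_pattern=(
        "?protein up:annotation/up:catalyticActivity/"
        "up:catalyzedReaction ?reaction ."
    ),
    remote_endpoint=RHEA_ENDPOINT,
    # Assumed pattern: keep only identifiers that Rhea knows as reactions.
    remote_pattern="?reaction rdfs:subClassOf rh:Reaction .",
)
print(query)
```

The shared variable `?reaction` is what joins the two sources: it must refer to the same identifier in both databases, which is why knowing which entities are equivalent matters.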
Helping users query the knowledge graph
Large Language Models (LLMs) are creating a paradigm shift in how we interact with data across domains (see this video for details). Bioinformatics is one of the fields most prominently affected by the advent of LLMs, whether for biodata exploration via LLM-based AI assistants (2) or for dedicated, domain-specific LLMs such as Protein Language Models (3). While LLMs are prone to mistakes in factual recall, their ability to summarize and to use tools suggests new opportunities to help non-expert users query and interact with complex data, while drawing on the Knowledge Graph to improve the reliability of the answers.
Writing SPARQL queries is still beyond the expertise of most users. Given the current progress in LLMs and the demonstrated importance of documentation for accessing knowledge graphs with tools like ChatGPT, the potential of these models to generate federated queries in response to user questions is being explored. This work should significantly enhance the current search capabilities of Expasy and move it towards a unified search engine across the interoperable SIB knowledge graphs. More at: https://www.sib.swiss/news/bringing-meaning-to-biological-data-knowledge-graphs-meet-chatgpt
References
- Vincent Emonet, Jerven Bolleman, Severine Duvaud, Tarcisio Mendes de Farias, and Ana Claudia Sima. LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs. Proceedings of the International Semantic Web Conference 2024. doi: 10.48550/arXiv.2410.06062
- Jerven Bolleman, Vincent Emonet, et al. A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications. GigaScience Oxford Journal. doi: 10.48550/arXiv.2410.06010
- SIB Swiss Institute of Bioinformatics RDF Group Members. The SIB Swiss Institute of Bioinformatics Semantic Web of data. Nucleic Acids Res. 2024 Jan 5;52(D1):D44-D51. doi: 10.1093/nar/gkad902
- O’Neil ST, Schaper K, Elsarboukh G, Reese JT, Moxon SAT, Harris NL, Munoz-Torres MC, Robinson PN, Haendel MA, Mungall CJ. Phenomics Assistant: An Interface for LLM-based Biomedical Knowledge Graph Exploration. bioRxiv 2024.01.31.578275; doi: 10.1101/2024.01.31.578275
- Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos JL Jr, Xiong C, Sun ZZ, Socher R, Fraser JS, Naik N. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023 Aug;41(8):1099-1106. doi: 10.1038/s41587-022-01618-2