About ExpasyGPT (Beta)

Since it was founded in 1998, the SIB Swiss Institute of Bioinformatics has been developing high-quality databases to serve the scientific community. These databases cover a wide range of data, including protein information in UniProt, gene expression in Bgee, enzymatic reactions in Rhea, protein interactions in STRING, cell lines in Cellosaurus, and orthologs in OMA and OrthoDB. All these resources are listed on Expasy, the Swiss Bioinformatics Resource Portal.

Life science data is inherently diverse, differing in format, storage, and access methods, which makes data integration and reuse particularly challenging. To overcome this, SIB has contributed to building a "semantic web" of biological data, featuring public SPARQL endpoints (see table below).

SPARQL Endpoint	Number of triples	Interconnections with SIB Knowledge Graph
Bgee	6,757,455,826	OMA, UniProt
Cellosaurus	17,826,121	None
GlyConnect	Not defined.	Not defined.
HAMAP	Not defined.	Not defined.
MetaNetX	29,852,270	OMA, SwissLipids, UniProt
OMA	584,255,821	UniProt
OrthoDB	7,627,540,081	UniProt
Rhea	4,968,188	UniProt
SIBiLS	Not defined.	Not defined.
STRING	443,541,930	UniProt
SwissLipids	15,271,369	Rhea, UniProt
UniProt	190,000,000,000	Bgee, GlyConnect, HAMAP, MetaNetX, neXtProt, OMA, OrthoDB, Rhea, STRING, SwissLipids

SPARQL endpoints (see “Some definitions” below) are specialized websites that process SPARQL queries. They let users perform searches that go beyond simple text-based queries, for example: “Find UniProt entries with a transmembrane region, with an Alanine in the 15 amino acid region preceding the transmembrane”.

Additionally, SPARQL allows users to execute federated queries, which retrieve data from multiple distributed sources in a single query, for example: “Identify mouse homologs in the OMA Browser for human enzymes that are involved in sterol-related reactions, as described in the Rhea database”. However, constructing SPARQL queries is often complex and requires expertise that many users do not have.

How Large Language Models Help

Advances in Large Language Models (LLMs), like ChatGPT, have opened new possibilities in natural language processing. ExpasyGPT harnesses the power of LLMs to translate user questions directly into SPARQL queries, enabling users to interact with data in plain language without needing to understand the complex SPARQL syntax or underlying knowledge graph structure.

The SIB SPARQL endpoints are developed and maintained by various research groups. Over the years, these groups have worked to enrich and harmonize endpoint documentation and metadata. ExpasyGPT builds on a large language model (LLM) and uses Retrieval-Augmented Generation (RAG) to incorporate the harmonized metadata and 1000 example questions, including 65 federated queries. These examples are publicly available on GitHub and are linked to their respective SPARQL endpoints (e.g., UniProt SPARQL examples). In addition to improving the LLM’s ability to generate accurate and contextual responses, these examples also support non-expert users by guiding them in constructing their own SPARQL queries.
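The retrieval step of RAG can be illustrated with a minimal Python sketch. It is not ExpasyGPT's implementation: the example questions, the placeholder query strings, the bag-of-words cosine similarity, and the function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`) are all assumptions made for illustration. A production system would typically use learned embeddings rather than word counts.

```python
import math
import re
from collections import Counter

# Hypothetical example store of (question, SPARQL query) pairs.
# The questions and the placeholder query strings are invented for illustration.
EXAMPLES = [
    ("Which human proteins have a transmembrane region?", "# SPARQL query for example 1"),
    ("Which genes are expressed in the human liver?", "# SPARQL query for example 2"),
    ("Which reactions involve cholesterol?", "# SPARQL query for example 3"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, examples, k=2):
    """Return the k stored examples whose question is most similar to the input."""
    q = Counter(tokenize(question))
    ranked = sorted(examples, key=lambda ex: cosine(q, Counter(tokenize(ex[0]))), reverse=True)
    return ranked[:k]


def build_prompt(question, examples, k=2):
    """Assemble an LLM prompt that places the retrieved examples before the user question."""
    parts = ["Translate the question into a SPARQL query.", ""]
    for ex_question, ex_query in retrieve(question, examples, k):
        parts += [f"Example question: {ex_question}", f"Example query: {ex_query}", ""]
    parts.append(f"Question: {question}")
    return "\n".join(parts)


prompt = build_prompt("Which genes are expressed in the mouse liver?", EXAMPLES, k=1)
```

The resulting `prompt` string would then be sent to the LLM. Placing the closest examples next to the question is what lets the model imitate queries that are known to work against the endpoints.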

How ExpasyGPT works

ExpasyGPT is part of a larger effort to enhance search capabilities across SIB’s interoperable databases, making life science data more accessible to everyone. By connecting a RAG-enhanced LLM to a set of SPARQL endpoints, ExpasyGPT enables access to structured biological knowledge in natural language. The currently supported endpoints include:

UniProt, an expertly curated database of proteins
OMA, the orthologous matrix
Bgee, an expertly curated gene expression database
Rhea, an expertly curated database of biochemical reactions
Cellosaurus, an expertly curated database on cell lines
SwissLipids, an expertly curated database of lipids.

As the service evolves, it will integrate additional endpoints from the SIB Knowledge Graph.

The user interface is simple and lightweight, featuring a text area where users can enter their questions. To assist with this, a set of example questions is displayed for guidance.

After the user clicks the "Submit" button, a SPARQL query based on the user's question is automatically generated. On the results page, users can run the query or modify it. To support the ongoing improvement of the tool, users can provide feedback by liking or disliking the generated SPARQL query. This feedback helps developers refine the system further.

Disclaimer: ExpasyGPT was released as a beta version, and the results are provided as is. User questions are stored for research purposes and to refine the system. See the SIB privacy policy for more information.

ExpasyGPT is a joint project between:

Knowledge Representation: Conceptualization and implementation
Biodata Resources: Project management
Information Technology: Expasy maintenance, hardware infrastructure, system administration
Semantic Web of data focus group: SPARQL endpoint operation and documentation.

at SIB Swiss Institute of Bioinformatics.

If you intend to use this service for your research, please cite:

Vincent Emonet, Jerven Bolleman, Severine Duvaud, Tarcisio Mendes de Farias, and Ana Claudia Sima. LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs. Proceedings of the International Semantic Web Conference 2024. doi: 10.48550/arXiv.2410.06062

If you want to go further, the source code for the Expasy chat system and the reusable modules are openly available:

Chat system and components source code: github.com/sib-swiss/sparql-llm,
SPARQL examples repository: github.com/sib-swiss/sparql-examples,
SPARQL example validator: github.com/sib-swiss/sparql-examples-utils,
VoID description generator: github.com/JervenBolleman/void-generator,
Endpoint metadata checker: sib-swiss.github.io/sparql-editor/check.

Some definitions

What is the Semantic Web?

The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is based on the Resource Description Framework (RDF), a graph data model.

Resource Description Framework (RDF) is a standard semantic graph technology suited to sharing and linking data worldwide. RDF triples consist of a subject, a predicate, and an object. The predicate specifies the relationship between the subject and the object, each defined by a globally unique identifier. Triples can thus be represented as a graph, where the subject and object correspond to nodes and the predicate corresponds to the edge joining them. By connecting all the information about entities found in RDF triples, it is possible to construct a Knowledge Graph that represents information about entities (e.g. proteins, genes, organs) and their relationships to one another (e.g. ‘is expressed in’, ‘codes for’).
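The triple model can be sketched in a few lines of Python. This is only an illustration: the identifiers below are shortened, made-up labels rather than real globally unique identifiers, and `build_graph` and `objects` are hypothetical helper names, not part of any RDF library.

```python
from collections import defaultdict

# Hypothetical triples of the form (subject, predicate, object).
# Real RDF would use full IRIs instead of these shortened labels.
TRIPLES = [
    ("gene:INS", "codes_for", "protein:Insulin"),
    ("gene:INS", "is_expressed_in", "organ:Pancreas"),
    ("protein:Insulin", "interacts_with", "protein:InsulinReceptor"),
]


def build_graph(triples):
    """Index triples by subject: each node maps to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def objects(graph, subject, predicate):
    """Follow the edges labelled `predicate` that leave `subject`."""
    return [obj for pred, obj in graph.get(subject, []) if pred == predicate]


graph = build_graph(TRIPLES)
expressed_in = objects(graph, "gene:INS", "is_expressed_in")
```

Because subjects and objects share one identifier space, the object of one triple (`protein:Insulin`) can be the subject of another, which is how separate statements link up into a graph.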

Querying the Knowledge Graph with SPARQL

A SPARQL endpoint enables users (humans or machines) to query the RDF data using SPARQL. The SIB databases that provide a SPARQL endpoint are listed at: https://www.expasy.org/search/sparql

SPARQL (SPARQL Protocol and RDF Query Language) is a query language for retrieving and manipulating data stored in Resource Description Framework (RDF) format. SPARQL allows search criteria for specific content to be combined, allowing the user to perform queries which cannot be answered with text-based search. It thus provides a means to mine the information stored in databases. Furthermore, SPARQL enables data distributed across multiple sources to be queried by executing federated queries.
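As a rough sketch of how a program talks to an endpoint, the Python snippet below prepares an HTTP GET request following the SPARQL Protocol, which passes the query text in a `query` parameter. The endpoint URL is an assumption (it is not given in the text above), the query is a generic pattern that returns any five triples, and `build_request` is a hypothetical helper. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint URL, used here for illustration only.
ENDPOINT = "https://sparql.uniprot.org/sparql"

# Generic query: return any five triples, without relying on a specific vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint, query):
    """Prepare a SPARQL Protocol GET request that asks for JSON results."""
    params = urlencode({"query": query})
    url = f"{endpoint}?{params}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
```

Passing `request` to `urllib.request.urlopen` would send it. With the JSON results format, the rows are found under `results` and then `bindings` in the decoded response.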

A federated query is a special query that runs on more than one SPARQL endpoint, enabling cross-database querying and information retrieval. Although federated queries require knowledge of the data models of the databases to be queried, and of which entities are equivalent, they are nevertheless extremely powerful, allowing users to explore the data in databases worldwide, provided they have SPARQL endpoints. A set of SPARQL examples that use the different SIB resources is found at: https://sib-swiss.github.io/sparql-examples/
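In SPARQL 1.1, the part of a query that should run on another endpoint is wrapped in a `SERVICE` clause. The Python sketch below only assembles such a query as text. The remote endpoint URL is an assumption, the `ex:` predicates are placeholders under `http://example.org/` and do not belong to any SIB data model, and `federated_query` is a hypothetical helper name.

```python
def federated_query(variables, local_pattern, remote_endpoint, remote_pattern):
    """Build a SPARQL query whose second pattern is delegated to a remote endpoint."""
    select = " ".join(f"?{name}" for name in variables)
    return (
        "PREFIX ex: <http://example.org/>\n"
        f"SELECT {select} WHERE {{\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        "  }\n"
        "}"
    )


query = federated_query(
    variables=["protein", "reaction"],
    # Placeholder pattern evaluated by the endpoint that receives the query.
    local_pattern="?protein ex:catalyzes ?reaction .",
    # Assumed URL of the second endpoint.
    remote_endpoint="https://sparql.rhea-db.org/sparql",
    # Placeholder pattern evaluated remotely; ?reaction is the shared join variable.
    remote_pattern="?reaction ex:involves ?compound .",
)
```

The shared variable (`?reaction` here) is what joins the two sources. Choosing it correctly is where knowledge of both data models, and of which entities are equivalent, comes in.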

Helping users query the knowledge graph

Large Language Models (LLMs) are creating a paradigm shift in how we interact with data across domains (see this video for details). Bioinformatics is one of the fields most prominently impacted by the advent of LLMs, whether for biodata exploration via LLM-based AI assistants (2) or for dedicated, domain-specific LLMs such as Protein Language Models (3). While LLMs are prone to mistakes in factual recall, their ability to summarize and to use tools suggests new opportunities to help non-expert users query and interact with complex data, while drawing on the Knowledge Graph to improve the reliability of the answers.

Writing SPARQL queries is still beyond the expertise of most users. Given the current progress in LLMs and the demonstrated importance of documentation for accessing knowledge graphs with tools like ChatGPT, the potential of these models to generate federated queries in response to user questions is being explored. This should significantly enhance the current search capabilities of Expasy and move it towards a unified search engine across the interoperable SIB knowledge graphs. More at: https://www.sib.swiss/news/bringing-meaning-to-biological-data-knowledge-graphs-meet-chatgpt

References

Vincent Emonet, Jerven Bolleman, Severine Duvaud, Tarcisio Mendes de Farias, and Ana Claudia Sima. LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs. Proceedings of the International Semantic Web Conference 2024. doi: 10.48550/arXiv.2410.06062
Jerven Bolleman, Vincent Emonet, et al. A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications. GigaScience Oxford Journal. doi: 10.48550/arXiv.2410.06010
SIB Swiss Institute of Bioinformatics RDF Group Members. The SIB Swiss Institute of Bioinformatics Semantic Web of data. Nucleic Acids Res. 2024 Jan 5;52(D1):D44-D51. doi: 10.1093/nar/gkad902
O’Neil ST, Schaper K, Elsarboukh G, Reese JT, Moxon SAT, Harris NL, Munoz-Torres MC, Robinson PN, Haendel MA, Mungall CJ. Phenomics Assistant: An Interface for LLM-based Biomedical Knowledge Graph Exploration. bioRxiv 2024.01.31.578275; doi: 10.1101/2024.01.31.578275
Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos JL Jr, Xiong C, Sun ZZ, Socher R, Fraser JS, Naik N. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023 Aug;41(8):1099-1106. doi: 10.1038/s41587-022-01618-2
