External Systems Biology Knowledge Integration in Large Language Models
Mentored by Jens Lehmann & Ivo Sbalzarini
at TU Dresden/Informatics
Recent advances in large language models (LLMs) have led to substantial performance gains in several NLP-related AI tasks, in particular in generative conversational AI. However, LLMs are prone to hallucination: chatbots, for example, can generate superficially plausible but factually incorrect responses. One way to mitigate this problem is to integrate structured and unstructured knowledge sources into the token generation process, e.g., via dense embeddings (such as fusion-in-decoder mechanisms) or textual embeddings (via in-context learning). In this PhD thesis, we want to explore methods for generative generic retrieval in chatbots, i.e., approaches in which the core LLM is trained to query external knowledge when suitable. We want to investigate (a) how to train LLMs to decide when to query external knowledge, (b) how to enable LLMs to rewrite the user input so that they can obtain information from different specific sources (multi-hop and multi-source reasoning), and (c) what characteristics LLMs need to have so that they can combine their own parametric knowledge with external knowledge to maximize the factual accuracy of responses.
The main application scenario in which we will investigate the above research questions is a chatbot for systems biology. This means that our external knowledge sources will cover relevant systems biology knowledge such as (i) publications and books in the field, (ii) public databases of systems biology pathway models, molecular information, and species information, like SwissModel, NCBI or KEGG, and (iii) structured knowledge such as knowledge graphs and ontologies developed for systems biology, for example by the ICSB. Researchers and practitioners will be able to use the chatbot to ask questions about molecular interactions in biological systems, biophysical questions, and questions about regulatory motifs as well as their similarities and differences across species. In contrast to other LLMs, the output should be factually correct, up to date (without requiring retraining of the underlying LLM) and explainable (in the sense that sources are cited). Moreover, we want to investigate whether the LLM can be extended to include math functionality and information from GitHub repositories of systems biology research code. This would enable cross-modal queries spanning factual information, mathematical models/equations, and source code. A system like this is crucial for researchers to navigate and comprehend the enormous amounts of information and data available about biological systems, and to synthesize them into an understanding of the inner workings of life.
You will be a member of a collaborative project team working at the cutting edge of computer science within the SECAI Project. You will be supervised by Prof. Jens Lehmann (a world-leading expert in language models and knowledge graphs) and Prof. Ivo Sbalzarini (a renowned long-time computational systems biologist) in the Faculty of Computer Science at TU Dresden, and directly mentored by Dr. Sahar Vahdati and Dr. Nandu Gopan. You will contribute to the development of theory, models and algorithms for language models in systems biology.
You will have access to the machine-learning HPC resources of the ScaDS.AI Center for Scalable Data Analytics and AI Dresden/Leipzig, offering state-of-the-art CPU (Intel and IBM POWER9) and GPU (Nvidia A100 and V100) resources, as well as high-performance computing and storage systems for computational experiments. You will also work with the machine learning and AI researchers at the Center for Systems Biology Dresden (CSBD), a joint center of the Max Planck Society and TU Dresden that performs state-of-the-art research in systems biology, ensuring access to data and relevant real-world biological questions.
To conduct this research, you should hold a very good university degree (MSc or equivalent) in computer science or a related discipline (such as mathematics). You should have a strong background in the design, development, training and evaluation of machine learning approaches. An interest in systems biology and the willingness to learn the application vocabulary are mandatory, while previous experience in working with biological systems and data is desirable. Furthermore, the required background includes:
- Machine learning
- Representation learning / language models
- Conversational AI
- Interest in systems biology, ideally with previous experience in interdisciplinary work between computer science and biology
- Fluent English
- Python (in particular PyTorch and the Hugging Face libraries)
- C++/CUDA (in particular TensorFlow and SysBio APIs)