Integration of Structured Knowledge into Language Models for Cell Biology
Mentored by Kristin Reiche, Jens Lehmann
at Leipzig University
Characterizing biological cells by single-cell sequencing has dramatically transformed biomedical research. The molecular characterization of individual cells from different tissues informs, for example, drug development, drug repurposing and the risk assessment of side effects. The advancement of Large Language Models (LLMs) in natural language processing has led to their application in genomics and single-cell studies. The first pre-trained language models for cell biology are available [1-4], with examples including scBERT [1] for cell-type annotation and scGPT [2], which is the first foundation model for tasks like cell-type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference. However, our understanding of the extent to which these cell language models generate misleading information is still in its early stages.
Knowledge of biological entities and processes is predominantly made available in the form of graph-structured data (e.g. Gene Ontology, InterPro, Reactome). In this PhD thesis, you will integrate such biological knowledge into generative models, for example (i) in the form of knowledge graph embeddings [5, 6], (ii) as projections of structured data similar to vision encoders [7] or (iii) through data enrichment techniques using suitable queries over biological knowledge. The underlying models will have been pre-trained with single-cell multi-omics data. The primary objective is to improve the accuracy and inference capabilities of LLMs for (single) cell biology.
Short outline of tasks:
- Select a pre-trained cell language model for downstream analyses w.r.t. pre-defined criteria.
- Integrate biological knowledge into the cell language model.
- Define learning objectives and implement them in the model.
- Benchmark the combined model (cell language model and knowledge graph) against existing models to evaluate its performance.
- Apply the combined model to selected tasks, such as optimal target identification for immunotherapies.
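As a rough illustration of the integration step, the following minimal Python sketch shows one possible reading of option (ii): a knowledge graph embedding of a gene is passed through a linear projection into the embedding space of the cell language model and added to that gene's token embedding. All function names, dimensions and numeric values are hypothetical toy assumptions, the weights are random rather than learned, and the actual method is to be developed in the project.

```python
import random


def make_projection(d_in, d_out, seed=0):
    """Return a random d_out x d_in weight matrix (a stand-in for learned weights)."""
    rng = random.Random(seed)
    scale = 1.0 / d_in ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]


def project(weights, vector):
    """Apply the linear projection: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def fuse_gene_embeddings(gene_tokens, token_embeddings, kg_embeddings, weights):
    """Add the projected knowledge graph embedding to each gene's token embedding.

    Genes without a knowledge graph entry keep their original token embedding.
    """
    fused = []
    for gene in gene_tokens:
        token_vec = token_embeddings[gene]
        kg_vec = kg_embeddings.get(gene)
        if kg_vec is None:
            fused.append(list(token_vec))
        else:
            projected = project(weights, kg_vec)
            fused.append([t + p for t, p in zip(token_vec, projected)])
    return fused


# Toy inputs (made-up values): 4-dimensional token embeddings from a
# hypothetical cell language model, 2-dimensional knowledge graph embeddings.
token_embeddings = {
    "CD19": [0.1, 0.2, 0.3, 0.4],
    "CD3E": [0.5, 0.1, 0.0, 0.2],
    "GENE_X": [0.3, 0.3, 0.3, 0.3],  # hypothetical gene with no graph entry
}
kg_embeddings = {"CD19": [1.0, 0.0], "CD3E": [0.0, 1.0]}

weights = make_projection(d_in=2, d_out=4)
fused = fuse_gene_embeddings(
    ["CD19", "CD3E", "GENE_X"], token_embeddings, kg_embeddings, weights
)
```

In a real setting the projection would be trained jointly with, or on top of, the pre-trained model, and the knowledge graph embeddings could come from a library such as PyKEEN [6].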
Work Environment
Leipzig University/Fraunhofer IZI and TU Dresden
The appointment can be either at the Interdisciplinary Center for Bioinformatics in Leipzig or at TU Dresden. You will collaborate closely with the bioinformatics unit at the Fraunhofer Institute for Cell Therapy and Immunology – IZI in Leipzig. In addition, you will have access to the HPC resources provided by the ScaDS.AI Center for Scalable Data Analytics and AI Dresden/Leipzig. Travel between Leipzig University and TU Dresden is anticipated, to enable seamless collaboration and knowledge exchange.
SECAI offers a first-class environment for advancing your career. You can work with internationally renowned researchers and benefit from the school’s strong networks in industry and research. The graduation of highly qualified researchers is a central project goal in SECAI and doctoral students receive strong support for their professional and personal development.
Prerequisites
- A master’s degree in bioinformatics, computer/data science, mathematics, biology, chemistry or physics.
- Experience with at least one of the following programming languages: Python or C/C++.
- Sound background in machine learning, statistical learning and neural networks.
- Initial experience with ML/DL frameworks and related software such as (Py)Torch, TensorFlow, Keras, scikit-learn, MLflow or others.
- Sound background in Unix/Linux operating systems, the workload manager SLURM and the version control system Git.
- A core understanding of genetics, molecular biology and computational biology.
- Initial experience with the processing and statistical analysis of multi-modal biomedical datasets (omics at single-cell resolution is a plus).
- Excellent communication and interpersonal skills and the ability to work in an interdisciplinary team.
- Excellent written and verbal skills in English.
[1] Fan Yang et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nature Machine Intelligence volume 4, pages 852–866 (2022)
[2] Haotian Cui et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods. 2024 Feb 26.
[3] Hongzhi Wen et al. CellPLM: Pre-training of Cell Language Model Beyond Single Cells. Preprint http://dx.doi.org/10.1101/2023.10.03.560734
[4] Yunha Hwang et al. Genomic language model predicts protein co-regulation and function. Nat Commun. 2024 Apr 3;15(1):2880.
[5] Mehdi Ali et al. BioKEEN: a library for learning and evaluating biological knowledge graph embeddings. Bioinformatics. 2019 Sep 15;35(18):3538-3540.
[6] Mehdi Ali et al. PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings. Journal of Machine Learning Research. 2021.
[7] Haotian Liu et al. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)