
The School of Computer Science is pleased to present…

Privacy Preservation for Synthetic Data Generation Using Deep Generative Models and Large Language Models (LLMs): A Case Study of Genomic Data

PhD Dissertation Proposal by: Reem Al-Saidi

 

Date: Thursday, March 27th, 2025

Time: 2:00 PM

Location: Chrysler Hall South, Room 51

Abstract:

Genomic sequencing data has become indispensable for analyses that advance our understanding of human health and genetic disease and support the development of personalized medicine. However, sharing or publishing this data, or even just the results of genomic analyses, poses significant privacy risks due to the sensitive nature of genetic information: genomic data reveals personal characteristics and familial relationships, which could lead to misuse or accidental disclosure. Many privacy-preserving genomics techniques have been developed in response, each with its own challenges. The generation of synthetic genomic data has emerged as a potential solution for privacy-preserving sharing and publishing of real genomic data. However, existing methods face significant challenges in maintaining privacy while preserving utility across different genomic datasets; PrivBayes, for example, suffers from performance degradation on sequences longer than 100 nucleotides due to strong correlations among positions.

This research aims to generate synthetic genomic data while balancing utility and privacy. To that end, we explore deep generative models trained with privacy in mind, as well as both general-purpose LLMs (e.g., GPT-2, LLaMA, Claude, Mistral) and genome-specific models (e.g., DNA-GPT, Evo, Nucleotide Transformer), to assess their capabilities for genomic synthetic data generation. By examining this full spectrum of generative and large language models, we aim to uncover strategies that maximize the utility of biological features while addressing critical privacy risks at both the individual and family level. Our synthetic genomic data generation and privacy assessment framework uses genomic data from the 1000 Genomes Project, specifically the CHB (Han Chinese in Beijing) and CEU (Utah residents of European ancestry) populations. A primary objective of this research is to assess whether Differential Privacy (DP) fine-tuning provides stronger privacy guarantees for the generated synthetic genomic data than non-DP fine-tuning. Initial results from our assessment framework show that smaller models tend to lack utility and are more vulnerable to attacks. Moreover, genome-specific LLMs show lower privacy risk and higher utility than general-purpose LLMs. More interestingly, DP fine-tuned models lower the privacy risk while maintaining the utility of the generated genomes, indicating DP's effectiveness in protecting privacy without significant accuracy losses. In conclusion, our research demonstrates new methods for producing synthetic genomic data and highlights the importance of fine-tuning, with or without DP, for increasing the utility of synthetic genomic data while preserving its privacy.
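To give a flavor of the differential-privacy idea underlying DP fine-tuning, the sketch below applies the standard Gaussian mechanism to a released statistic (here, a hypothetical allele frequency). This is an illustrative simplification, not the dissertation's actual training pipeline: DP fine-tuning of a model perturbs clipped per-example gradients rather than a final statistic, but the noise calibration follows the same (epsilon, delta) logic. All values (the frequency, sensitivity, and budget) are made-up examples.

```python
import math
import random

def gaussian_mechanism(value: float, sensitivity: float,
                       epsilon: float, delta: float) -> float:
    """Release `value` with (epsilon, delta)-DP by adding Gaussian noise.

    Noise scale follows the classic analytic bound:
    sigma >= sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon.
    """
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    return value + random.gauss(0.0, sigma)

# Hypothetical example: privatize an allele-frequency estimate computed
# over n = 1000 individuals, so one person changes the count by at most 1/n.
n = 1000
true_freq = 0.42
noisy_freq = gaussian_mechanism(true_freq, sensitivity=1.0 / n,
                                epsilon=1.0, delta=1e-5)
```

In DP-SGD-style fine-tuning, the same calibration is applied per training step to clipped gradients, and the per-step guarantees are composed into an overall privacy budget for the fine-tuned model.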

Thesis Committee:

Internal Reader: Dr. Saeed Samet

Internal Reader: Dr. Pooya Moradian Zadeh      

External Reader: Dr. Mitra Mirhassani  

Advisor: Dr. Ziad Kobti


Registration Link (only MAC students need to pre-register)