ControlNET
A Firewall for RAG-based LLMs
City University of Hong Kong
China Southern Power Grid
Abstract
Retrieval-Augmented Generation (RAG) has significantly enhanced the factual accuracy and domain adaptability of Large Language Models (LLMs), enabling their widespread deployment across sensitive domains such as healthcare, finance, and enterprise applications. RAG mitigates hallucinations by integrating external knowledge, yet it also introduces privacy and security risks, notably data breaches and data poisoning. While recent studies have explored prompt injection and poisoning attacks, there remains a significant gap in comprehensive research on controlling inbound and outbound query flows to mitigate these threats. In this paper, we propose an AI firewall, ControlNET, designed to safeguard RAG-based LLM systems from these vulnerabilities. ControlNET controls query flows by leveraging the activation shift phenomenon to detect malicious queries and mitigates their impact through semantic divergence. We conduct comprehensive experiments using state-of-the-art open-source LLMs (Llama3, Vicuna, and Mistral) across four benchmark datasets (MS MARCO, HotpotQA, FinQA, and MedicalSys). Our empirical results demonstrate that ControlNET is not only effective but also harmless: it achieves an AUROC exceeding 0.909 for risk detection, while risk mitigation reduces Precision and Recall by less than 0.03 and 0.09, respectively. Overall, ControlNET offers an effective, harmless, and robust defense mechanism, marking a significant advancement toward the secure deployment of RAG-based LLM systems.
What's ControlNET?
ControlNET is a novel AI firewall designed for RAG-based LLM systems that detects and mitigates privacy and security threats by analyzing neuron activation shifts. It identifies malicious queries through distinct activation vector patterns and redirects model behavior to prevent harmful responses, enabling secure and privacy-preserving interactions.
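To make the idea of detecting queries by their activation patterns more concrete, here is a minimal sketch. It is not the authors' implementation: it assumes that a hidden-state activation vector has already been extracted for each query, and it scores a query by its cosine distance from the mean activation of known benign queries. The helper names (`mean_vector`, `cosine_distance`, `is_flagged`), the toy vectors, and the `THRESHOLD` value are all hypothetical and chosen only for illustration.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity; larger values mean the directions differ more."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def is_flagged(activation, benign_centroid, threshold):
    """Flag a query whose activation drifts too far from the benign centroid."""
    return cosine_distance(activation, benign_centroid) > threshold

# Toy 3-dimensional "activations" (assumed data, not taken from the paper).
benign_activations = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.1, 0.0, 0.1]]
centroid = mean_vector(benign_activations)

THRESHOLD = 0.2  # hypothetical cut-off; in practice it would be tuned on held-out data
benign_query = [1.0, 0.1, 0.1]
suspicious_query = [0.1, 1.0, 0.9]

print(is_flagged(benign_query, centroid, THRESHOLD))
print(is_flagged(suspicious_query, centroid, THRESHOLD))
```

A single centroid and a fixed threshold are the simplest possible choice here; the page itself does not specify how ControlNET compares activation vectors, so this should be read only as an illustration of the general principle.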
Experiment Results
We conduct comprehensive experiments using state-of-the-art open-source LLMs (Llama3, Vicuna, and Mistral) across four benchmark datasets (MS MARCO, HotpotQA, FinQA, and MedicalSys). Our empirical results demonstrate that ControlNET is not only effective but also harmless: it achieves an AUROC exceeding 0.909 for risk detection, while risk mitigation reduces Precision and Recall by less than 0.03 and 0.09, respectively. Overall, ControlNET offers an effective, harmless, and robust defense mechanism, marking a significant advancement toward the secure deployment of RAG-based LLM systems.
t-SNE visualizations of hidden-state activations for unauthorized access and conversation hijacking, respectively, illustrating distinct clustering between benign and malicious queries due to the Activation Shift Phenomenon.
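For readers unfamiliar with the AUROC metric used for risk detection, the sketch below shows one standard way to compute it: as the fraction of (malicious, benign) pairs in which the malicious query receives the higher risk score, with ties counted as half. The function name `auroc` and the toy scores are assumptions for illustration and are not results from the paper.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: risk scores, higher meaning "more likely malicious".
    labels: 1 for malicious, 0 for benign.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # malicious query ranked above benign query
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Toy detector outputs (assumed data).
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [1, 1, 0, 0]
print(auroc(toy_scores, toy_labels))
```

A value of 1.0 means every malicious query is scored above every benign one, while 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of queries, which is acceptable for a small illustration but would normally be replaced by a rank-based computation on larger evaluation sets.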
Future Work
Future work will focus on extending ControlNET to LLM agentic networks. These environments introduce new security challenges, such as multi-agent interactions, which current models do not fully address.
Citation
@article{yao2025control,