
Large Language Models (LLMs), Data Privacy, and Security: What You Need to Know

What inspired the rise of LLMs and why now? 

We have witnessed the remarkable rise of generative AI, powered by large language models (LLMs) pre-trained on vast amounts of data. Chat-based interfaces have democratized AI for the general public, which has led us to where we are today.

The underlying LLM technology enables anyone to use natural-language prompts to elicit a specific or intended response. These responses can range from generated text and code to completed tasks and more. This “natural language understanding” is similar to how humans interpret language, something that previously has not been possible for machines.

We have seen chatbots serve as effective stock market recommenders, customer support agents, legal case analysts, and even healthcare assistants for complex cases. Many of these agents maintain sophisticated personas and conversations while retaining context to complete transactions. The current generation of conversational AI is already helping businesses scale routine tasks such as generating invoices, automating returns, researching complex topics, and drafting initial content, saving hours and even days of human work. We are only seeing the initial impact of how far automation can be pushed and how complex the tasks handled by AI can become. Given these cost savings and efficiency gains, the use of AI chatbots, agents, and task handlers will continue to spread.

With these capabilities comes risk: customer data can easily be misused. Private conversations, activity logs, and user cookies can now be combined to enable privacy and security breaches through a simple prompt. Because we do not yet have data governance controls for protecting the data inside LLMs, a malicious actor can request and receive sensitive data about a business or person with relative ease.

How LLMs evolve with reinforcement learning from human feedback

LLMs need to be trained on large amounts of data, including text, audio, and video. Human input is needed to provide feedback for fine-tuning the responses from LLMs, or to provide a baseline for correct and incorrect answers. Current LLMs can be trained on hundreds of billions of tokens of text covering a wide range of situational information, creating a knowledge base that can answer a wide range of questions. This input data varies and can include source code, database schemas, sales metrics, customer demographics, legal contracts, IT support tickets, patient chats, Slack messages, and more.

Today, 80–85% of data is unstructured, and many businesses lack the tools to fully utilize it. With LLMs, we have an opportunity to make sense of this rich unstructured information and streamline efforts to use it for meaningful purposes.

Because of their auto-learning capabilities, LLMs can be fed data constantly. Their remarkable capacity for memorization gives us querying, recommendation, and problem-solving abilities that have not been possible before. A key strength of LLMs is feedback-based learning, where responses can be augmented and improved over time.

How businesses interact with LLMs today

Businesses interact with LLMs in three ways today:

  1. Hosted LLM platforms – directly using an LLM created and maintained by a business with AI expertise, such as OpenAI.
  2. Embedded apps – chat/conversational bots within a platform already in use, like Google Docs or Office 365.
  3. Self-hosted model – either train an LLM from scratch or take an open-source LLM like Alpaca, fine-tune the weights, and maintain a self-hosted version.

As of now, few companies have the cloud infrastructure and AI expertise to create these models from scratch and manage and maintain them for others to consume. Several cloud providers, including Amazon Web Services, Google Cloud Platform, and Microsoft Azure, are in the process of offering hosted LLM-based services. Smaller models that can be easily trained and run in restricted compute environments, including mobile devices, are also being actively researched.

Today, the majority of businesses use embedded application workflows to interact with LLM prompts in their day-to-day work. Tools include word processors, source code checkers, SQL query generators, email responders, customer support bots, meeting summarizers, trip planners, legal document analyzers, and more. Through these approaches, workflows and bots can ingest custom datasets and provide responses against them.
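
To make the data flow concrete, here is a minimal sketch of such an embedded workflow: a document snippet and a question are combined into a prompt and sent to a hosted LLM. The endpoint URL, request fields, and response shape are hypothetical placeholders rather than any specific vendor's API; the point is that everything placed in the prompt leaves the business's own boundary.

```python
import os
import requests

# Hypothetical hosted-LLM endpoint; not a specific vendor's API.
LLM_ENDPOINT = "https://llm.example.com/v1/generate"
API_KEY = os.environ.get("LLM_API_KEY", "")

def answer_from_document(document_text: str, question: str) -> str:
    """Send a business document plus a question to a hosted LLM and return its answer."""
    prompt = (
        "Answer the question using only the document below.\n\n"
        f"Document:\n{document_text}\n\n"
        f"Question: {question}"
    )
    response = requests.post(
        LLM_ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "max_tokens": 300},
        timeout=30,
    )
    response.raise_for_status()
    # Assumed response shape: {"text": "..."}
    return response.json()["text"]

# Anything inside document_text (PII, contract terms, financials) is transmitted
# to the provider and may be logged or retained there.
print(answer_from_document("Invoice #1042 for Acme Corp, total $12,500, due 2024-07-01.",
                           "What is the invoice total?"))
```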

Privacy/Security concerns introduced via LLMs

There are several data security and privacy concerns introduced when LLMs are used in a business context. Our intention with this blog series is to provide an overview along with guidance toward mitigation or remediation.

The key areas are: 

Dark Data Misuse & Discovery

LLMs can consume any kind of input data. A few areas that need attention include dark data in files and emails, orphaned database tables, intellectual property created by off-boarded employees, privacy data and confidential information about the business itself, and more. Any dark PII unknowingly used for training or querying can cause severe, unintended consequences, resulting in financial harm and loss of reputation. Dark data, including PII, is a major problem because LLMs can create associations with published data, opening the door to data breaches and leakage. Data poisoning or unintentional bias can easily occur because businesses have poor visibility into what data is used as input or feedback for LLMs.
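
A first line of defense is simply scanning text for obvious PII before it is used to train or prompt an LLM. The sketch below is a minimal illustration using a handful of regular expressions; the pattern set, category names, and sample record are assumptions for the example, and real discovery tooling covers far more data types than regexes alone can.

```python
import re

# Illustrative PII patterns only; a real scanner covers many more categories.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_pii(text: str) -> dict:
    """Return the matches found for each PII category present in the text."""
    hits = {}
    for category, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[category] = matches
    return hits

record = "Ticket 8812: customer jane.doe@example.com, SSN 123-45-6789, callback 415-555-0101."
pii = find_pii(record)
if pii:
    print("Hold back from LLM ingestion, PII detected:", pii)
```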

Biased Outputs

Businesses need to be vigilant about using LLMs for activities prone to bias, e.g., analyzing resumes for employment fit, automating customer service for low-income versus high-income groups, or forecasting healthcare issues based on gender, age, or race. A major issue in AI training today is unbalanced data, where one category overwhelmingly dominates the others, leading to bias or spurious correlations. A typical example would be any dataset with race, age, or gender distributions. Unbalanced data in these areas can lead to unexpected, unfair outcomes. And if the LLM is trained by a third party, the degree of bias introduced by these factors is unknown to the LLM consumer.
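
A simple pre-training check can at least surface the imbalance before it becomes a biased model. The sketch below counts the distribution of a sensitive attribute in a candidate training set and flags under-represented groups; the records and the 30% warning threshold are illustrative assumptions, not a recommended standard.

```python
from collections import Counter

def imbalance_report(records: list, attribute: str, warn_below: float = 0.30) -> None:
    """Print the share of each value of a sensitive attribute and flag small groups."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    for value, count in counts.most_common():
        share = count / total
        flag = "  <-- under-represented" if share < warn_below else ""
        print(f"{attribute}={value}: {count} records ({share:.0%}){flag}")

# Toy candidate training set for a resume-screening fine-tune.
training_records = [
    {"text": "...", "gender": "male"},
    {"text": "...", "gender": "male"},
    {"text": "...", "gender": "male"},
    {"text": "...", "gender": "female"},
]
imbalance_report(training_records, "gender")
```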

Explainability & Observability Challenges 

For the current set of publicly hosted LLMs, there are few ways to tie output results back to known inputs. LLMs can “hallucinate” and invent imaginary sources, making observability a challenge. For custom LLMs, businesses can inject observability during the training phase so that answers can later be correlated with their underlying sources to validate the output. Businesses also need to set up bias measurement and monitoring to ensure that the output of LLMs does not lead to harm or discrimination. Imagine the harm caused by an LLM-based medical notes summarizer producing different health recommendations for males versus females.
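
One way to operationalize that monitoring is to log each model output alongside the group it concerns and periodically compare rates across groups. The sketch below does this for the medical-summarizer example; the log entries and the 0.8 disparity threshold (borrowed from the common “four-fifths” heuristic) are illustrative assumptions.

```python
from collections import defaultdict

def follow_up_rates(logged_outputs: list) -> dict:
    """Rate of 'recommend follow-up' outputs per group, from the monitoring log."""
    totals, positives = defaultdict(int), defaultdict(int)
    for entry in logged_outputs:
        totals[entry["group"]] += 1
        if entry["recommended_follow_up"]:
            positives[entry["group"]] += 1
    return {group: positives[group] / totals[group] for group in totals}

# Toy monitoring log of summarizer outputs.
log = [
    {"group": "female", "recommended_follow_up": False},
    {"group": "female", "recommended_follow_up": False},
    {"group": "female", "recommended_follow_up": True},
    {"group": "male", "recommended_follow_up": True},
    {"group": "male", "recommended_follow_up": True},
]
rates = follow_up_rates(log)
print(rates)
if rates and max(rates.values()) > 0:
    if min(rates.values()) / max(rates.values()) < 0.8:
        print("Warning: follow-up recommendation rates diverge across groups; review for bias.")
```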

Privacy Rights & Auto-Inferences 

As LLMs ingest data, they can draw inferences from any categories of personal information provided, such as customer service records, behavior monitoring, or products considered. Businesses need to ensure that they have appropriate consent as a processor or sub-processor to derive these inferences. In the current setup, it is incredibly hard and expensive for businesses to keep track of privacy data rights and restrict usage accordingly.

Unclear Data Stewardship

Currently, there are no easy, efficient ways for LLMs to unlearn information. The way businesses use sensitive data as processors or sub-processors makes data stewardship complex to manage, and this significantly increases their legal obligations. For security teams, data inventory, classification, and automation are crucial to designing adequate safeguards for AI system inputs and output responses. Data fed into LLMs for training or prompts needs to be filtered to ensure that the information used falls within the identified scope and purpose of use.
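
One practical filtering pattern is a purpose-based allow-list: only fields approved for the declared purpose are permitted to flow into a prompt, and everything else is dropped before the LLM ever sees it. The field names and purposes in the sketch below are hypothetical examples of such a policy, not a prescribed schema.

```python
# Hypothetical policy mapping each declared purpose to the fields it may use.
ALLOWED_FIELDS_BY_PURPOSE = {
    "support_summary": {"ticket_id", "issue_description", "product"},
    "billing_dispute": {"ticket_id", "invoice_id", "amount"},
}

def filter_for_purpose(record: dict, purpose: str) -> dict:
    """Keep only the fields that are in scope for the declared purpose."""
    allowed = ALLOWED_FIELDS_BY_PURPOSE.get(purpose, set())
    dropped = sorted(set(record) - allowed)
    if dropped:
        print(f"Dropping out-of-scope fields for purpose '{purpose}': {dropped}")
    return {key: value for key, value in record.items() if key in allowed}

ticket = {
    "ticket_id": "8812",
    "issue_description": "App crashes on login",
    "product": "Mobile",
    "customer_ssn": "123-45-6789",  # must never reach the prompt
}
prompt_payload = filter_for_purpose(ticket, "support_summary")
print(prompt_payload)
```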

Next up: Improving Data Security and Privacy for LLMs

Given these data security challenges with large language models, security teams’ attack surface has increased exponentially, and it is more critical than ever to ensure that LLMs are being used safely and effectively. In the following blogs in this series, we will discuss how to improve privacy and security for LLMs across the following topics:

  • Data Discovery: identify risks and detect bias in unstructured, semi-structured, and structured data
  • Data Classification: establish LLM explainability KPIs; improve data insights based on purpose, residency, and scope
  • AI Governance Automation: set up automation for:
    • AI risk posture and prevention of bias failures
    • Minimizing data leaks and automating data security workflows

Tune in for blog #2 focused on Data Discovery and LLMs.
