Around 2008, storage prices reached a downward inflection point and have been falling sharply ever since. Technological advances such as the rise of SaaS, along with declining storage prices, have made it easy to collect unstructured data from multiple sources, including customers, data brokers, and sub-processors. As a result, businesses are storing far more data than before in order to derive meaningful financial results from it.

The data being harnessed today falls into four buckets:

  • Structured: Standard databases (e.g. Oracle or MySQL), where data types and classifications are strictly defined and are relatively easy to discover with SQL-based tooling.
  • Semi-Structured: Excel files, graph databases, CRMs like Salesforce, ServiceNow, infrastructure configuration with shared secrets, etc.
  • Unstructured: Documents, emails, files, source code, contracts, DPAs, etc.
  • Multimedia: Audio, video, animated GIFs, 3D AutoCAD drawings, AR/VR data, etc.

Businesses need to make it convenient for users to search for and buy products through multiple touch points, such as email, chat, phone, and text messages. This mixed-mode data collection makes governance a tedious process, especially given the rapid rise in unstructured data usage. About 80% of the data collected today is unstructured, and it is the fastest-growing category of data containing sensitive information.

Why is Data Discovery critical now?

Large Language Model (LLM) prompts give businesses an easy mechanism to combine and query structured, semi-structured, and unstructured data. Because LLMs have memorization and contextual linkage capabilities across multiple data categories, sensitive data can easily be queried and misused for a variety of purposes. Privacy, security, and legal teams need access to a data discovery tool that can search for and correlate sensitive data across multiple categories. Once the data is found, data governance can be implemented.

The contextual awareness and data inference capabilities of LLMs introduce the following risks:

  • User Attributes can be derived easily from partial information – For example, a user’s phone number can be extracted by chaining prompts that narrow down the population, such as “How many users live in San Francisco?” followed by “Which of these users are male and older than 30?”
  • Bias Introductions – If the underlying loan qualification ticket data favors particular races, ages, genders, or zip codes, LLM analytics queries used to vet new customers will be biased as well.
  • Information Misuse and Correction – LLMs do not offer straightforward mechanisms to update or correct underlying data, which makes it extremely hard to prevent data misuse. For example, if SSN information is accidentally ingested into an LLM, there is no mechanism to prevent the sharing and misuse of that information. Further, malicious users can combine personally identifiable information such as a social security number, birthdate, name, and address to do broader financial harm, such as applying for new credit card accounts or claiming social security benefits.
  • Nonexistent Data Retention/Erasure Capabilities – Current LLMs store word vectors that have thousands of associated/related keywords and attributes linked with any sensitive data. It is nearly impossible today to query and delete this data for a specific user or group of users.
  • Lineage for Sensitive Data Results – There are no tools today that adequately explain how a given result was derived or calculated. The number of vectors and their many dimensions make it computationally expensive to offer lineage for sensitive data. To give some sense of scale, the word “cat” has a 300-dimensional vector representation in commonly available models trained on Wikipedia. Loosely speaking, those 300 values encode the word’s associations with other words, so updating, deleting, or tracing the lineage of this single word means computing and studying a large amount of information. Further, each associated word may have hundreds of associations of its own, so the effort grows exponentially with every step.
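To get a feel for the scale described in the last bullet, here is a back-of-the-envelope sketch. It treats associations as links that must be followed to trace or delete a term, and the fan-out values (300 at the first hop, 100 at each later hop) are assumptions taken loosely from the figures in the text, not measurements of any real model.

```python
def reachable_terms(fan_out_per_hop):
    """Upper bound on terms touched when following associations hop by hop.

    fan_out_per_hop[i] is the ASSUMED number of associations per term at hop i.
    Overlap between neighborhoods is ignored, so the result is an upper bound.
    """
    total = 0
    frontier = 1  # start from the single sensitive term
    for fan_out in fan_out_per_hop:
        frontier *= fan_out  # every term on the frontier expands again
        total += frontier
    return total


# Assumed fan-outs: 300 direct associations, then 100 per associated term.
print(reachable_terms([300]))            # 300
print(reachable_terms([300, 100]))       # 30300
print(reachable_terms([300, 100, 100]))  # 3030300
```

Even under these rough assumptions, three hops from one sensitive term already touch millions of candidate associations.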

Current Data Discovery Mechanisms for the above Data Categories:

  • Unstructured Data: Cybersecurity tools are mostly pattern-based, or use minimal AI, to identify and classify sensitive data in unstructured sources such as logs, emails, or network traffic.
  • Structured/Semi-Structured Data: Data governance platforms offer visibility into sensitive data in structured sources such as databases, and partial coverage of semi-structured data in data stores such as Snowflake or MongoDB. Popular data governance tools offer policy-based masking or access restrictions for sensitive data. Partial discovery and anonymization of a few data attributes will not work for LLMs, because users can still derive relationships, e.g. by prompting for a phone number based on secondary information such as a zip code, location, or past employer.
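The pattern-based approach mentioned above can be sketched in a few lines. This is a minimal illustration, not how any particular product works: the three regular expressions, the `scan_text` helper, and the sample strings are all made up for this example, and real scanners use far more robust rules.

```python
import re

# Illustrative patterns only (assumed US formats); not production-grade rules.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def scan_text(text):
    """Return {label: [matches]} for every pattern that matches the text."""
    findings = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[label] = matches
    return findings


log_line = "User jane.doe@example.com called from 415-555-0100; SSN 123-45-6789 on file."
findings = scan_text(log_line)
print(findings)

# The same fact written out in words slips straight past the patterns.
missed = scan_text("Her social is one two three, forty-five, sixty-seven eighty-nine.")
print(missed)  # {}
```

The second call shows the limitation: pattern matching only finds data that looks the way the rule author expected, which is rarely guaranteed in free-form text.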

Essentially, businesses need a comprehensive view of each individual’s information, across all sources within the company, to prevent data misuse via LLMs. Gone are the days when data could be encrypted and masked in a database on the presumption that a user’s identity cannot be derived from secondary attributes like address, zip code, income range, or even the make and model of a car.
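A small sketch shows why masking direct identifiers is not enough. All records, names, and field choices below are hypothetical; the point is only that a unique combination of secondary attributes can be joined against auxiliary data to recover an identity.

```python
from collections import defaultdict

# Secondary attributes assumed to survive masking (hypothetical choice).
QUASI_IDENTIFIERS = ("zip_code", "income_range", "car")

# Hypothetical "masked" records: names removed, secondary attributes kept.
masked_records = [
    {"record_id": "r1", "zip_code": "94107", "income_range": "100-150k", "car": "Tesla Model 3"},
    {"record_id": "r2", "zip_code": "94107", "income_range": "50-100k", "car": "Honda Civic"},
    {"record_id": "r3", "zip_code": "94110", "income_range": "50-100k", "car": "Honda Civic"},
]

# Hypothetical auxiliary data available from some other source.
auxiliary_records = [
    {"name": "Alice Example", "zip_code": "94107", "income_range": "100-150k", "car": "Tesla Model 3"},
    {"name": "Bob Example", "zip_code": "94110", "income_range": "50-100k", "car": "Honda Civic"},
]


def key_of(record):
    """Tuple of quasi-identifier values used as the join key."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_records(masked, auxiliary):
    """Map record_id -> name wherever the attribute combination is unique."""
    by_key = defaultdict(list)
    for record in masked:
        by_key[key_of(record)].append(record["record_id"])
    linked = {}
    for person in auxiliary:
        candidates = by_key.get(key_of(person), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            linked[candidates[0]] = person["name"]
    return linked


reidentified = link_records(masked_records, auxiliary_records)
print(reidentified)  # {'r1': 'Alice Example', 'r3': 'Bob Example'}
```

Two of the three masked records are re-identified without touching a name, SSN, or phone number column.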

How Secuvy’s Self-learning AI Data Discovery helps with LLMs and beyond

Secuvy uses context-driven, self-learning AI to create a multi-dimensional metagraph of an individual’s data. This metagraph focuses on contextual attributes such as Residency, Purpose of Use, and Security/Privacy Risk(s). These attributes are self-learned from all data sources, both unstructured and structured. The result is a comprehensive yet compact live UserGraph, which helps privacy and security teams derive data correlations, lineage, and context for any data. The graph is self-aware: it is constantly updated and discovers new sensitive data on its own based on contextual information, so businesses do not have to write or update scan rules manually. Further, our unique data correlation capabilities automatically summarize all personal data associated with each individual.
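To make the idea of correlating an individual’s data across sources concrete, here is a deliberately simple toy sketch. It is not Secuvy’s implementation and involves no learning: the records, the field names, and the rule of linking records that share an email or phone value are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical discovered records; sources and fields are made up.
records = [
    {"source": "crm", "email": "jane@example.com", "phone": "415-555-0100"},
    {"source": "support_email", "email": "jane@example.com", "residency": "California"},
    {"source": "call_log", "phone": "415-555-0100", "purpose": "billing"},
    {"source": "crm", "email": "raj@example.com", "residency": "Germany"},
]

LINK_FIELDS = ("email", "phone")  # assumed fields that tie records to a person


def correlate(records):
    """Group records sharing any link-field value into one profile each."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}
    for index, record in enumerate(records):
        for field in LINK_FIELDS:
            if field in record:
                key = (field, record[field])
                if key in first_seen:
                    parent[find(index)] = find(first_seen[key])
                else:
                    first_seen[key] = index

    merged = defaultdict(lambda: defaultdict(set))
    for index, record in enumerate(records):
        for field, value in record.items():
            merged[find(index)][field].add(value)
    return [{f: sorted(v) for f, v in profile.items()} for profile in merged.values()]


profiles = correlate(records)
for profile in profiles:
    print(profile)
```

Here the CRM entry, the support email, and the call log collapse into a single profile for one person, while the second CRM entry stays separate.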

To summarize, Secuvy’s Data Discovery offers:

  • 360-degree view of Sensitive Data for each individual across multiple sources
  • Correlation Graph to see all related data attributes for a user in a single place
  • Data classification based on purpose, use, access & more
  • Legal Obligations associated with an individual’s residency & consent

We started Secuvy with a vision to offer a holistic approach to data security, privacy, and governance. We empower businesses to find, correlate, and remediate data changes for any data category, without the burden of constantly monitoring data or creating rules and policies, and without overwhelming the team with alerts and false positives.

Our unique ability to understand and create context across multiple data categories provides a better view of both legal obligations and security risk, which is invaluable for any business. We are the first in the industry to create this kind of data graph focused exclusively on privacy and security use cases. Please schedule a demo with us for all your LLM governance needs.