Secuvy

OpenAI’s Privacy Filter Validates a Major Data Problem. Enterprises Have Dozens.

By Prashant Sharma, CTO, Secuvy

When OpenAI open-sourced its Privacy Filter (OpenAI Privacy Filter GitHub), the enterprise AI community took notice — and rightly so. A frontier lab releasing a production-grade model for personally identifiable information (PII) detection and masking validates what many data and AI teams have been arguing for years: sensitive data filtering isn’t optional infrastructure. It’s the foundation everything else is built on.

That validation matters. But it also surfaces a harder reality that goes beyond just PII.

OpenAI’s Privacy Filter handles eight categories of personal information — a practical and well-considered taxonomy for many common workflows. Most enterprises, however, are managing a much longer list of sensitive data. Examples include CUI, ITAR-controlled technical data, proprietary source code, contract terms, clinical trial identifiers, internal infrastructure references, and product roadmaps. “Sensitive data” is not a universal definition. It is specific to your industry, your regulatory environment, your business context, and the particular AI workflow you’re trying to govern.

OpenAI has validated the category and moved the conversation forward. What I want to explore here is the full scope of the problem — and what it actually takes to solve it at enterprise scale.

Privacy filtering is a non-negotiable norm across the AI pipeline

Most organizations started their AI journey by asking a simple question: which models should we use? Online or on-premises? Proprietary or open-weights? That was and is understandable. Model capability, cost, latency, and deployment strategy are the obvious first-order decisions.

The next phase is about data. Which data is appropriate for a model? Where does it reside today? Which data should never leave a controlled environment? Which data should be masked? Which fields are acceptable for one AI workflow but prohibited in another?

These questions are popping up across the AI lifecycle:

  • In pre-training and domain adaptation, companies need to make sure sensitive records, regulated information, and proprietary content do not flow unchecked into training corpora. Enterprises may want to pre-train on data that has been appropriately masked or substituted, preserving model performance without leaking sensitive data.
  • In fine-tuning, enterprises need to ensure customer-specific examples, support tickets, contracts, logs, code, and internal documentation do not reveal sensitive information that the model should not learn. Likewise, in ongoing reinforcement learning (via human feedback or otherwise), logs, transcripts, and summaries may need to be scrubbed or filtered.
  • During inference and in retrieval-augmented generation, enterprises need controls at the point where documents, chunks, metadata, prompts, responses, and tool outputs are assembled in real time. Ideally, the documents are already appropriately tagged with strong permissioning, or scrubbed with sensitive data masked.
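As a rough illustration of the inference-time control described in the last bullet, here is a minimal sketch in Python. It assumes chunks were tagged with sensitivity labels at ingestion; the `Chunk` type, the tag names, and the `assemble_context` helper are all hypothetical and do not describe any particular product’s API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    tags: frozenset  # sensitivity tags assigned at ingestion (hypothetical names)


def assemble_context(chunks, allowed_tags):
    """Keep only chunks whose tags are all permitted for this workflow.

    Untagged chunks are dropped as well (fail closed), on the assumption
    that unclassified content should not reach a prompt.
    """
    return [c for c in chunks if c.tags and c.tags <= allowed_tags]


chunks = [
    Chunk("Release notes for the public API.", frozenset({"public"})),
    Chunk("Customer contract renewal terms.", frozenset({"contract_terms"})),
    Chunk("Runbook for the internal build cluster.", frozenset({"internal_infra"})),
    Chunk("Scanned document, not yet classified.", frozenset()),
]

# Hypothetical allow-list for an internal support assistant.
support_allowed = frozenset({"public", "internal_infra"})
context = assemble_context(chunks, support_allowed)
```

Failing closed on untagged chunks is a design choice in this sketch; a real deployment might route such content to classification first.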

This is why filtering, along with the associated categorization and tagging, is not only a model safety feature. It is an integral part of AI data preparation, governance, and privacy. The question is not whether a system can redact a phone number in a text file, but whether an enterprise can continuously prepare appropriate data for each AI application, at scale, with evidence and control.

Generic PII is a starting point

OpenAI’s Privacy Filter recognizes eight categories: account numbers, private addresses, private emails, private persons, private phones, private URLs, private dates, and secrets. That is a useful and practical taxonomy for many common workflows.
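To make the taxonomy concrete, the following is a small Python sketch of span masking. It assumes some detector has already returned character spans with labels; it does not call OpenAI Privacy Filter, and the `Span` type and `mask` function are illustrative names. The label strings follow the category names above, with `private_url` being my assumed spelling.

```python
from typing import NamedTuple

# The eight categories listed above, written as label strings (spelling assumed).
LABELS = {
    "account_number", "private_address", "private_email", "private_person",
    "private_phone", "private_url", "private_date", "secret",
}


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def mask(text, spans):
    """Replace each span with a [LABEL] placeholder.

    Spans are assumed not to overlap. Working right to left keeps the
    offsets of earlier spans valid while the string changes length.
    """
    out = text
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        if span.label not in LABELS:
            raise ValueError(f"unknown label: {span.label}")
        out = out[:span.start] + f"[{span.label.upper()}]" + out[span.end:]
    return out


text = "Email jane.doe@example.com or call 555-0100."
email, phone = "jane.doe@example.com", "555-0100"
spans = [
    Span(text.index(email), text.index(email) + len(email), "private_email"),
    Span(text.index(phone), text.index(phone) + len(phone), "private_phone"),
]
masked = mask(text, spans)
```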

Enterprises, however, rarely stop at generic PII. A defense contractor may care about CUI, ITAR-controlled technical data, CAD metadata, supplier identifiers, project names, program codes, and engineering drawings. A healthcare company may need to distinguish PHI, clinical notes, research identifiers, trial data, device telemetry, and payer information. A semiconductor or manufacturing company may care about schematics, test results, process recipes, source code, customer-specific configurations, and intellectual property embedded in documents or collaboration platforms.

“Private data” is not one universal list. It is different for every company. It changes by industry, geography, regulation, workflow, and business context. It also changes by AI application. The data that is acceptable for an internal summarization assistant may not be acceptable for external model fine-tuning. The data that can be exposed to a support agent may not be appropriate for an autonomous workflow that calls third-party tools.

This is where enterprises need more than a static filter. They need dynamic controls that reflect their own policies, schemas, sensitive fields, risk thresholds, and governance requirements.
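One way to picture such dynamic controls is policy expressed as data, keyed by workflow. The sketch below is only an assumption about how that could look in Python: the workflow names, the labels, and the `decide` function are hypothetical.

```python
from enum import IntEnum


class Action(IntEnum):
    # Ordered from least to most restrictive so max() picks the strictest.
    ALLOW = 0
    MASK = 1
    BLOCK = 2


# Hypothetical per-workflow policies: the same label can be treated differently.
POLICIES = {
    "internal_summarization": {
        "private_person": Action.ALLOW,
        "secret": Action.MASK,
        "source_code": Action.ALLOW,
    },
    "external_fine_tuning": {
        "private_person": Action.MASK,
        "secret": Action.BLOCK,
        "source_code": Action.BLOCK,
    },
}


def decide(workflow, labels_found):
    """Return the strictest action required by any detected label.

    Unknown workflows and labels missing from a policy default to BLOCK.
    """
    policy = POLICIES.get(workflow)
    if policy is None:
        return Action.BLOCK
    return max(
        (policy.get(label, Action.BLOCK) for label in labels_found),
        default=Action.ALLOW,
    )
```

Because the policy is a plain mapping, it can be changed without retraining a detector, which is the point of separating detection from enforcement in this sketch.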

Benchmarking OpenAI Privacy Filter

We recently ran internal testing comparing Secuvy’s detection engine with OpenAI Privacy Filter on a held-out English-language PII benchmark (ai4privacy/PII-masking-300k on Hugging Face). Our goal was not to turn the release into a competitive scorecard; that would be the wrong lesson to draw from the data. Let me explain.

On the benchmark we reviewed, Secuvy’s engine showed a higher micro span-F1 score: 0.908 compared with 0.896 for OpenAI Privacy Filter, a difference of about 1.2 percentage points. The gain came from higher precision (0.908 versus 0.845), which more than offset OpenAI Privacy Filter’s higher recall (0.954 versus 0.908).
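For readers unfamiliar with the metric, here is a minimal Python sketch of micro-averaged span precision, recall, and F1. It assumes exact-match scoring, where a predicted span counts only if its start, end, and label all match a gold span; the exact matching rule used in our testing is not spelled out here, so treat this as one common convention and not a description of the benchmark harness.

```python
def micro_span_prf(gold_docs, pred_docs):
    """Micro-averaged precision, recall, and F1 over (start, end, label) spans.

    Counts are pooled across all documents before the ratios are taken.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # predicted and correct
        fp += len(pred - gold)   # predicted but not in gold
        fn += len(gold - pred)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: two documents, three gold spans, three predictions, two matches.
gold_docs = [
    [(0, 4, "private_person"), (10, 20, "private_email")],
    [(5, 15, "private_date")],
]
pred_docs = [
    [(0, 4, "private_person"), (30, 34, "private_date")],
    [(5, 15, "private_date")],
]
precision, recall, f1 = micro_span_prf(gold_docs, pred_docs)
```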

Looking deeper, Secuvy’s classification performed well on dates and person-like identifiers in this dataset. It was competitive on high-volume classes such as account numbers and addresses, with address performance tied in F1. Those are important categories because they dominate enterprise document and record workflows.

We’re not claiming that Secuvy is superior across all categories of PII. The key takeaway is that Secuvy’s engine is competitive on detection quality in this benchmark, with strengths in several important entity types.

Most importantly, the market is not going to be decided by a single static benchmark. It will be decided by whether the privacy layer can adapt to the data, policies, and workflows of each organization.

Results Summary (OPF = OpenAI Privacy Filter; Δ is SDC relative to OPF):

Metric           | OPF   | Secuvy’s Data Classifier (SDC) | Δ
GPU usage (FP32) | ~6 GB | ~4 GB                          | -2 GB
GPU usage (FP16) | N/A   | ~2.5 GB                        | N/A
Micro precision  | 0.845 | 0.908                          | +6.3 pp
Micro recall     | 0.954 | 0.908                          | -4.6 pp
Micro F1         | 0.896 | 0.908                          | +1.2 pp
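The F1 figures follow directly from the precision and recall values, since F1 is their harmonic mean. A short Python check, taking each difference as SDC minus OPF and expressing it in percentage points:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


opf = {"precision": 0.845, "recall": 0.954}
sdc = {"precision": 0.908, "recall": 0.908}

opf_f1 = f1(**opf)  # about 0.896
sdc_f1 = f1(**sdc)  # about 0.908

# Differences in percentage points, SDC minus OPF.
deltas_pp = {
    "precision": round((sdc["precision"] - opf["precision"]) * 100, 1),
    "recall": round((sdc["recall"] - opf["recall"]) * 100, 1),
    "f1": round((sdc_f1 - opf_f1) * 100, 1),
}
```

The precision gain is larger than the recall loss, which is why the F1 difference comes out positive but small.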

Detailed Analysis:

  • private_date: +25.5 pp. Both models hit ~95% recall on dates. The difference is precision: SDC 0.66 vs OPF 0.35.
  • private_person: +4.5 pp. SDC identified noisy synthetic usernames (paaltwvkjuijwbj957, etc.) better than OPF. The dataset’s “private_person” definition includes usernames, which is unusual; OPF’s general person detector is more conservative.
  • account_number / private_address: Both models are essentially saturated on these classes.
  • private_email, private_phone, secret: OPF leads by about 1 pp, with stronger inherent calibration on these well-formatted entity types.
  • The SDC model is 3x smaller on disk and used about 33% less VRAM than OPF (~4 GB versus ~6 GB at FP32).
  • Throughput is comparable to OPF (+13%). Given more system resources, we expect roughly 3x the speed.

 

Customization is an enterprise requirement

OpenAI’s release of OPF is valuable because it gives the market a strong baseline. It is open-weights, permissively licensed under Apache 2.0, and usable locally. That is good for the ecosystem.

But even OpenAI’s own model card calls out its shortcomings. OpenAI notes that Privacy Filter identifies personal data spans that match its trained label taxonomy and definitions, and that model defaults may not satisfy organization-specific governance requirements without calibration or fine-tuning (OpenAI Privacy Filter model card). It also notes that changing label policies is not something the model supports dynamically at runtime; policy changes require further fine-tuning.
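To illustrate what calibration could mean in practice, here is a hedged Python sketch of per-label confidence thresholds. It assumes the detector exposes a confidence score for each span, which I am not asserting about OpenAI Privacy Filter specifically; the threshold values and the `calibrate` function are hypothetical.

```python
DEFAULT_THRESHOLD = 0.5

# Hypothetical organization-specific thresholds. A higher value trades recall
# for precision on that label; a lower value does the opposite.
THRESHOLDS = {
    "private_date": 0.8,  # many false positives observed, so be stricter
    "secret": 0.3,        # a missed secret is costly, so be more permissive
}


def calibrate(scored_spans, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep spans whose score meets the threshold configured for their label."""
    return [
        span for span in scored_spans
        if span["score"] >= thresholds.get(span["label"], default)
    ]


scored = [
    {"label": "private_date", "score": 0.60},
    {"label": "private_date", "score": 0.93},
    {"label": "secret", "score": 0.35},
    {"label": "private_email", "score": 0.45},
]
kept = calibrate(scored)
```

Thresholds like these adjust how aggressively existing labels are flagged. They do not add new labels, which is why a changed taxonomy still requires fine-tuning or a separate detection layer.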

This is the gap enterprises have to solve. They need to define what sensitive data means in their environment, then enforce that definition consistently across AI pipelines.

Sometimes that means classic PII. Sometimes it means secrets, credentials, and tokens. Sometimes it means private URLs, internal hostnames, and infrastructure references. And it can extend to regulated records, export-controlled content, contract terms, source code, product plans, or intellectual property.
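Some of these organization-specific types can be approximated with simple pattern rules layered on top of a general detector. The Python sketch below is an assumption about how that might look: the `corp.example.com` domain is a placeholder, and the second pattern follows the publicly documented shape of AWS access key IDs (the prefix `AKIA` plus 16 uppercase letters or digits).

```python
import re

# Hypothetical custom recognizers. Replace the domain with your own.
CUSTOM_PATTERNS = {
    "internal_hostname": re.compile(r"\b(?:[a-z0-9-]+\.)+corp\.example\.com\b"),
    "secret": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_custom(text):
    """Return (start, end, label) spans for every custom pattern match."""
    spans = []
    for label, pattern in CUSTOM_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


text = "Deploy to build01.corp.example.com using AKIAIOSFODNN7EXAMPLE."
found = find_custom(text)
matched = [(text[start:end], label) for start, end, label in found]
```

Pattern rules only cover well-formatted identifiers. Content such as contract terms or engineering drawings needs classification that understands context, not just format.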

Customization has to happen at more than the model layer. It has to connect to policy. It has to understand where data lives. It has to support evidence for audit and compliance teams. It has to work across cloud, SaaS, data platforms, and on-prem environments. And it has to help teams decide not just whether something is sensitive, but whether it is appropriate for a specific AI workflow.

From privacy filter to AI Data Preparation

At Secuvy, we see this as part of a broader shift toward AI Data Preparation, AI governance, and AI privacy. Enterprises are not only asking how to adopt generative AI faster. They are asking us how to adopt it without exposing sensitive data, violating policy, or creating a governance problem they cannot explain later.

That requires a platform approach. Data has to be continuously discovered and classified. Sensitive content has to be cleansed, masked, redacted, or excluded before it feeds AI workflows. Controls have to be mapped to business context, regulatory context, geography, and intellectual property risk. And the process has to produce evidence that security, privacy, legal, and compliance teams can stand behind.
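As one hypothetical shape for that evidence, the Python sketch below builds an audit record for a single masking decision. The field names and the `audit_record` function are illustrative only. The record stores a SHA-256 digest of the original text instead of the text itself, so the log does not become a second copy of the sensitive data.

```python
import hashlib
import json
from collections import Counter


def audit_record(doc_id, original_text, spans, workflow, action):
    """Summarize what was detected and what was done, without storing content."""
    return {
        "doc_id": doc_id,
        "workflow": workflow,
        "action": action,
        "sha256": hashlib.sha256(original_text.encode("utf-8")).hexdigest(),
        "label_counts": dict(Counter(label for _, _, label in spans)),
    }


record = audit_record(
    doc_id="ticket-001",  # hypothetical identifier
    original_text="Email jane.doe@example.com or call 555-0100.",
    spans=[(6, 26, "private_email"), (35, 43, "private_phone")],
    workflow="external_fine_tuning",
    action="mask",
)
line = json.dumps(record, sort_keys=True)  # one JSON line per decision
```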

OpenAI’s Privacy Filter validates that direction. But for enterprise adoption, the next major step is filtering with flexibility and adaptability. The future of AI governance will not be one-size-fits-all filtering, and we at Secuvy can help you with that.

June 02, 2026