Most enterprise AI teams are solving the wrong problem first. They’re optimizing storage speed for data that was never safe or ready to use.
At NVIDIA GTC 2026, Jacob Liberman, NVIDIA’s Director of Enterprise Product, put it plainly: “People talk about data as a lake, but it’s more like a river. It’s flowing, it’s changing – constantly changing. So the first time you prepare your data is not going to be the last time. You have to continuously prepare your data for AI as it changes.”
That single statement captures the most underestimated problem in enterprise AI today: not the GPU shortage, not model selection, not infrastructure cost, but data preparation.
Fast Storage Fixes the Wrong Thing First
The GTC session – “Accelerating the Path to Production: The Evolution of Enterprise Storage to Deliver AI-Ready Data” – covered NVIDIA’s BlueField-4 STX architecture, partnerships with IBM, Dell, and NetApp, and the shift toward storage systems built for AI inference rather than human retrieval.
All of that matters. Fast, AI-native storage is a genuine requirement.
But Liberman also laid out what has to happen before data reaches storage infrastructure: extraction, enrichment, classification, embedding, indexing, and semantic search. Each step is resource-intensive. Each step runs continuously, or it fails.
And here’s the step most enterprise AI programs treat as an afterthought: classification. Not embedding. Not indexing. Classification – understanding what the data actually is, what it contains, and whether it belongs in the pipeline at all.
The Step Most Teams Skip
When enterprises build AI data pipelines, they focus on what goes in: volume, recency, and format. Can the storage system retrieve it fast enough? Can the vector database handle the query load?
The more important question rarely comes up: should this data be in the pipeline at all?
Enterprise data estates contain a mix of everything. Sensitive customer records from three years ago sit in a shared folder. Unreleased product documents sit in locations never flagged as restricted. Clinical trial data lives alongside general research files. ITAR-controlled engineering specs share a storage bucket with public documentation.
When an AI agent, RAG pipeline, or fine-tuning dataset pulls from that estate, it doesn’t discriminate. It retrieves what it can reach. It processes what it finds.
The result is both a security problem and a data quality problem. Unclassified, ungoverned, mixed-sensitivity data produces AI outputs that can’t be trusted, audited, or explained. And as Liberman noted, inference is increasingly where the value is created; put bad data into inference and you get bad decisions out.
Why “Continuous” Is the Critical Word
The most important thing Liberman said at GTC wasn’t about storage speed. It was about time.
Data doesn’t stay still. New files get created daily. Documents are modified, copied, and moved across systems. A dataset that was clean and appropriate last month may contain sensitive records this month, because someone added a new data source to the pipeline.
Static data preparation doesn’t solve this. A classification scan conducted when the pipeline was first built goes stale within weeks. New data arrives unclassified. Sensitive content drifts into storage locations where it shouldn’t exist.
Continuous data preparation means the classification layer runs alongside the data, not just once before it moves. It means new files are understood before they’re retrieved, and when a document is modified or moved, its classification updates in near real time, not on a quarterly audit cycle.
That’s exactly what NVIDIA’s storage partners are building toward on the infrastructure side. The question for enterprise AI teams is whether their data intelligence layer keeps pace.
When the Pipeline Doesn’t Know What It’s Retrieving
Traditional enterprise applications were built around specific, bounded data. A CRM system holds CRM data. An ERP holds ERP data. Classification was simple because the data was already contained.
AI agents don’t work within those boundaries. They pull from file shares, SharePoint, S3 buckets, SaaS platforms, data lakes, and internal knowledge bases, simultaneously, at scale, in real time. Each of those sources carries a different mix of data: some appropriate for AI pipelines, some sensitive or regulated, most of it never classified.
When the pipeline doesn’t know what it’s retrieving, two things happen. Inappropriate data reaches AI systems, bringing compliance and security exposure with it. And the model works with noisy, low-quality data mixed alongside high-quality data; its outputs reflect that.
The enterprises that successfully move AI from pilot to production aren’t the ones with the fastest storage. They’re the ones who know what’s in their data before it moves.
What the Data Preparation Layer Actually Needs to Do
The classification and intelligence layer sits between raw data storage and the AI pipeline, and it’s where Secuvy operates. Using self-learning AI rather than pattern matching or manual rules, Secuvy continuously discovers and classifies enterprise data across cloud, on-premises, and SaaS environments. It understands what data is, where it lives, and identifies sensitive, regulated, or inappropriate content before it enters any AI pipeline, RAG system, or LLM prompt.
Critically, it does this continuously. As new data arrives, as files are modified, as pipelines expand to new sources, the classification stays current. The first scan isn’t the last.
This delivers two outcomes simultaneously. First, protection: sensitive and regulated data is identified and filtered before it reaches AI systems. The pipeline isn’t just fast; it’s safe. Second, optimization: duplicate files, outdated records, ROT data, and low-value content are removed from the pipeline. The data that reaches the model isn’t just safe, it’s the right data, improving accuracy and reducing wasted GPU compute.
Both outcomes depend on the same foundation: knowing what your data is, continuously, before it moves.
The Production Gap Isn’t a GPU Problem
The path to production runs through data preparation. And data preparation isn’t a one-time project; it’s an ongoing function that has to keep pace with a data estate that never stops changing.
That’s the river problem Liberman described. The answer isn’t a better bucket. It’s a system that understands what’s in the water and keeps understanding it as the water flows.
Secuvy continuously discovers, classifies, and prepares enterprise data for AI pipelines, protecting what shouldn’t go in and surfacing the high-value data that should. See how the intelligence layer works at secuvy.ai.