We Shut Down Snowflake - And Here’s Why
Snowflake is a powerful tool - and yet today, we're shutting it down.
Right tool for the right job
Not every application fails because it’s bad. Sometimes, it just wasn’t designed for the problem you're trying to solve. That’s the real story behind why we shut down Snowflake and moved on to Trino.
Snowflake is excellent at what it does - but our workloads, data formats, and business needs didn’t match what it was built for.
That said, if I were starting from scratch and wanted to succeed with Snowflake, here’s what I’d do differently:
How I’d build for Snowflake
If I were starting from scratch - no legacy formats, no ingestion chaos - I’d build around Snowflake’s strengths:
Ingest everything directly into internal tables. No VARIANT columns, no guesswork - just clean, well-modeled data.
I’d avoid semi-structured fields whenever possible and stick to Snowflake’s best practices for table design. That includes leveraging clustering keys and avoiding wide, undefined schemas.
If I had to stage data in S3, I’d make sure it’s at least in Parquet - ideally Iceberg. That way, Snowflake’s query engine has metadata and file structure to work with, enabling pruning and parallelism.
The goal is simple: give the query engine as much context as possible.
Whether you’re using internal tables or querying external ones, you need full control over schema evolution and file layout. That means tighter pipelines, stricter contracts between producers and consumers, and a willingness to invest in early structure.
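As a rough sketch of what that could look like (the table, stage, and clustering keys below are hypothetical - yours would follow your own schema and query patterns):

```sql
-- A hypothetical internal table, explicitly typed - no VARIANT catch-all column.
CREATE TABLE analytics.events (
    event_time   TIMESTAMP_NTZ,
    account_id   NUMBER,
    event_type   STRING,
    country      STRING,
    amount       NUMBER(12, 2)
)
CLUSTER BY (event_type, TO_DATE(event_time));  -- clustering keys chosen for common filters

-- If data has to pass through S3 first, stage it as Parquet (or Iceberg) so the
-- engine gets real metadata to prune on, then load it into the internal table.
COPY INTO analytics.events
FROM @my_s3_stage/events/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```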
Snowflake works best when it knows your data better than you do
It’s a powerful tool - but only if you shape the system around it.
But Let’s Be Honest: You’re Locking Yourself In
If you follow the “Snowflake-first” path - clean schemas, internal tables, structured ingestion - you’ll likely make it work. It will perform well. Your analysts will love it. Your dashboards will be snappy.
But let’s not ignore the cost.
You’re shaping your entire architecture around a single vendor’s way of thinking.
Still, I’ll be the first to say: if Snowflake fits your current size and shape, use it. Just go in with eyes open about what it means for the future.
Why Snowflake Didn’t Work for Us
When we began this journey, our goal was simple: replace Hive. But our reality was anything but.
Our pipelines weren’t cleanly structured. They resembled an evolved version of Apache-style logs: partially structured data at the top level, and a payload field where users could shove whatever fields they wanted. No schema enforcement. No standard contracts. Just raw flexibility.
When we moved to Kafka, we paused to ask ourselves: Should we start enforcing schemas now?
We could have. But we didn’t.
The way our business works, data changes fast. Teams want to experiment, iterate, and move quickly - without being blocked by rigid structure.
So we made a conscious decision: embrace semi-structured data and let producers iterate freely. That meant no schema enforcement and the freedom to evolve message formats independently.
Once we made that call, we had to choose a file format. We looked at Parquet - of course we did. Iceberg wasn’t even in the Apache Incubator at the time. But Parquet just didn’t make sense.
Why?
Because the payload field, if stored as a long string inside Parquet, essentially doubled our storage footprint compared to plain compressed JSON. JSON also had one hidden superpower: human readability. You could just download a file and open it - no tooling required.
We knew we’d pay more in compute, but we saved significantly on storage. And with bronze-level data that gets scanned once and archived, storage wins.
Our model was simple:
Bronze = raw (JSON, semi-structured, cheap to store)
Gold = modeled (structured, optimized, queried regularly)
And in practice, we found that bronze files were touched exactly once. They’re scanned to build gold tables, and then they’re done. Archived forever.
Was JSON the perfect solution? No. But given the tradeoffs we had, we believe it was the right call.
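For illustration, here’s roughly what that bronze-to-gold step looks like, written as Trino-style SQL (table and field names are made up; at the time our pipelines actually ran on Hive):

```sql
-- Bronze: raw JSON, a few known top-level fields plus a free-form payload.
-- Gold: typed, columnar, partitioned - the thing analysts actually query.
CREATE TABLE gold.purchases
WITH (format = 'PARQUET', partitioned_by = ARRAY['dt']) AS
SELECT
    event_time,
    source,
    json_extract_scalar(payload, '$.user_id')                AS user_id,
    CAST(json_extract_scalar(payload, '$.amount') AS DOUBLE)  AS amount,
    date(event_time)                                           AS dt
FROM bronze.events
WHERE event_type = 'purchase';
```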
Enter Snowflake
So we went to Snowflake. We explained what we had: semi-structured data in JSON, stored in S3, with no schema enforcement.
They said: “No problem - use VARIANT columns. We’ll parse it for you.”
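In practice the pattern looked something like this simplified sketch (stage, table, and field names are invented):

```sql
-- External table over the raw JSON in S3: every row lands in a single VARIANT column.
CREATE EXTERNAL TABLE raw.events
    WITH LOCATION = @raw_json_stage/events/
    FILE_FORMAT = (TYPE = JSON);

-- Querying means parsing the VARIANT path by path, casting as you go.
SELECT
    value:source::STRING                 AS source,
    value:payload:user_id::STRING        AS user_id,
    value:payload:amount::NUMBER(12, 2)  AS amount
FROM raw.events
WHERE value:event_type::STRING = 'purchase';
```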
And it worked. At first.
Our proof-of-concept looked great. We could query the data. Things were fast. Everyone was excited.
Then we turned on a few of our larger topics.
As the data volume grew, the scanned data exploded. Performance dropped. And that’s when Snowflake started to fail us.
Snowflake wasn’t broken - it was doing exactly what it was supposed to do. But the price for querying semi-structured external JSON? That came in the form of brute-force compute.
We talked to their team. The advice was clear - and exactly what we expected:
“You should really insert your data into Snowflake tables. That’s when we shine.”
They weren’t wrong.
But that’s not what we wanted to do.
Because doing that would have meant changing how our entire data pipeline worked, enforcing schemas everywhere, and writing everything directly into a system we weren’t fully in control of.
And we already knew: that didn’t fit our architecture - or our culture.
How Trino Gave Us Back Control
By the end of 2022, I knew we were in trouble.
Snowflake costs were climbing. Hive was still heavily used. We had tried adding Spark on EMR to offload some compute and reduce Snowflake usage, especially for parsing semi-structured data. But all it did was make our infrastructure more complex and harder to operate.
We were stuck in a patchwork of expensive and brittle systems. Nothing felt like the right tool anymore.
It was around the holiday season. I had some quiet time and started reading up on recent changes in Trino (formerly Presto). I stumbled across an article introducing fault-tolerant execution. That caught my eye.
I knew Presto. I’d heard of Trino. But I always thought of it as just another data warehouse - honestly, I lumped it in the same category as Snowflake. Still, I had some time, so I got curious.
A few hours later, I had Trino running on my laptop, connected to our Glue catalog.
That’s when it hit me.
This wasn’t just another engine. This was the tool we were looking for all along.
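For context, the setup really was that minimal - roughly the following, once a Glue-backed catalog was wired up (catalog, schema, and table names here are hypothetical):

```sql
-- Assumes a catalog backed by the Hive connector pointing at Glue
-- (hive.metastore=glue in the catalog properties) and fault-tolerant
-- execution enabled on the cluster (retry-policy=TASK in config.properties).
SHOW SCHEMAS FROM lake;

SELECT event_type, count(*) AS events
FROM lake.bronze.events
WHERE dt = '2022-12-25'
GROUP BY event_type
ORDER BY events DESC
LIMIT 20;
```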
A New Direction
I messaged my CTO immediately:
There’s this open-source project, Trino. It connects to Glue, reads from S3, it has security layers now, and supports fault-tolerant execution. I think this might be it.
In January 2023, we kicked off a proof of concept.
A few weeks in, it was obvious - we had found our new engine.
By spring, we opened Trino access to internal users. At first, there was some hesitation. I’d been the one who championed Snowflake. Then Spark. Now Trino? People had questions.
And they were fair to ask.
But once people started running queries - once people realized that in Trino, running more queries didn’t mean paying more - that the cost was fixed, not usage-based - the skepticism disappeared almost overnight.
Trino didn’t just work. It changed the economics of how we worked.
No surprise bills for running something "too long." Resources were shared. Autoscaling worked. And most importantly: performance was fast.
From Query Engine to Migration Tool
The moment we understood what Trino really was - a query engine, not a warehouse or a database - we saw how powerful it could be.
Trino could connect to:
Hive
Snowflake
S3 (via Glue)
Even our legacy systems
Suddenly, Trino wasn’t just a replacement - it became a bridge.
We used it to query data across systems, and to gradually migrate workloads out of Hive and Snowflake. Adoption happened fast. Within a few months, Trino was running at full speed across the company.
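As a rough illustration of the bridging pattern (catalog, schema, and table names are hypothetical):

```sql
-- Read straight out of Snowflake through Trino and land the result as Parquet
-- in the Glue-backed lake catalog - one statement per table we wanted to move.
CREATE TABLE lake.gold.daily_revenue
WITH (format = 'PARQUET') AS
SELECT *
FROM snowflake.analytics.daily_revenue;

-- Or join across systems while both are still live, with no copies in between.
SELECT h.event_type, s.segment, count(*) AS events
FROM hive.bronze.events AS h
JOIN snowflake.analytics.accounts AS s
  ON h.account_id = s.account_id
GROUP BY h.event_type, s.segment;
```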
Yes, we had to bump up our resource caps a few times. But by then, we were already shrinking our Snowflake and Hive footprints. It was worth it.
Trino gave us optionality. It gave us leverage. And most importantly, it gave us clarity.
We finally had a system we could scale, understand, and trust.
But here’s the thing:
This story isn’t really about Trino.
And it was never just about Snowflake.
If it sounds like I’m criticizing Snowflake, I’m not. You could swap out the name in this story for almost any provider: Trino, Databricks, BigQuery, ClickHouse. All of them can work brilliantly - or fail painfully.
The difference is never just the tool. It’s the alignment between the tool and your architecture, your business needs, and your constraints.
Snowflake didn’t fail us.
We didn’t fail Snowflake.
We simply had different priorities than it was built to serve.
Every tool is an opinion about how data should flow. Make sure it matches yours.
If you don’t fully understand what you’re building - and what your technology actually allows you to do - you’re not just running a risk. You’re making a bet you might not even realize you're placing.
That’s the story here.