12 Comments
Landon Robinson:

Great read! We’re in a similar situation with similar choices and pitfalls so this was a good perspective!

John Y Miller:

Well said!

"Every tool is an opinion about how data should flow. Make sure it matches yours." and

"The difference is never just the tool. It’s the alignment between the tool and your architecture, your business needs, and your constraints."

I hope people read beyond the headline... lol

Babak Tourani:

This is an excellent write-up and it makes sense. It also resonates with me, a lot. We are a Snowflake shop that has managed to keep our financial footprint controlled and thus small. A Small warehouse has been enough for us for years, but I definitely see your point about paying the cost for the bronze data that's going to be ingested once.

I'm not familiar with Trino, though. Could you please explain the point about having a "fixed, not usage-based" cost? I know that you have to provision a stable compute resource for Trino (let's say an EC2 node), but the cost of that would be associated with the volume and the types of queries that you run, right? So if you're running an aggregation query on data in S3, surely the scope of the query dictates how big the EC2 instance should be. The same argument applies to the number of people who concurrently use Trino. So on paper you pay a fixed cost for Trino, but then you have to manage the size of that instance against the workload, and that introduces an influencing "variable" element into the cost, so you can't really call it "fixed", right? I know that cost (excluding the Total Cost of Ownership) could be smaller than some specific Snowflake warehouse size running 24/7, but "fixed"?

Arturas Tutkus:

Thank you for your time reading the post. I'm happy you enjoyed it :)

What I mean by fixed cost: yes, we provision an EC2 instance for our workloads and keep it running 24/7, so we know exactly how much it will cost us. You could argue that even though Snowflake's warehouse hourly cost is higher than EC2's, you would still benefit from serverless billing (you only pay for what you use). But in our use case the cluster needed to be running almost 24/7, and because of that, raw EC2 was cheaper.
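To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. Both hourly rates are made-up placeholders, not real AWS or Snowflake prices; the point is the shape of the comparison, not the numbers:

```python
# Back-of-the-envelope: fixed 24/7 EC2 node vs. a usage-billed warehouse.
# Both hourly rates below are made-up placeholders, not real prices.

HOURS_PER_MONTH = 730

ec2_hourly = 1.0   # assumed on-demand rate for one node
wh_hourly = 3.0    # assumed per-hour rate while the warehouse runs

# Fixed: paid whether the node is busy or idle.
ec2_monthly = ec2_hourly * HOURS_PER_MONTH

def warehouse_monthly(busy_hours: float) -> float:
    """Usage-based: you pay only for the hours the warehouse is running."""
    return wh_hourly * busy_hours

# Above this many busy hours per month, the fixed node wins.
break_even_hours = ec2_monthly / wh_hourly

print(f"EC2 fixed:        ${ec2_monthly:.0f}/month")
print(f"Warehouse @ 100h: ${warehouse_monthly(100):.0f}/month")
print(f"Warehouse @ 24/7: ${warehouse_monthly(HOURS_PER_MONTH):.0f}/month")
print(f"Break-even at about {break_even_hours:.0f} busy hours/month")
```

With these placeholder rates the usage-billed warehouse is cheaper below roughly a third of full utilisation; a cluster that runs almost 24/7 sits well past the break-even point, which is the situation described above.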

richard:

Sorry for being a BigQuery defender :-)

Additionally, it's worth noting that Snowflake isn't truly "serverless". While there are many issues with BigQuery, it is the only one that genuinely claims the title of serverless. Like Gmail, BigQuery is always running and ready; you share the service with millions of other users. Google has specially designed ASIC chips for BigQuery hashing. That's why Google is the only vendor offering 1 TB of monthly data processing for free on BigQuery, without a credit card, and it's the reason students all over the world use it. Queries can scale from one to thousands of computers; the system only cares about the amount of data processed, not the time it takes. In contrast, Snowflake is designed to fit within a fixed-size box to help control costs, though Snowflake runs on any cloud.

Nowadays, Microsoft Fabric and AWS Athena have both entered the serverless space, but it's puzzling why they aren't more generous in sharing the extra capacity of their large computer clusters. I hope that one day, Microsoft Fabric and AWS Athena will become more generous towards the world.

Arturas Tutkus:

BigQuery didn't fit our requirements...

Lucius Benski:

Great post and wise words! Just one question where the post left me wondering: What exactly does Trino do better than Snowflake when it comes to semi-structured (JSON) data?

Rif:

I've been looking into this question since reading this great write-up. I think Snowflake isn't particularly good at automatically splitting a single large JSON file across multiple threads for parallel parsing on a single box, whereas Trino excels at handling large JSON files. It performs even better when working with JSON Lines/NDJSON (https://jsonlines.org/).
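A tiny sketch of why the line-delimited form parallelises so well: each NDJSON line is a complete JSON value, so a reader can split the input at newlines and parse each piece independently, which a single large JSON document does not allow. The records below are purely illustrative:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# The same three records in two encodings (values are illustrative).
single_doc = '[{"id": 1}, {"id": 2}, {"id": 3}]'  # one JSON value:
records = json.loads(single_doc)                   # must be parsed whole

ndjson = '{"id": 1}\n{"id": 2}\n{"id": 3}\n'       # one JSON value per line

# NDJSON can be split at newlines and each piece parsed independently,
# which is what lets an engine fan the work out across threads or splits.
with ThreadPoolExecutor() as pool:
    parsed = list(pool.map(json.loads, ndjson.splitlines()))

assert parsed == records
```

An engine applies the same idea at a much larger scale: a multi-gigabyte NDJSON file on S3 can be carved into byte-range splits at newline boundaries and handed to separate workers.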

Trino is designed as an in-memory system, and I suspect Snowflake takes a more conservative approach, likely due to its architecture being optimized around its own storage format.

It's interesting that Trino, built on a Java-based architecture, outperforms a C++ database in this case, especially considering that C++ with SIMD capabilities should theoretically have an edge. Yet Trino clearly solved the problem here.

Arturas Tutkus:

Lucius - great question! I would say nothing :)

richard:

Thank you for the detailed and transparent write-up. It's refreshing to see real-world experiences shared so candidly. I'm going to read up more on Trino and its handling of raw JSON, as it sounds like you're saying the differences in capability compared to Snowflake are significant. I really appreciate the time and effort it takes to put this kind of post together. Hope you don't mind me asking a few questions.

In your post, you mention:

"As the data volume grew, the scanned data exploded. Performance dropped. And that’s when Snowflake started to fail us."

Performance Metrics: It seems that Snowflake met your expectations during the proof-of-concept phase but encountered issues in production. Was this due to non-linear scaling i.e. from 1TB of JSON on S3 to 10TB? Did things slow down dramatically as the dataset grew?

Insights on Performance Metrics: Could you share any averages, such as the size of the JSON files, the number of files typically involved in a user query, or other relevant metrics?

Snowflake Marketing vs. Customer Reality: Snowflake markets features like External Tables and Apache Iceberg support, suggesting robust data lake capabilities. But can we genuinely say "Snowflake didn’t fail us" when its marketing emphasizes seamless handling of semi-structured data like JSON, unless Snowflake explicitly cautioned against using VARIANT columns for such data after PoC? I think what you're wanting the system to do is very common, not out of scope of the Snowflake roadmap.

What You Wish You Knew: What information would have helped you make a more informed decision before adopting Snowflake? Limitations around semi-structured data that, if disclosed upfront, might have influenced your choice?

Further Reading: Do you have any reading material, blogs, you’d recommend on either Trino with JSON or Snowflake’s performance nuances in data lake scenarios?

Thanks again for sharing your experience; it’s a conversation a lot of us benefit from.

Arturas Tutkus:

As I wrote, Snowflake didn't fail us. Snowflake didn't lie: we asked whether we could use semi-structured data, and the answer was yes, and it was the correct answer. Also, Snowflake DOES NOT HAVE special limitations around semi-structured data; all engines behave much the same here.

After we increased our data volume post-PoC, Snowflake would have continued to handle it, but it would have required a bigger warehouse.

After we calculated how many resources we would need, we understood we couldn't proceed with Snowflake and needed to change how much we spend. Trino is also not free, and it has the same limitations with semi-structured data as Snowflake. The main difference is that we do not need to pay a premium, so the economics are better.

What would have helped us: if back then I had understood that no software can change how physics works...

richard:

Thanks so much for replying. Given this is a KAYAK-sized system, I don’t see long-term value in the JSON file format. By providing your users with tools to read the binary, you’ll address what I think is a cultural issue, which can be hard. You could save a lot of energy and money by switching to Avro, MessagePack, or Protobuf. Companies like Google will drive internal system improvements using these formats.
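For illustration only, here is a rough sketch of the size difference between JSON text and a fixed binary layout, using Python's stdlib `struct` as a stand-in for a real schema-based format like Avro or Protobuf. The record and its fields are hypothetical:

```python
import json
import struct

# A hypothetical flight-search-style record.
record = {"origin": "BOS", "dest": "LHR", "price_cents": 48999, "stops": 1}

as_json = json.dumps(record).encode()

# Same record in a fixed binary layout: "<" = little-endian, no padding;
# "3s" = 3-byte string, "I" = uint32, "B" = uint8. Real formats like Avro
# or Protobuf carry the schema separately instead of repeating field names
# in every record, which is where most of the savings come from.
as_binary = struct.pack("<3s3sIB",
                        record["origin"].encode(),
                        record["dest"].encode(),
                        record["price_cents"],
                        record["stops"])

print(len(as_json), "bytes as JSON vs", len(as_binary), "bytes binary")
```

The binary form here is 11 bytes versus several times that for the JSON text; multiplied over billions of records, that gap is the energy and money the comment is pointing at.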

We'll have to see how Trino deals with learning the lessons of Databricks and Spark: building data processing on Java is just slow, and Databricks is rushing to fix that right now with the Databricks Photon engine.
