Light on substance, heavy on hot takes!
Hard stuff
Honestly I wasn’t that productive coding this week. I’m thinking about how to spark joy.
Watch stuff
I saw this excellent conference talk by the founder of Snowflake. If you’re curious what all the buzz is about, I’d highly recommend it. Some takeaways:
- Snowflake developed for about two years before launching an initial product.
- Snowflake data is fundamentally immutable. However, the immutability is at the storage level; it presents tables as mutable representations. (Knit will encourage immutability but probably not require it.)
- Snowflake allows “time travel” or access to data history for 24 hours.
- Much ado has been made of Snowflake’s separation of data storage and compute, and its “virtual warehouses.” I don’t really see the huge hoopla (it seems like a fairly standard architecture), but it does give them a lot of buzzwords to highlight what I see as their core go-to-market innovation: pricing that scales linearly with storage/compute usage (basically all the way down to $0). This makes it really low risk for organizations to give Snowflake a try, and to justify “just one more” incremental adoption that snowballs (heh).
- Snowflake is bullish on its data marketplace, and in general on network effects of data sharing between organizations. Because of immutability and multi-tenant cloud infrastructure, Snowflake allows practically free cross-organizational data sharing regardless of data volume. I suspect this is a significant chunk of their $12B IPO valuation (now $77B market cap).
- Snowflake is a huge bet on the cloud, as its architecture leverages and relies on the cloud platforms. The differentiated advantages I see are very low friction onboarding (combined with their elastic pricing strategy), and enabling their data marketplace.
- Snowflake’s value prop is abstracting away infrastructure, optimization, and collaboration functionality.
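The immutable-storage-with-mutable-views idea combined with time travel can be sketched as a toy in Python. This is purely illustrative (not Snowflake’s actual implementation or API): writes append immutable snapshots, default reads present the latest snapshot as if the table were mutable, and older snapshots stay queryable.

```python
class VersionedTable:
    """Toy append-only table: every "mutation" creates a new immutable
    snapshot, and any historical version stays readable (time travel)."""

    def __init__(self):
        self._snapshots = []  # immutable history; never modified in place

    def write(self, rows):
        # A write never mutates prior data; it appends a new snapshot.
        self._snapshots.append(tuple(rows))
        return len(self._snapshots) - 1  # version id for later time travel

    def read(self):
        # Presents the table as mutable: reads always see the newest snapshot.
        return list(self._snapshots[-1])

    def read_at(self, version):
        # Time travel: read the table as of an earlier version.
        return list(self._snapshots[version])


t = VersionedTable()
v0 = t.write([("order-1", 10)])
t.write([("order-1", 10), ("order-2", 5)])
print(t.read())       # latest state: two rows
print(t.read_at(v0))  # time travel: the earlier one-row state
```

The nice side effect of this design (and presumably part of why Snowflake can offer cheap cross-organization sharing) is that a snapshot can be handed to another reader without copying, since nothing will ever modify it in place.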
Some gaps to observe:
- No story for on-prem or hybrid deployments.
- Not yet any Python or unstructured data support (but we can expect these soon).
- As a platform built fundamentally on data storage, other data integration tools are still needed to pull data in.
- So far Snowflake does not seem interested in orchestration and has been content to leave that to dbt and others (who have also gotten a boost from Snowflake tailwinds).
Relationship to Knit:
- On the surface Knit and Snowflake are complementary. Snowflake is fundamentally storage and Knit is fundamentally orchestration. Knit can be used to manage transforms on top of Snowflake, and fill in some gaps like hybrid cloud and data integration.
- Snowflake represents in a sense the ultimate opposite to Knit’s thesis. It is the ideology of the monolithic data warehouse (rebranded as data cloud) modernized and taken to its extreme. Despite being in somewhat different markets, Snowflake poses a risk to Knit if it is able to adequately service user pain points (data sharing, history, governance) with less capable data orchestration like dbt.
- If Snowflake made a huge push into orchestration that would be a big upset. There’s no sign of this but they have the resources.
- If they radically opened their pricing with a free tier plus their data marketplace, they could capture another market they are not currently targeting (non-commercial, hobbyist, open data). I think this would be brilliant marketing; they already dominate the commercial market, but this would let them also dominate mindshare (students, freelancers, and researchers would all sign up for Snowflake too). I kind of hope they do because it would be amazing, but hope they don’t because it would suck a lot of oxygen out of the room. I suspect they won’t because they’re going to be busy for years with their current sales traction.
News stuff
DVC released 2.0. DVC is conceptually probably the closest product to Knit. There are some fundamental differences, like DVC being built as a layer on top of Git, and only supporting processing of local files. Over the last year, DVC has repositioned squarely into the ML space. The 2.0 features are pretty much all marketed towards ML, but features like metrics support and their continuous integration package CML would actually be useful for other data processing if they weren’t so specialized. Some new features are designed to overcome limitations of Git (experiments and checkpointing).
Some interesting takeaways from the Python developers survey.
- 92% of data science users use a language besides Python. Bash/shell is their third most popular language, just behind SQL.
- Half of developers use Python for data analysis (only a third of those users call themselves data scientists), and half of those do data analysis as a secondary activity or hobby. Combined with ML, data analysis is 30% of developers’ primary use (more than web development).
- 77% use SQL databases, 41% use NoSQL databases.
- A third of developers use Python for web scraping (but only 4% as their primary use).
- Half of developers manage dependencies with virtualenv. A third manage dependencies with Docker. Half of developers who primarily use Jupyter notebooks manage dependencies with Conda.
- Gitlab CI is the most popular CI tool. A third of developers do not write tests (caveat: also a third of respondents have <1 yr industry experience).
- Making some inferences from the data: of professionals who do some data analysis, about half primarily use Windows, and a quarter each primarily use Linux and macOS.
- The most desired Python features, in order, are stronger typing, better performance, and better concurrency. (Arguably concurrency ~ performance, so most users just want Python to perform better.)
Off-topic, but the next most desired Python feature is pattern matching / a switch statement. If you haven’t been following, there’s a brouhaha going on over Guido’s latest pattern matching PEP.
I spent way too much time trying to analyze the CSV survey results in different BI tools. Jupyter notebooks probably make more sense if you can code.