Hello Knit neophytes! This came out pretty hefty. I think future issues of Knit picks will become briefer as more context is built up.
Soft stuff
Had a lot of high-quality conversations this week. It’s fun to see people get excited about the potential (and it feels like it karmically balances out some dud conversations I had the week before). In terms of pitch refinement, some Knit points that seem to resonate are:
- Anything we do with computers is fundamentally processing data. It’s all bits and bytes. The data problem space can be cast much wider than “analytics.”
- Notebooks, pipeline code, graphical DAGs! No single UI for data pipelines fits every situation.
- Snowflake, Looker, etc. are in the business of capturing value, so they focus on building large (i.e., expensive) offerings that are somewhat exclusionary. Knit wants to commodify data orchestration.
Maybelline
Knit is not “born with it.” How do we make it zing? Some thoughts I’ve had or heard:
- Complex pipelines with layers of composition (e.g., line of therapy calculations used in multiple disease modules).
- Fork it with Knit. Push-button reproduction of an existing data pipeline, without regenerating any data or setting up infrastructure.
- How it was made. Publish a data analysis using Knit across multiple tools or languages. A very small example is this Covid dashboard built in the open (code, daily CI). Note: it’s not written in Knit, and the code hasn’t been updated in almost a year (but the data updates are still chugging!).
- r/dataisbeautiful encourages remixing. Usually people use the same dataset or aesthetic style, but don’t share any code. If someone wants their work to be remixed, Knit could be an easy way to allow that.
Tech stuff
Did the hard parts of splitting Knit into frontend / backend layers. Quick recap: the backend manages dependencies and executes jobs, while the frontend can be a GUI, a notebook, templated code (à la dbt), etc. This took a little while; more on that in a sec.
Once that’s finished up, it should be straightforward to run pipelines that use multiple frontends, so I’ll be experimenting with that. I’m also planning some quality-of-life improvements for early users (error messages and the like). Here's a sketch of the split, below.
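To make the split concrete, here’s a minimal sketch of the shape I mean. To be clear: every name and structure below is invented for illustration, not Knit’s actual API. The point is just that a frontend only *describes* work, and the backend is the only thing that executes it.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the frontend/backend split -- made-up names,
# not Knit's real API.

@dataclass
class Job:
    name: str
    func: object                                # the callable to run
    deps: list = field(default_factory=list)    # names of upstream jobs

@dataclass
class Flow:
    jobs: dict = field(default_factory=dict)

    def add(self, name, func, deps=()):
        # A frontend (GUI, notebook, templated code) builds a Flow;
        # it never executes anything itself.
        self.jobs[name] = Job(name, func, list(deps))

def run(flow):
    # The backend: resolve dependencies and execute jobs in order.
    results = {}
    while len(results) < len(flow.jobs):
        for job in flow.jobs.values():
            if job.name not in results and all(d in results for d in job.deps):
                results[job.name] = job.func(*(results[d] for d in job.deps))
    return results

# Any frontend that can emit this Flow structure gets execution for free.
flow = Flow()
flow.add("load", lambda: [1, 2, 3])
flow.add("total", lambda xs: sum(xs), deps=["load"])
print(run(flow))  # {'load': [1, 2, 3], 'total': 6}
```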
Tech spew warning
Refactoring the frontend / backend touches one of the more conceptually tricky parts of Knit. Let’s follow this digression! I think of it as dynamic flow composition: any job in a Knit flow can run another Knit flow inside it. Taking it a step further, the steps in that nested flow can differ depending on data from the parent flow. A meta way to think about it is that Knit flows are data that Knit itself can process.
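Here’s a toy sketch of that idea (again, invented names, not Knit’s real API): if a flow is plain data, a job can construct a nested flow at runtime based on whatever the parent flow has produced so far.

```python
# Hypothetical sketch of dynamic flow composition. A flow is just data:
# an ordered list of (name, function) pairs, so a job can build and run
# a new flow at runtime.

def run_flow(flow, inputs=None):
    results = dict(inputs or {})
    for name, func in flow:
        results[name] = func(results)
    return results

def fan_out(results):
    # The nested flow's steps depend on data from the parent flow:
    # we only learn which regions exist after "regions" has run.
    nested = [(f"summarize_{rg}", lambda r, rg=rg: f"report for {rg}")
              for rg in results["regions"]]
    return run_flow(nested)

parent = [
    ("regions", lambda r: ["us", "eu"]),  # discovered at runtime
    ("reports", fan_out),                 # a job running a flow inside it
]
print(run_flow(parent)["reports"])
# {'summarize_us': 'report for us', 'summarize_eu': 'report for eu'}
```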
Knit uses dynamic flow composition to implement a couple of features. The most straightforward is running multiple user-defined flows together, such as combining different frontends. It is also used for partitioning/multiplexing, so you can run the same data analysis over subsets of the input. Some tools can statically compose flows, but static composition won’t add partitions as your data grows. Other tools partition at the data storage level (dbt, Snowflake, Spark), but those partitions usually only work within one storage system. A partitioning sketch follows.
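Partitioning falls out of the same trick. In this hypothetical sketch (same caveats as above), the partitions are discovered from the data at runtime, so the analysis picks up new subsets without any change to the flow definition, and nothing ties it to a particular storage system.

```python
# Partitioning/multiplexing as dynamic composition (hypothetical sketch):
# the parent inspects the data, then runs one copy of the same analysis
# per subset.

from itertools import groupby

def analysis(rows):
    # The per-partition flow body: any ordinary analysis.
    return sum(r["value"] for r in rows)

def partitioned(rows, key):
    rows = sorted(rows, key=key)
    # One nested run per partition; the partitions come from the data
    # itself, so this composition can't be written down statically.
    return {k: analysis(list(grp)) for k, grp in groupby(rows, key=key)}

data = [{"year": 2020, "value": 1},
        {"year": 2021, "value": 2},
        {"year": 2021, "value": 3}]
print(partitioned(data, key=lambda r: r["year"]))  # {2020: 1, 2021: 5}
```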
In the future, I expect dynamic flow composition to be used for other features like packaging and using versioned flows.
I haven’t heard of anything that does dynamic flow composition like Knit. It might be a little crazy. It’s possible it will be hidden as an implementation detail.
Phew, that was a trip. I’d be curious whether this type of differentiation seems meaningful or whether it just sounds like implementation gobbledygook. Or if you just liked clicking the pictures.
OK that’s enough for this edition. Stay classy!