Data collaboration archetypes

There are a few ways to organize around data problems.¹

Centralized: data warehouses

Historically, the barriers to working with data meant you needed lots of resources, which effectively limited serious data work to institutions. This model is the most straightforward to monetize, and it is still largely the shape of the data ecosystem today. Data vendors sell into organizations top-down, emphasizing uniformity and centralized administration. ETL systems consolidate all of an organization’s data into a single data warehouse. This attempt at highly structured systematization requires heavy coordination across an organization and often remains a pipe dream.²

Distributed: web3

Some have pushed back against these centralized systems and pursued the complete opposite through blockchains (and other Merkle-tree-based structures) and distributed data stores.³ These come at a great cost, though. Distributed systems, while robust and scalable, are much less efficient.⁴ They rely on consensus around complicated protocols, which makes them slow to change and leaves them without the agility to add new functionality.
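As a rough illustration of what a Merkle tree buys these systems, here is a minimal Python sketch. It assumes SHA-256 hashing and a duplicate-the-last-node rule for odd levels; both are illustrative choices, not the scheme of any particular blockchain, and the sample rows are made up.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    """Hash raw bytes with SHA-256 (an assumed, illustrative choice)."""
    return hashlib.sha256(data).digest()


def merkle_root(records: list[bytes]) -> str:
    """Return the hex Merkle root of a list of records."""
    if not records:
        return sha256(b"").hex()
    # Leaves are the hashes of the individual records.
    level = [sha256(r) for r in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed rule: duplicate the last node when a level is odd.
            level.append(level[-1])
        # Each parent is the hash of its two children concatenated.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


# Hypothetical dataset: three rows of a tiny table.
rows = [b"alice,10", b"bob,20", b"carol,30"]
root_before = merkle_root(rows)

rows[1] = b"bob,21"  # change a single value
root_after = merkle_root(rows)

print(root_before != root_after)  # True: any edit changes the root
```

Every participant can verify the whole dataset from one short hash, which is where the robustness comes from. It also hints at the cost: changing anything means recomputing hashes and getting everyone to agree on the new root.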

Decentralized:⁵ community-supported data

Resourceful individuals have found interesting ways to build datasets in public.⁶ OpenFreeMap, for example, is developed and paid for by one person (built, of course, on other open technologies and datasets). Hosting costs have become manageable enough that motivated people can donate the storage and computation for non-trivial data projects.

By continuing to push down the cost of working with data, we can lower the barriers and allow motivated individuals to accomplish even more. The value of open source code eclipses that of even the most valuable corporations, and there’s no reason open data shouldn’t be even more valuable.

There are still complicated coordination challenges to overcome,⁷ but I believe this is the future of data. It can’t come soon enough.

¹ A parallel can be drawn between each of these models of data organization and the three classic network architectures (centralized, decentralized, and distributed).
² Other inadequacies of the traditional data warehouse have been highlighted in explanations of data mesh.
³ Examples of distributed storage technologies include IPFS and Filecoin.
⁴ Inefficiencies of distributed systems have been highlighted by DuckDB, which can run many analytics workloads on a single machine that previously required a distributed computing cluster.
⁵ Or “federated.”
⁶ For example, the git-based data strategy used by Simon Willison.
⁷ This is why I’m working on knit.

2024/10/30
