Presenters

Martin Kleppmann

Source

InfoQ podcast

The Future of Data: Navigating Cloud-Native, Decentralization, and Local-First Architectures 🚀


Today, we’re diving deep into the world of data systems with Martin Kleppmann, author of “Designing Data-Intensive Applications.” The recently launched second edition reflects significant shifts in technology as well as his own evolving understanding.

The Cloud-Native Revolution in Data ☁️

Martin highlights a major evolution since the first edition of his book: the rise of cloud-native software architectures. Historically, building distributed databases meant writing software that managed data on local disks. Replication was handled at the database level.

Now, systems are increasingly built on top of cloud services. A database might use an object store, which is inherently replicated, as its underlying storage abstraction. This fundamentally changes how developers approach building data systems: the building blocks are cloud-provided services rather than local OS services. Traditional databases still exist, but the cloud-native approach offers a new paradigm.
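To make the shift concrete, here is a minimal sketch of a database layer that depends only on an object-store interface. The names `ObjectStore`, `InMemoryObjectStore`, and `TinyKV` are hypothetical and were not mentioned in the episode; the in-memory class merely stands in for a replicated cloud object store.

```python
from typing import Dict, Optional, Protocol


class ObjectStore(Protocol):
    """The only storage interface the database layer depends on."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryObjectStore:
    """Hypothetical stand-in for a replicated cloud object store."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> Optional[bytes]:
        return self._objects.get(key)


class TinyKV:
    """A toy key-value layer that never manages a local disk itself."""

    def __init__(self, store: ObjectStore, prefix: str = "kv/") -> None:
        self._store = store
        self._prefix = prefix

    def set(self, key: str, value: str) -> None:
        self._store.put(self._prefix + key, value.encode("utf-8"))

    def get(self, key: str) -> Optional[str]:
        raw = self._store.get(self._prefix + key)
        return None if raw is None else raw.decode("utf-8")


store = InMemoryObjectStore()
db = TinyKV(store)
db.set("user:1", "alice")

# A second database process pointed at the same store sees the same data,
# because durability and replication live below the storage interface.
replica = TinyKV(store)
```

In this sketch the database code contains no replication logic at all; that responsibility is assumed to sit with whatever implements `ObjectStore`.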

From Monoliths to Modular Building Blocks 🧱

The data space is experiencing a revolution, moving away from monolithic blocks towards a more fragmented and composable architecture. For a long time, the choice was binary: build a massive system or buy one. Think of the era of huge data lakes in BI.

Now, we see a trend of composing different building blocks. Martin points to examples like the FDAP stack (Apache Arrow Flight, DataFusion, Arrow, Parquet) used by InfluxDB, which makes it easier to construct data systems. Object storage such as S3 has become a de facto standard, and Apache Parquet serves as a foundational format for data lakes and analytics. This creates a more flexible and experimental environment, where developers can mix and match components to suit their specific needs.

The SQL vs. NoSQL Evolution Continues 🔄

The SQL vs. NoSQL debate, once framed as a revolution, has evolved. Martin shares insights from QuestDB, where a multi-tiered approach is gaining traction. This involves:

A write-ahead log for NoSQL-style writes, allowing simple, low-ceremony data ingestion.
A SQL-based query engine that can query this data, which might not have been considered SQL-queryable before.
Cold storage using Apache Parquet files on a file system, which are still queryable and easily archived or backed up.

This offers a wealth of options, potentially too many for some teams when choosing a system.
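The three tiers can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not QuestDB's implementation: `sqlite3` stands in for the SQL engine, a plain list stands in for the write-ahead log, and CSV text stands in for Parquet files (so, unlike real Parquet cold storage, the archived rows here are no longer queryable). The function names `ingest`, `apply_wal`, and `archive_before` are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Tier 1: an append-only log. Writers only append; nothing is validated here.
wal = []


def ingest(record: dict) -> None:
    wal.append(json.dumps(record))


# Tier 2: a SQL engine over the ingested rows (sqlite3 as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (sensor TEXT, ts INTEGER, value REAL)")


def apply_wal() -> int:
    """Move logged records into the queryable table; return rows applied."""
    rows = [json.loads(line) for line in wal]
    db.executemany("INSERT INTO readings VALUES (:sensor, :ts, :value)", rows)
    db.commit()
    wal.clear()
    return len(rows)


# Tier 3: cold storage. CSV text stands in for Parquet files.
def archive_before(ts_cutoff: int) -> str:
    """Export rows older than the cutoff, then drop them from the hot table."""
    cur = db.execute(
        "SELECT sensor, ts, value FROM readings WHERE ts < ? ORDER BY ts",
        (ts_cutoff,),
    )
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["sensor", "ts", "value"])
    writer.writerows(cur.fetchall())
    db.execute("DELETE FROM readings WHERE ts < ?", (ts_cutoff,))
    db.commit()
    return out.getvalue()


ingest({"sensor": "a", "ts": 1, "value": 20.5})
ingest({"sensor": "a", "ts": 2, "value": 21.5})
ingest({"sensor": "b", "ts": 3, "value": 19.0})
applied = apply_wal()
avg_a = db.execute(
    "SELECT AVG(value) FROM readings WHERE sensor = 'a'"
).fetchone()[0]
cold = archive_before(3)
remaining = db.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

The point of the layering is that the write path stays cheap while the same data later becomes available to ordinary SQL and to file-based archival.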

Decentralization: Empowering Autonomy and User Agency 🌐

A significant theme Martin explored is how distributed systems can foster greater autonomy and sovereignty, reducing vendor lock-in. This is particularly relevant in the context of social media and collaboration tools.

Martin played a role in building the AT Protocol, the foundation of the Bluesky social network. The initial vision was to create a decentralized technology layer for social networking apps, aiming for a user experience indistinguishable from centralized platforms.

This contrasts with approaches like ActivityPub (used by Mastodon). While ActivityPub maximizes federated decentralization, it can lead to inconsistencies in user experience (e.g., different reply threads on different servers). The AT Protocol, however, prioritizes consistency. It achieves this through a firehose that aggregates activity from all servers, allowing different organizations to build indexing services.
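The firehose idea can be sketched as a merge of per-server event logs into one stream that any indexer can consume. This is a toy model, not the AT Protocol's actual wire format or API: the `Event` fields, the `firehose` and `build_threads` functions, and the assumption that timestamps are comparable across servers are all hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Event(NamedTuple):
    ts: int          # logical timestamp, assumed comparable across servers
    server: str      # which server hosts the author's data
    post_id: str
    reply_to: str    # empty string for a top-level post
    text: str


def firehose(*server_logs: Iterable[Event]) -> Iterator[Event]:
    """Merge per-server logs (each already sorted by ts) into one stream."""
    return heapq.merge(*server_logs)


def build_threads(stream: Iterable[Event]) -> Dict[str, List[str]]:
    """An indexer: map each parent post to the ids of its replies, in order."""
    threads: Dict[str, List[str]] = defaultdict(list)
    for event in stream:
        if event.reply_to:
            threads[event.reply_to].append(event.post_id)
    return dict(threads)


server_a = [
    Event(1, "a", "p1", "", "hello"),
    Event(4, "a", "p4", "p1", "thanks all"),
]
server_b = [Event(2, "b", "p2", "p1", "hi!")]
server_c = [Event(3, "c", "p3", "p1", "welcome")]

# Two independent indexers consume the same firehose and agree on the thread.
index_one = build_threads(firehose(server_a, server_b, server_c))
index_two = build_threads(firehose(server_c, server_a, server_b))
```

Because both indexers see every server's activity, they arrive at the same reply thread, which is the consistency property the paragraph above contrasts with per-server views.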

Key Trade-off: While the AT Protocol aims for decentralization, the current implementation of Bluesky is somewhat more centralized due to its reliance on services provided by the Bluesky company. However, the core principle remains: users should be able to switch service providers without losing data, username, posts, or their social graph. This portability is a crucial aspect of user empowerment.
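As a rough illustration of that portability principle, the sketch below moves an account between two providers while its identity, posts, and follow list stay intact. Everything here is hypothetical (the `Account`, `Provider`, and `migrate` names, and the `directory` lookup table); it shows the shape of the guarantee, not how the AT Protocol implements it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Account:
    user_id: str                 # stable identifier, independent of any host
    username: str
    posts: List[str] = field(default_factory=list)
    follows: List[str] = field(default_factory=list)


class Provider:
    """A hypothetical hosting provider that stores whole accounts."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.accounts: Dict[str, Account] = {}

    def import_account(self, account: Account) -> None:
        self.accounts[account.user_id] = account

    def export_account(self, user_id: str) -> Account:
        return self.accounts.pop(user_id)


def migrate(user_id: str, source: Provider, target: Provider,
            directory: Dict[str, str]) -> None:
    """Move an account and repoint the directory; the identity never changes."""
    target.import_account(source.export_account(user_id))
    directory[user_id] = target.name


directory: Dict[str, str] = {}   # stable user id -> current provider name
old_host, new_host = Provider("old-host"), Provider("new-host")
old_host.import_account(
    Account("user-123", "alice.example", ["first post"], ["user-456"])
)
directory["user-123"] = "old-host"

migrate("user-123", old_host, new_host, directory)
moved = new_host.accounts["user-123"]
```

Followers who look up `user-123` through the directory find the new host, so nothing about the social graph has to be rebuilt.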

Local-First: Bringing Data Back Home 🏠

The Local-First movement champions a different kind of decentralization, focusing on collaboration software for small groups of trusted users. The core principle is that the primary copy of the data resides on the user’s machine, not in the cloud.

Benefits of Local-First:

Offline Access: Use software seamlessly without an internet connection.
Performance: Faster interactions as network round trips are minimized.
User Agency & Empowerment: Reduces reliance on cloud providers and mitigates the risk of services being shut down (think Google Reader). Users retain control over their data.
Data Portability: Even if servers disappear, users still have a copy of their data.
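The benefits above follow from one design decision: reads and writes go to the local copy first, and the network is only used to sync. A minimal sketch, assuming a hypothetical `LocalFirstNotes` class and a plain dictionary standing in for the server:

```python
from typing import Dict, List, Tuple


class LocalFirstNotes:
    """Primary copy lives in `self.notes`; the server is an optional peer."""

    def __init__(self) -> None:
        self.notes: Dict[str, str] = {}
        self.pending: List[Tuple[str, str]] = []  # edits not yet uploaded
        self.online = False

    def edit(self, note_id: str, text: str) -> None:
        # Writes always succeed locally, with no network round trip.
        self.notes[note_id] = text
        self.pending.append((note_id, text))

    def read(self, note_id: str) -> str:
        return self.notes[note_id]

    def sync(self, server: Dict[str, str]) -> int:
        """Upload queued edits when online; return how many were sent."""
        if not self.online:
            return 0
        sent = len(self.pending)
        for note_id, text in self.pending:
            server[note_id] = text
        self.pending.clear()
        return sent


server_copy: Dict[str, str] = {}
app = LocalFirstNotes()
app.edit("n1", "draft written on a plane")
offline_sent = app.sync(server_copy)   # offline: nothing leaves the device
app.online = True
online_sent = app.sync(server_copy)    # back online: the queue drains
```

If `server_copy` disappeared entirely, `app.notes` would still hold the user's data, which is the portability argument in miniature. Conflict handling between multiple devices is deliberately left out of this sketch.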

Analogy to Git: Martin draws a parallel to Git, a local-first system where the entire commit history is local. While centralized platforms like GitHub and GitLab exist, the Git protocol itself remains open, allowing for multiple remote repositories. This open standard is key to avoiding lock-in, a principle local-first aims to emulate for broader applications.

Challenges and Considerations:

Complexity: Peer-to-peer synchronization is inherently difficult to make reliable, often necessitating cloud services for syncing across devices.
Business Logic: Retrofitting existing apps that rely heavily on server-side business logic to a local-first model can be challenging. It often requires shifting that logic to the client-side.

Automerge: A Local-First Library 🛠️

Martin and his collaborators work on Automerge, an open-source library for building collaboration software. Implemented in Rust for portability and compiled to WebAssembly (WASM) for web use, Automerge provides a data model for applications, enabling:

Automatic Synchronization: Data syncs between machines.
Real-time Collaboration: Seamless co-editing.
Version Control: Branching, diffing, and merging capabilities.

Automerge can model many kinds of documents, including text, spreadsheets, graphics, and even CAD models. While JavaScript is a common way to use it via WASM and TypeScript wrappers, native bindings exist for Swift, Java, Python, Go, and C.

Getting Started with Automerge: A simple to-do list app that syncs across devices is recommended as a “hello world” project. This demonstrates Automerge’s ability to provide cross-device synchronization, a feature not native to standard JavaScript frameworks.
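To show what such a to-do list has to get right, here is a toy merge scheme written from scratch. It does not use Automerge and is not Automerge's API or algorithm: it is a simple last-writer-wins rule per item, with hypothetical names (`TodoReplica`, `Stamp`), meant only to illustrate two devices editing offline and converging after they exchange state.

```python
from typing import Dict, Tuple

Stamp = Tuple[int, str]              # (logical counter, replica id)
Item = Tuple[Stamp, str, bool]       # (stamp, title, done)


class TodoReplica:
    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counter = 0
        self.items: Dict[str, Item] = {}

    def _next_stamp(self) -> Stamp:
        self.counter += 1
        return (self.counter, self.replica_id)

    def add(self, title: str) -> str:
        stamp = self._next_stamp()
        item_id = f"{self.replica_id}-{stamp[0]}"
        self.items[item_id] = (stamp, title, False)
        return item_id

    def set_done(self, item_id: str, done: bool) -> None:
        _, title, _ = self.items[item_id]
        self.items[item_id] = (self._next_stamp(), title, done)

    def merge(self, other: "TodoReplica") -> None:
        """Keep, per item, whichever version carries the larger stamp."""
        for item_id, theirs in other.items.items():
            mine = self.items.get(item_id)
            if mine is None or theirs[0] > mine[0]:
                self.items[item_id] = theirs
        # Advance the counter so later local edits outrank everything seen.
        self.counter = max(self.counter, other.counter)

    def view(self) -> Dict[str, bool]:
        return {title: done for _, title, done in self.items.values()}


laptop, phone = TodoReplica("laptop"), TodoReplica("phone")
milk = laptop.add("buy milk")
phone.merge(laptop)            # phone learns about the item
phone.set_done(milk, True)     # edited while the laptop is offline
laptop.add("call bank")        # concurrent edit on the laptop
laptop.merge(phone)
phone.merge(laptop)
```

After both merges the two replicas hold identical items regardless of the order in which they synced. A real library handles much more than this sketch, such as text editing, history, and network transport.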

Limitations of Local-First ⚠️

Local-First is best suited for creation apps where users directly edit data (e.g., spreadsheets, graphics, documents). It’s less ideal for applications where an authoritative server copy is essential, such as:

Banking Applications: Local edits to a bank balance don’t reflect actual funds.
Online Shops: Downloading entire product catalogs can be inefficient and impractical. Customers don’t typically edit product descriptions.
Physical Resource Tracking: Stock levels in a warehouse are inherently centralized and tied to physical reality.
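The banking case can be worked through in a few lines. In this hypothetical sketch, two devices each approve a purchase against their own local copy of the balance, and the result is compared with a single authoritative server processing the same requests in order. The function names and amounts are invented for illustration.

```python
def spend_offline(local_balance: int, amount: int) -> int:
    """A replica approves a purchase using only its own copy of the balance."""
    if amount > local_balance:
        raise ValueError("insufficient funds")
    return local_balance - amount


starting_balance = 100

# Two devices start from the same synced balance and go offline.
phone_balance = spend_offline(starting_balance, 80)    # approved locally
laptop_balance = spend_offline(starting_balance, 70)   # also approved locally

# On reconnect, both withdrawals apply to the one real account.
merged_balance = starting_balance - 80 - 70


def authoritative_spend(balance: int, amounts: list) -> tuple:
    """A single server processes requests in order and rejects overdrafts."""
    rejected = []
    for amount in amounts:
        if amount > balance:
            rejected.append(amount)
        else:
            balance -= amount
    return balance, rejected


server_balance, rejected = authoritative_spend(starting_balance, [80, 70])
```

Each local check passes, yet the merged account ends up overdrawn, whereas the authoritative server rejects the second purchase. That is why data tied to a single real-world resource fits poorly with a model where every device holds a writable primary copy.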

Contributing to the Local-First Movement 🤝

Contributing to the local-first ecosystem can involve:

Building Apps with Automerge: Experimenting with the library to create local-first applications.
Exploring Other Libraries: Resources like local-first.fm offer comparisons of various libraries.
Diving into Infrastructure: Contributing to open-source projects for performance improvements, new synchronization protocols, end-to-end encryption, and decentralized access control.
Joining the Community: The Automerge Discord server is a great place for communication and getting questions answered.

The Local First Conference in Berlin is another excellent venue for connecting with the community, sharing ideas, and exploring different implementations. This year’s conference broadens its scope to focus on user agency and empowerment, themes that resonate deeply with the AT Protocol’s mission of reducing user lock-in.

The future of data systems is moving towards greater user control, flexibility, and modularity. Whether it’s cloud-native architectures, the composability of data tools, or the empowerment of local-first principles, the landscape is rich with innovation and opportunity for engineers to build more resilient, user-centric systems.

Increasing Users’ Data Agency: From BlueSky's AT Protocol to the Local-First Software Movement
