GitHub: Designing Data-Intensive Applications
This is where gh-ost (GitHub's online schema migration tool for MySQL) shines. A traditional blocking ALTER TABLE can lock the table and stall writes for minutes or hours. gh-ost instead creates a shadow table with the new schema, copies existing rows in small chunks, and reads the binary log to replay writes made to the original table onto the shadow table, all while the application continues running. At the final moment, it performs a near-instantaneous atomic swap of the table names. This is a practical application of Kleppmann's discussion of eventual consistency. The system is in a temporary, inconsistent state (rows exist in both tables, and the shadow table lags the original), but the application never sees this complexity because it keeps reading and writing the original table until the swap. The maintainability payoff is immense: GitHub can ship schema changes as routine operations, a velocity unthinkable in a system that required scheduled maintenance windows.
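The copy, replay, and swap steps described above can be sketched in a few lines of Python. This is a toy in-memory model, not gh-ost's actual code: the `OnlineMigration` class, its method names, the `change_log` list standing in for the MySQL binary log, and the `users` table are all hypothetical names chosen for illustration.

```python
from typing import Callable, Optional

Row = dict
Transform = Callable[[Row], Row]


class OnlineMigration:
    """Toy model of a shadow-table migration (all names are hypothetical)."""

    def __init__(self, tables: dict, name: str, transform: Transform,
                 chunk_size: int = 2) -> None:
        self.tables = tables
        self.name = name
        self.shadow_name = f"_{name}_shadow"
        self.transform = transform          # maps an old-schema row to the new schema
        self.chunk_size = chunk_size
        self.change_log: list = []          # stands in for the binary log
        self.applied = 0                    # log entries already replayed
        self.last_copied: Optional[int] = None
        self.tables[self.shadow_name] = {}

    # Application writes keep hitting the original table and are logged.
    def upsert(self, key: int, row: Row) -> None:
        self.tables[self.name][key] = row
        self.change_log.append(("upsert", key, dict(row)))

    def delete(self, key: int) -> None:
        self.tables[self.name].pop(key, None)
        self.change_log.append(("delete", key, None))

    def copy_chunk(self) -> bool:
        """Copy the next chunk of rows; return True while rows remain."""
        original = self.tables[self.name]
        shadow = self.tables[self.shadow_name]
        pending = sorted(k for k in original
                         if self.last_copied is None or k > self.last_copied)
        chunk = pending[: self.chunk_size]
        for key in chunk:
            # Like INSERT IGNORE: never overwrite a row the replay already wrote.
            shadow.setdefault(key, self.transform(original[key]))
        if chunk:
            self.last_copied = chunk[-1]
        return len(pending) > len(chunk)

    def replay_log(self) -> None:
        """Apply logged writes, in order, that the shadow table has not seen."""
        shadow = self.tables[self.shadow_name]
        for op, key, row in self.change_log[self.applied:]:
            if op == "upsert":
                shadow[key] = self.transform(row)
            else:
                shadow.pop(key, None)
        self.applied = len(self.change_log)

    def cut_over(self) -> None:
        """Drain the log, then swap names. A real tool briefly blocks writes here."""
        self.replay_log()
        self.tables[self.name] = self.tables.pop(self.shadow_name)


def demo() -> dict:
    tables = {"users": {1: {"name": "ada"}, 2: {"name": "bob"}, 3: {"name": "cy"}}}
    migration = OnlineMigration(tables, "users",
                                lambda row: {**row, "archived": False})
    migration.copy_chunk()                       # copies rows 1 and 2
    migration.upsert(2, {"name": "bobby"})       # write to an already-copied row
    migration.upsert(4, {"name": "dee"})         # brand-new row
    migration.delete(1)                          # delete an already-copied row
    migration.replay_log()
    while migration.copy_chunk():
        pass
    migration.cut_over()
    return tables
```

The design point worth noticing is that the row copy never overwrites what the log replay has written, while the replay always overwrites. Under that rule the shadow table converges on the original once every logged write has been applied, regardless of how copying and replaying interleave.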
In the early days, GitHub was a simple platform for developers to share and collaborate on code. As the platform grew, so did the need for robust data-intensive applications to support its users. The team faced challenges in handling large amounts of data, ensuring scalability, and providing real-time insights to developers.
To overcome these challenges, the GitHub team adopted a data-intensive architecture, centered around the following key components:
Explore Vitess (used by YouTube) to see how massive MySQL clusters are sharded and managed across distributed environments.
5. Derived Data: Batch and Stream Processing
Use DDIA as your map, but use GitHub as your training ground.
At its heart, GitHub must solve a fundamental impedance mismatch. Git is a content-addressable file system. It stores data as a directed acyclic graph (DAG) of blobs, trees, commits, and tags, identified by SHA-1 hashes. This is an immutable, decentralized data model. However, the GitHub web interface requires a centralized, queryable, relational view: “Show me all open pull requests authored by user X,” or “Which repositories does this commit belong to?”
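To make the mismatch concrete, the sketch below shows both sides in Python. `git_blob_id` reproduces how Git derives a blob's SHA-1 from a `blob <size>\0` header plus the content. `index_commits_by_repo` is a hypothetical helper, not GitHub's implementation, that builds the kind of derived lookup the web interface needs to answer "which repositories does this commit belong to?". The repository names and commit IDs in the usage note are made up.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)   # object type, byte length, NUL
    return hashlib.sha1(header + content).hexdigest()


def index_commits_by_repo(repo_commits: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert a repo -> commits mapping into a commit -> repos lookup.

    Hypothetical helper: a stand-in for a derived, queryable index kept
    alongside the immutable object store.
    """
    index: dict[str, set[str]] = defaultdict(set)
    for repo, commits in repo_commits.items():
        for commit_id in commits:
            index[commit_id].add(repo)
    return dict(index)
```

For example, `git_blob_id(b"hello world\n")` matches what `git hash-object` reports for a file with that content, and `index_commits_by_repo({"octo/app": {"c1", "c2"}, "fork/app": {"c1"}})` maps `"c1"` to both repositories. The object store is addressed purely by content, so any question phrased in terms of users, repositories, or pull requests needs a separately maintained index like this one.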