Michael Armbrust

The student who applied four times and built the data lakehouse

By VastBlue Editorial · 2026-03-26 · 18 min read

Series: The Inventors · Episode 10


Application Number Four

Michael Armbrust applied to the computer science PhD programme at UC Berkeley and was rejected. He applied again. Rejected. A third time. Rejected. On his fourth application, he was accepted into the programme that would become the birthplace of Apache Spark, Delta Lake, and the data lakehouse architecture. The student they kept rejecting would become one of the most influential data systems engineers of his generation.

To understand what three rejections from Berkeley actually mean, you need to understand what Berkeley's computer science programme looked like in the late 2000s. The department was one of the top three in the world. Acceptance rates for the PhD programme hovered around eight to ten per cent — and those were the rates for applicants who had already filtered themselves through undergraduate programmes at competitive universities, strong GREs, and credible research experience. The applicant pool was not random. It was pre-selected. Being rejected from this pool was not unusual. Being rejected three times, and applying a fourth, was.

Each rejection carried compounding psychological weight. The first rejection is disappointing. The second is disheartening — it suggests the first was not a fluke. The third is the one where most people rewrite the narrative. They decide the programme was not right for them. They find a different path. They tell themselves a story about fit, or timing, or priorities. What they do not do is apply again. Armbrust applied again.

Armbrust has discussed this fact publicly, and it is worth pausing on because it illuminates something about how foundational engineering actually happens. The committee that evaluated his applications was not incompetent. PhD admissions at Berkeley are intensely competitive, and reasonable people applying reasonable criteria can reject candidates who go on to do extraordinary work. The system is not designed to identify future impact — it is designed to manage scarcity. Admissions committees optimise for legibility — clear research trajectories, strong letters from known advisors, publications in recognised venues. What they cannot optimise for is the trait that turned out to matter most in Armbrust's career: the willingness to iterate on the same problem until it yields.

What Armbrust's four applications reveal is not that admissions committees are broken. It is that persistence, in engineering, is not a personality trait — it is a technical capability. The person who applies four times is the person who will debug the same system for four months, or iterate on the same architecture for four years, until it works. The admissions committee was evaluating credentials. They could not evaluate tenacity because tenacity is only visible in retrospect. It is the one quality that cannot be demonstrated on an application form — only through years of applied effort.

Spark SQL and the Catalyst Optimizer

Once at Berkeley, Armbrust joined the AMPLab — the same research group where Matei Zaharia was developing Spark. Armbrust's first major contribution was Spark SQL, and its significance was both technical and sociological. Technically, Spark SQL brought structured query capabilities to Spark's distributed processing engine. Sociologically, it made Spark accessible to the millions of data professionals who spoke SQL but not Scala or Java.

The sociological impact deserves emphasis. By the early 2010s, the data industry had split into two camps. On one side were the engineers — people who wrote MapReduce jobs in Java, or Spark programmes in Scala, and understood distributed systems well enough to reason about partitioning, shuffles, and fault tolerance. On the other side were the analysts — people who wrote SQL queries, built dashboards, and understood business logic but had no interest in learning a general-purpose programming language. These two camps worked with the same data but used entirely different tools, and the translation layer between them — typically an overworked data engineering team — was a persistent bottleneck.

Spark SQL bridged this divide with the DataFrame API. A DataFrame was a distributed collection of rows with named, typed columns — conceptually identical to a table in a relational database or a data frame in R or pandas. Data professionals could interact with it using SQL syntax (writing queries as strings) or using a programmatic API (chaining method calls in Python, Scala, or Java). Both interfaces compiled to the same internal representation, were optimised by the same engine, and executed with the same performance characteristics. The analyst writing SQL and the engineer writing Scala were, at the execution level, writing the same programme.
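The duality can be made concrete with a toy sketch in pure Python. This is not Spark's actual implementation; the node names (`Scan`, `Filter`, `Project`) and the helper `from_sql_parts` (standing in for a real SQL parser) are illustrative assumptions. The point it demonstrates is the one above: a SQL front end and a method-chaining API can both compile to the same logical plan object.

```python
from dataclasses import dataclass

# Minimal logical-plan nodes, loosely modelled on Catalyst's tree structure.
@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    condition: str
    child: object

@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object

def from_sql_parts(table, where, select):
    """Stand-in for a SQL front end: builds the plan a parser would emit."""
    return Project(tuple(select), Filter(where, Scan(table)))

class DataFrame:
    """Stand-in for the programmatic API: method chains build the same tree."""
    def __init__(self, plan):
        self.plan = plan
    def where(self, condition):
        return DataFrame(Filter(condition, self.plan))
    def select(self, *columns):
        return DataFrame(Project(columns, self.plan))

# The analyst's SQL and the engineer's method chain converge on one plan.
sql_plan = from_sql_parts("events", "country = 'NZ'", ["user_id", "ts"])
df_plan = (DataFrame(Scan("events"))
           .where("country = 'NZ'")
           .select("user_id", "ts")
           .plan)
assert sql_plan == df_plan  # identical representation, one optimiser downstream
```

Because both interfaces produce the same tree, everything downstream of this point (optimisation, planning, execution) is shared, which is exactly why the two camps could stop maintaining parallel toolchains.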

100,000+ organisations using Spark by 2020 — Spark SQL and the DataFrame API were the primary drivers of adoption, making distributed data processing accessible to SQL-fluent analysts who had previously been locked out of the big data ecosystem.

How Catalyst Reasons About Queries

The technical contribution was not simply bolting a SQL parser onto Spark. Armbrust designed Catalyst — a query optimiser that analysed SQL queries (or their programmatic equivalents in Spark's DataFrame API), applied a cascade of optimisation rules, and generated efficient distributed execution plans. To understand why Catalyst mattered, consider what happens when a user writes a query that joins three tables, filters on two columns, and aggregates the results.

A naive execution would process the query as written: read all three tables in full, join them in the order specified, apply the filters, then aggregate. Catalyst transforms this into something radically more efficient. First, it parses the query into an unresolved logical plan — an abstract syntax tree representing the query's intent without yet binding it to actual data sources. Then it resolves references against the catalogue (verifying that tables and columns exist), producing a resolved logical plan. Then the optimiser applies transformation rules — dozens of them, executed in multiple passes.

Predicate pushdown moves filter conditions as close to the data source as possible. If the query filters on a date column, Catalyst pushes that filter down to the file reader, so that entire Parquet row groups that fall outside the date range are never read from storage at all. On a terabyte dataset where the query needs only one day's data, this single optimisation can reduce I/O by orders of magnitude.

Join reordering chooses the most efficient sequence of joins based on table statistics. Joining a billion-row fact table with a thousand-row dimension table first, then joining the result with a million-row table, is vastly more efficient than joining the two large tables first. Catalyst uses column statistics, row counts, and partition information to determine the optimal join order — a combinatorial problem that human programmers often get wrong because they lack complete information about data distributions at query-writing time.

Constant folding evaluates constant expressions at planning time rather than execution time. Column pruning reads only the columns actually referenced by the query, which in columnar storage formats like Parquet means entire columns of data are never loaded from disk. Broadcast join detection identifies when one side of a join is small enough to fit in memory and broadcasts it to all executors, avoiding an expensive shuffle operation.
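The rule-driven structure behind all of these optimisations can be sketched in a few lines of Python. This is a conceptual toy, not Catalyst itself: the node types and the single pushdown rule are illustrative assumptions, and real Catalyst runs many rules in repeated passes until the plan stops changing. What the sketch shows is the core mechanic, a rule as a pure function applied over an immutable plan tree.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    table: str
    pushed_filters: tuple = ()   # filters the file reader will apply at source

@dataclass(frozen=True)
class Filter:
    condition: str
    child: object

@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object

def rewrite(plan, rule):
    """Apply a transformation rule bottom-up: children first, then this node."""
    if isinstance(plan, Filter):
        plan = Filter(plan.condition, rewrite(plan.child, rule))
    elif isinstance(plan, Project):
        plan = Project(plan.columns, rewrite(plan.child, rule))
    return rule(plan)

def push_filter_into_scan(plan):
    """Rule: a Filter sitting directly on a Scan moves into the Scan itself,
    so the reader can skip data (e.g. Parquet row groups) at the source."""
    if isinstance(plan, Filter) and isinstance(plan.child, Scan):
        scan = plan.child
        return Scan(scan.table, scan.pushed_filters + (plan.condition,))
    return plan

plan = Project(("user_id",), Filter("day = '2024-01-15'", Scan("events")))
optimised = rewrite(plan, push_filter_into_scan)
assert optimised == Project(("user_id",),
                            Scan("events", ("day = '2024-01-15'",)))
```

Because rules are just functions over trees, adding a new optimisation means adding a function to the list the engine iterates, which is the design property the next paragraph turns to.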

The critical design decision was that Catalyst was extensible. The optimiser was implemented as a library of tree-transformation rules applied to the query plan, and developers could add custom rules for domain-specific patterns. A financial data team could add rules that optimised time-series joins — recognising that temporal joins with sorted timestamps could use merge-join strategies rather than hash joins. A genomics team could add rules that optimised variant-call queries by pushing filters into specialised file formats. A geospatial team could add spatial index lookups. This extensibility meant that Spark SQL could be adapted to specific workloads without modifying the core engine — each team could embed their domain knowledge into the optimiser itself.

The result, counterintuitively, was that SQL queries on Spark frequently outperformed hand-written Spark programmes. The optimiser could find efficiencies that human programmers missed — reordering operations, eliminating redundant computations, choosing more efficient join strategies, selecting optimal partitioning schemes. A hand-written Spark programme expressed a specific execution plan. A Catalyst-optimised query expressed an intent, and the optimiser found the best execution plan for that intent given the actual data characteristics at query time. The abstraction was not a performance tax. It was a performance advantage. The same pattern that compiler writers had demonstrated for decades — that optimising compilers outperform hand-written assembly for all but the most trivial programmes — now applied to distributed data processing.

Delta Lake: Making Data Lakes Reliable

Armbrust's most consequential contribution came after Berkeley, at Databricks. Every large organisation that used data lakes — and by the mid-2010s, that was most large organisations — lived with a dirty secret: their data was unreliable. A conventional data lake stored files in cloud object storage (Amazon S3, Azure Data Lake Storage, Google Cloud Storage) without transactional guarantees. Multiple writers could corrupt data by writing simultaneously. Failed pipelines left partial results — half-written files that corrupted downstream analysis. There was no mechanism for rolling back a bad write. Schema changes could silently break pipelines. And there was no way to query what the data looked like yesterday, because every write overwrote the previous state.

The industry workaround was to maintain two separate systems: a data lake for cheap, flexible storage, and a data warehouse (Snowflake, Redshift, BigQuery) for reliable, structured queries. Data was ETL'd from lake to warehouse, a process that was expensive, slow, and — ironically — itself a source of data quality problems. The two-system architecture doubled infrastructure costs, doubled operational complexity, and introduced latency between when data landed in the lake and when it was available for reliable querying.

ACID — the transactional guarantees (atomicity, consistency, isolation, durability) that data lakes had never had, brought by Delta Lake to open formats on cheap storage, eliminating the need for a separate data warehouse.

The Transaction Log

Delta Lake solved this by adding a transaction log — the _delta_log directory — a metadata layer that recorded every change to a dataset as a versioned, atomic operation. The mechanics are precise. Each write to a Delta table produces a new JSON file in the transaction log, numbered sequentially: 000000.json, 000001.json, 000002.json. Each log entry records the specific actions performed — which Parquet data files were added, which were removed, what metadata changed. A reader reconstructs the current state of the table by replaying the log entries from the beginning (or from the most recent checkpoint) to the present version.

Atomicity — the A in ACID — was achieved through a simple but effective mechanism: optimistic concurrency control. A writer reads the current version of the log, performs its computation, then attempts to write the next log entry. If another writer has committed a new version in the meantime, the write fails and must be retried. On object stores that offer an atomic put-if-absent operation (a file either comes into existence completely or not at all), this guarantees that each transaction is all-or-nothing; on stores that lack the primitive, such as Amazon S3, commits are serialised through a lightweight coordination service. No partial writes. No corrupted state. No half-finished files polluting the lake.
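The commit protocol can be sketched with the local filesystem standing in for object storage. This is an illustrative toy, not Delta Lake's code: the function name `try_commit` and the six-digit file naming are assumptions, and `O_CREAT | O_EXCL` plays the role of the put-if-absent primitive described above.

```python
import json
import os
import tempfile

def try_commit(log_dir, version, actions):
    """Attempt to commit `actions` as log entry number `version`.
    O_CREAT | O_EXCL makes the create fail if the file already exists,
    which is exactly the put-if-absent primitive the protocol relies on."""
    path = os.path.join(log_dir, f"{version:06d}.json")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another writer won this version: re-read log, retry
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    return True

log = tempfile.mkdtemp()
assert try_commit(log, 0, {"add": ["part-000.parquet"]}) is True
assert try_commit(log, 0, {"add": ["part-001.parquet"]}) is False  # conflict
assert try_commit(log, 1, {"add": ["part-001.parquet"]}) is True   # retried
```

The losing writer is not corrupted, merely rejected: it re-reads the log, checks its work is still valid against the new state, and commits as the next version.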

Time travel — the ability to query the state of a table at any previous point — fell out naturally from the log-based architecture. To see what the data looked like at version 47, simply replay the transaction log up to entry 47 and ignore everything after. The data files for previous versions remained on storage (subject to configurable retention policies), so historical queries required no separate backup system, no manual snapshots, no ETL to a historical archive. The command was trivially simple: SELECT * FROM table VERSION AS OF 47. Or, for timestamp-based queries: SELECT * FROM table TIMESTAMP AS OF '2024-01-15'. Auditors, regulators, and debugging engineers could answer the question "what did this data look like last Tuesday?" without any advance preparation.
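Replay, and the time travel that falls out of it, fit in one short function. Again a toy sketch rather than Delta Lake's implementation: the `as_of` parameter, the `add`/`remove` action keys, and the six-digit naming are illustrative assumptions standing in for the real log format.

```python
import json
import os
import tempfile

def table_state(log_dir, as_of=None):
    """Rebuild the set of live data files by replaying log entries in order.
    Stopping early at `as_of` is all that time travel requires."""
    live = set()
    for name in sorted(f for f in os.listdir(log_dir) if f.endswith(".json")):
        version = int(name.split(".")[0])
        if as_of is not None and version > as_of:
            break
        with open(os.path.join(log_dir, name)) as f:
            entry = json.load(f)
        live |= set(entry.get("add", []))
        live -= set(entry.get("remove", []))
    return live

log = tempfile.mkdtemp()
for v, entry in enumerate([
    {"add": ["a.parquet"]},
    {"add": ["b.parquet"]},
    {"add": ["c.parquet"], "remove": ["a.parquet"]},  # compaction rewrote a
]):
    with open(os.path.join(log, f"{v:06d}.json"), "w") as f:
        json.dump(entry, f)

assert table_state(log) == {"b.parquet", "c.parquet"}           # current state
assert table_state(log, as_of=1) == {"a.parquet", "b.parquet"}  # VERSION AS OF 1
```

Note that version 2 removed `a.parquet` from the table without deleting the file, which is precisely why the historical query at version 1 still works.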

Schema enforcement was another pillar. Delta Lake validated every write against the table's declared schema. If a data pipeline attempted to write a column with an unexpected type, or omitted a required column, the write was rejected with a clear error rather than silently corrupting downstream consumers. Schema evolution — adding new columns, widening types — was supported explicitly through merge operations, but only when the user opted in. The default behaviour was strict enforcement, because Armbrust understood that in data engineering, silent failures are more expensive than loud ones.
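A minimal sketch of the enforcement check, with a plain Python dict standing in for a declared table schema (the function name and error wording are assumptions, not Delta Lake's API):

```python
def validate_write(schema, rows):
    """Reject a batch that does not match the declared schema.
    `schema` maps column name -> expected Python type.
    A loud, specific error beats silent downstream corruption."""
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            raise ValueError(f"row {i} missing required columns: {sorted(missing)}")
        for col, expected in schema.items():
            if not isinstance(row[col], expected):
                raise ValueError(
                    f"row {i}: column '{col}' expected {expected.__name__}, "
                    f"got {type(row[col]).__name__}")
    return True

schema = {"user_id": int, "amount": float}
assert validate_write(schema, [{"user_id": 1, "amount": 9.50}])

try:
    validate_write(schema, [{"user_id": "u1", "amount": 9.50}])  # wrong type
    raise AssertionError("should have been rejected")
except ValueError:
    pass  # the bad write fails loudly, before it reaches storage
```

The design choice the sketch mirrors is that validation happens at write time, at the storage layer, so no consumer ever has to defend against a malformed row.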

The underlying storage format remained open — Apache Parquet files on cloud object storage, readable by any tool that understands Parquet. The transaction log added reliability without imposing a proprietary format. This openness was strategic: it meant Delta Lake could be adopted without vendor lock-in, a critical factor for enterprise adoption. The lakehouse thesis — apply warehouse-grade reliability to lake-priced storage — eliminated the need for the two-system architecture. One system, one copy of the data, warehouse reliability, lake economics. Delta Lake became the storage layer for Databricks' platform and was adopted broadly across the industry.

Structured Streaming: Unifying Batch and Real-Time

Armbrust also led the design of Structured Streaming — Spark's engine for processing real-time data. The design insight was deceptively simple: treat a real-time data stream as a table that grows continuously. New data arrives as new rows appended to the table. Queries against the stream use the same SQL syntax and the same Catalyst optimiser as queries against static tables. The engine handles the complexities of windowing (grouping events into time windows), watermarking (handling late-arriving data), and exactly-once processing (guaranteeing that each event is processed once and only once, even in the presence of failures) transparently.

Micro-Batch vs Continuous Processing

Under the hood, Structured Streaming offered two execution modes. The default was micro-batch processing: the engine accumulated incoming data into small batches (typically every few hundred milliseconds to a few seconds), processed each batch as a conventional Spark job, and wrote the results atomically. This approach traded a small amount of latency — each event was delayed by the batch interval — for enormous simplicity. Each micro-batch was a complete, self-contained Spark job that could be optimised, checkpointed, and recovered using the same mechanisms as batch processing. Fault tolerance came for free.

For use cases that demanded latency on the order of a millisecond, Spark introduced an experimental continuous processing mode that processed events one at a time without batching. The tradeoff was explicit: continuous mode offered lower latency but weaker guarantees (at-least-once instead of exactly-once) and supported a narrower set of operations. Armbrust's architectural decision to make micro-batch the default was pragmatic. Most real-world streaming use cases — dashboard updates, ETL pipelines, alerting systems — could tolerate a few hundred milliseconds of latency. What they could not tolerate was data loss or duplication.
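The micro-batch loop itself is conceptually simple, as this toy sketch shows. It is not Spark's engine: the names `run_micro_batches` and `count_job` are illustrative, batches here are sliced from a list rather than accumulated from a live source, and the "atomic commit" is modelled as appending a whole batch result at once.

```python
def run_micro_batches(events, batch_size, process):
    """Drain `events` in small batches. Each batch is a self-contained job:
    it reads state, computes, and commits its whole result in one step,
    so recovery can simply re-run the last uncommitted batch."""
    outputs, state = [], {"count": 0}
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        result = process(batch, state)   # one conventional batch job
        outputs.append(result)           # commit the batch's output atomically
    return outputs, state

def count_job(batch, state):
    state["count"] += len(batch)
    return {"batch_rows": len(batch), "running_total": state["count"]}

out, state = run_micro_batches(list(range(10)), batch_size=4, process=count_job)
assert state["count"] == 10
assert [o["batch_rows"] for o in out] == [4, 4, 2]
```

Because every unit of work is a complete batch job, the checkpointing, optimisation, and recovery machinery already built for batch processing applies unchanged, which is the "fault tolerance came for free" claim above in miniature.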

Watermarking and Late Data

Watermarking addressed one of the hardest problems in stream processing: late-arriving data. In a perfect system, events arrive in order. In real systems — especially those ingesting data from mobile devices, IoT sensors, or distributed applications — events frequently arrive out of order. A temperature reading timestamped at 14:00 might arrive at 14:05 because of network delays. A click event from a mobile phone might arrive hours late because the device was temporarily offline.

Without watermarks, the engine would have to keep every time window open indefinitely, waiting for arbitrarily late data — consuming unbounded memory and preventing any window from ever being finalised. With watermarks, the user specified a threshold: "I expect data to arrive no more than ten minutes late." The engine then closed windows that were older than the watermark, freeing memory and producing final results. Events arriving after the watermark were dropped. The tradeoff was explicit and configurable, giving engineers precise control over the accuracy-versus-resource tradeoff for their specific use case.
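The watermark mechanism can be sketched directly. This is a simplified model, not Spark's implementation: event times are bare integers, the function name is an assumption, and the watermark is computed per event rather than per micro-batch, but the three behaviours above are all present, bounded state, finalised windows, and dropped late events.

```python
def window_counts_with_watermark(events, window, lateness):
    """Count events per tumbling time window of width `window`.
    The watermark is (max event time seen) - `lateness`: windows entirely
    behind it are finalised and freed; events behind it are dropped."""
    open_windows, finalised, dropped = {}, {}, 0
    max_event_time = 0
    for t in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - lateness
        w = (t // window) * window           # start of this event's window
        if w + window <= watermark:
            dropped += 1                     # too late: window already closed
            continue
        open_windows[w] = open_windows.get(w, 0) + 1
        # finalise every open window now entirely behind the watermark
        for ready in [w0 for w0 in open_windows if w0 + window <= watermark]:
            finalised[ready] = open_windows.pop(ready)
    return open_windows, finalised, dropped

# Out-of-order events: 2 arrives after 12 (tolerated), 4 after 25 (dropped).
open_w, final, dropped = window_counts_with_watermark(
    [1, 3, 12, 2, 25, 4], window=10, lateness=5)
assert final == {0: 3, 10: 1}   # windows [0,10) and [10,20) closed with counts
assert open_w == {20: 1}        # window [20,30) still open, still bounded
assert dropped == 1             # the event at t=4 arrived behind the watermark
```

The event at t=2 and the event at t=4 are equally "late" in wall-clock terms; the difference is where the watermark stood when each arrived, which is exactly the configurable accuracy-versus-resource dial described above.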

The Unification Thesis

The significance of this unification is operational. Before Structured Streaming, organisations typically maintained separate systems for batch processing (processing historical data) and stream processing (processing real-time data) — often with different engines, different APIs, different deployment procedures, and different engineering teams. A data pipeline that analysed yesterday's data used one codebase. A pipeline that analysed today's data in real time used a different codebase. Keeping both pipelines consistent — producing the same results from the same logic — was a persistent source of bugs and operational burden. The industry even had a name for this dual-pipeline workaround: the Lambda Architecture. It was universally acknowledged as painful.

Armbrust's design eliminated this duality. The same query, written once, could process both historical and real-time data. The same code, the same optimiser, the same guarantees. One codebase. One truth. The engineer who had applied to Berkeley four times had, once again, refused to accept that two separate solutions were necessary when one unified solution was possible — even if building the unified solution required years of additional iteration.

The Compound Engineer

Spark SQL, Delta Lake, and Structured Streaming are not independent products. They are layers of a single, coherent vision: a unified data platform where SQL, batch processing, real-time processing, and reliable storage all work together as though they were designed as one system. Which, of course, they were — by a small group of engineers at Berkeley and Databricks, with Armbrust at the centre of the technical architecture.

The layering is the architecture. Catalyst optimises both batch and streaming queries because Structured Streaming was designed to reuse the same optimiser. Delta Lake provides reliable storage for both batch outputs and streaming sinks because the transaction log was designed to handle concurrent writes from multiple streaming jobs. Schema enforcement in Delta Lake protects the data quality that Spark SQL queries depend on. Time travel in Delta Lake enables debugging of Structured Streaming pipelines by allowing engineers to inspect the exact state of a table at the moment a streaming job produced an anomalous result. Every layer strengthens every other layer.

The most powerful engineering is not building new things. It is making existing things work together as though they were designed as one. The seams disappear. The system coheres. The user stops thinking about plumbing and starts thinking about problems.

Editorial observation

Each layer built on the one before it. Spark SQL made Spark accessible to SQL users. Delta Lake made Spark reliable for enterprise workloads. Structured Streaming made Spark capable of real-time processing with the same reliability guarantees. The stack is the achievement. And the engineer who built it is the one who was rejected three times before he was given the chance.

Persistence as Architecture

There is a structural parallel between Armbrust's personal story and his engineering contributions that is too precise to be coincidental. A person who applies to the same programme four times is a person who believes that the right approach, applied with sufficient iteration, will eventually succeed. This is also the core thesis of every system he built.

Catalyst is, at its core, an iterative system. It applies transformation rules in multiple passes, each pass refining the plan further, until the plan converges on an optimal execution strategy. The Delta Lake transaction log is an iterative structure — each version builds on every previous version, and the current state is the accumulation of every change ever made. Structured Streaming processes data iteratively, one micro-batch at a time, each building on the state produced by the previous one. The architecture mirrors the architect.

The lesson from Armbrust is not inspirational in the shallow sense. It is structural. The persistence that led him to apply four times to Berkeley is the same persistence that led him to iterate on Catalyst's optimiser until it outperformed hand-written code, to refine Delta Lake's transaction log until it could guarantee ACID on cloud object storage, to design Structured Streaming until batch and real-time were genuinely unified. Persistence is not about wanting it badly enough. It is about working on the same problem, layer by layer, until the layers add up to something that changes how an industry operates.

The admissions committee that rejected Armbrust three times was not wrong by its own criteria. It was measuring the wrong thing. It was measuring credentials when it should have been measuring tenacity. But tenacity cannot be measured on an application form. It can only be measured in systems — in code that is rewritten until it is right, in architectures that are iterated until they cohere, in the stubborn refusal to accept that the problem is too hard or that two mediocre solutions are good enough when one excellent solution is possible. Armbrust's career is what tenacity looks like when it is applied to data systems for fifteen years. The results are visible in every organisation that runs Spark SQL, stores data in Delta Lake, or processes streams through Structured Streaming. The trait that the admissions committee could not see is the one that mattered most.

Sources

  1. Armbrust, M. et al. "Spark SQL: Relational Data Processing in Spark." SIGMOD 2015. — https://dl.acm.org/doi/10.1145/2723372.2742797
  2. Armbrust, M. et al. "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores." VLDB 2020. — https://www.vldb.org/pvldb/vol13/p3411-armbrust.pdf
  3. Armbrust, M. et al. "Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark." SIGMOD 2018. — https://dl.acm.org/doi/10.1145/3183713.3190664
  4. Public interviews and talks by Michael Armbrust discussing Berkeley applications, Databricks engineering blog — https://www.databricks.com/blog
  5. Databricks company valuation reports, 2023-2024
  6. Zaharia, M. et al. "Apache Spark: A Unified Engine for Big Data Processing." Communications of the ACM, 2016. — https://dl.acm.org/doi/10.1145/2934664