Building Extensible Compute Platforms

The goal

There are several goals when thinking about internal compute platforms. What's interesting is that there are many options in a rich build vs buy space

  1. Costs
  2. Complexity
  3. Scalability
  4. Reliability
  5. Security
  6. Customer Experience
  7. Developer Experience

Context

The most recent example where the appropriate context helps to illustrate the impact on infrastructure decisions is Basecamp's leave AWS and GCP and return to their own datacenter.[1]

The article outlines two use-cases where the public cloud (either as infrastructure or platform) makes sense.

One is just starting out and you want to focus as little on DevOps and infrastructure as possible. A solution like Heroku makes sense. Just push code from CLI and you have a live app on the Internet!

The other solution was spiky demand, such as a surge in traffic or, a better case, is heavy computation that only comes in big and sudden spikes. Machine Learning training or intensive batch processes are examples.

In Basecamp's case, they were steady-state with predictable use across the stack. For them, bringing the compute back in-house gave them tremendous savings.

However, imagine a case where a business has a heavy reliance on complex data pipelines, and they use a cloud that already has this infrastructure. So it's not only taking advantage of elasticity when there are bursts, but there are additional primitives to manage the data from ingestion, transform, and processing.

The decision in this case becomes a build vs. buy, not just at the computing resource level, but the next level of abstraction.

Let's consider a different context which changes the computing platform context. You want data security, but the customer experience is the data needs to be fully transparent and trusted.

Blockchains provide a different model. The performance is much slower. The security features around ensuring that the block history or a new block is not forged are very high and very hard because the customer requirement is an adversarial, trust-minimzied one.

Another consideration of security might be isolation of workloads and the "cost" -- reducing administrative overhead on the compute. Concurrent might be a development environment that has embraced micro-services as an architecture. This could lead to compute that supports a WASM run-time with functions execution. Some of those "functions" might be customer-generated so the architecture to isolate each execution from the other is important.

Across all of these, the developer experience needs to be considered.

For example, within an internal company with multiple products, the priority is to achieve operating leverage: shorter TTM (Time To Market). I believe this will become a greater priority across all companies building software, which will make the design of internal development platforms high strategic value. Why Platform Product Managers Impact Valuations 1

What is the Infrastructure Inventory?


  1. Why we're leaving the cloud ↩︎