Skip to content

Netflix Guide To Microservices

What Microservices are Not#

Monolithic code base#

  • Everyone contributes to a single codebase
  • Changes and errors were difficult to diagnose
  • A Week troubleshooting memory leaks

Monolithic Database#

  • One piece of hardware running 1 big database
  • When it went down, everything went down
  • Looking for bigger hardware to vertically scale the application
  • Adding a column to a table was a big cross functional process

What is a Microservice#

Developing a single application as a suite of small services, each running in its own process. Communicating with lightweight mechanisms often an HTTP resource API. - Martin Fowler

  • Separation of concerns
  • scalability - lend themselves to horizontal scaling and workload partitioning
  • Virtualisation and elasticity - automated operations and on demand provisioning

Edge Services#

ELB (Elastic Load Balancer) -> Zuul Proxy layer (dynamic routing) -> Core API

Middle Tier and Platform Services#

  • AB testing service
  • Subscriber service
  • Recommendation service
  • Platform services: Routing, configuration and crypto
  • Persistence: Cache and DB

Data is typically stored in your persistence layer

The microservice is an abstraction - containing all these things:

  • Service client
  • Persistence
  • Cache client

It is not just the stateless application

Challenges and Solutions#

  • Dependency
  • Scale
  • Variance
  • Change


Intra-service requests#

  • Network or latency issues
  • service you are calling is not fast and efficient

A single service failing could cascade issues

To prevent this netflix created hysterix:

  • structured way for handling timeouts and retries
  • fallback to show some data
  • isolated thread pools

Service should function even when dependencies go away


CAP Theorem says it is impossible for a distributed datastore to simultaneously provide 2 of these:

  • Consistency - every read receives the most recent write or error
  • Availability - every read receives a response
  • Partition Tolerance - System continues to operate despite arbitrary number of dropped messages

Netflix chose Cassandra and wanted eventual consistency


Everything fails

Don’t put all your eggs in one basket

3 Regions were used


Stateless service * no cache, no database * frequently accessed metadata * No instance affinity - a customer will use various instances * Loss of a node is a non-event (ephemeral) * Recovery is very quick

Autoscaling: Min, max and metric to use when scaling your group

Advantages of autoscaling:

  • compute efficiency (using on-demand capacity)
  • Node failures are not big deals
  • Traffic spikes, DDOS or performance bug allows you to absord that change and figure out what happened

Surviving instance failure - chaos monkey

Stateful service * Database and caches * Avoid storing business logic and state within one application * Loss of a node is a notable event

Redundancy is fundamental - 2 kidneys, 2 lungs

EVCache - relying on it at 800k - 1M Request Per Second


Variety in your architecture

The more variance you have the greater your challenges - increases complexity

Operational Drift#


  • Alert thresholds
  • timeouts, retries and fallbacks
  • throughput (RPS)

Autonomic nervous system - body just takes care of - don’t need to think about breathing or how to digest food. Make these processes subconscious.

Use continuous learning and automation - this is how knowledge becomes code.

Incident -> Resolution -> Review -> Remediation -> Analysis -> Best Practice -> Automation -> Adoption

Production ready checklist (automation and continuous improvement behind it):

  • Alerts
  • autoscaling
  • chaos
  • consistent naming
  • ELB config
  • Healthcheck
  • Immutable machine images
  • Squeeze testing
  • timeouts, retries and fallbacks

Polyglot and Containers#

Intentional - people adding new programming languages into the microservices architecture

The paved road (best of breed tech that worked best) - automation and integration baked in - so developers could be agile. Then there was the rocky road new tech and docker.

Cost of variance:

  • productivity tooling
  • different tooling for memory and cpu on containers
  • Base image fragmentation - more specialised
  • learning curve - things break in interesting and new ways

Key points:

  • Raise the awareness of costs
  • prioritise by impact
  • seek reusable solutions

Integrated delivery:

  • Test out the code changes with some real traffic and determine if the code is better
  • Staged deployments - 1 region at a time

Conway’s Law

Organisations which design systems are constrained to produce designs which are copies of the communication structures of these organisations

Any piece of software reflects the organisational structure that produced it

This is not solutions first, it was organisation first.

Organisation should be refactored based on the value or way we deliver value.