Databases, Events and Scale
Databases, Events and Scale#
- horizontally scaling vs vertical
- Sharding database (splitting index eg. A - H, I - Z) with MySQL, PostgreSQL must be done manually - whereas noSQL and some cloud provider give automatic sharding.
- Cluster proxy / load balancer in front of shards
- Availability - through replication
- Issues of consistency - sacrifice instantaneous consistency for eventual consistency
- NoSQL databases were designed with scale in mind - automatic scaling (cassandra) - shards in nosql are equal - shards in SQL are not
- gossip protocol controls communication between nodes
- NoSQL: availability prioritized over consistency
- DynamoDB is built for scale - it won’t let you write a query that won’t scale
Problems with relational databases at scale:
- Slow output of SQL Joins - CPU intensive - linear time - as db grows performance drops
- Difficulty horizontal scaling - vertical vs horizontal (limits for vertical) - horizontal scaling means each node only handles a portion of the data - tricky with relational dbs as there is no clear way to split the data. Horizontal scaling works best when a single request can be handled by a single node.
- Unbounded nature of queries - get all records (SELECT * FROM table) - scans/reads entire db. Grouping and ordering is also expensive. Aggregations are ready to take down your db at scale.
How NoSQL fix these problems:
- Do not allow joins - RDBMS use normalisation (do not repeat) benefits are storage efficiency and data integrity.. NoSQL you must know how your data must be read and written up front to avoid the need for flexibility. Data integrity is now the concern of the application - sometimes allowing duplicates easier for fields that do not change. Storage is now less expensive.
- Forces you to segment your data
- Puts explicit bounds on your queries
Cloud providers want you using noSQL as it is storage inefficient
NoSQL requires you split up the data into smaller pieces and that you run your queries just with each piece. A partition key or segment key.
For example a userID:
Ids 0 - 3333: Node 0 Ids 3333 - 6666: Node 1 Ids 6666 - 9999: Node 2
All queries must include the partition key
Most nosql dbs hash the partition key before assigning it to a node
Then there is a router or cluster proxy
This sharding mechanism is what allows for NoSQL’s essentially infinite scale without performance degradation
NoSQL puts a 1MB limit on the query and scan
If there are more than 1MB results the resturned query will have a
LastEvaluatedKey to handle pagination on the client side.
To guarantee single digit millisecond response.
- find node for partition tree - hash table: O(1)
- find starting value for sort key - B-tree: O(log n)
- read values - sequential read to max 1MB
No support for
It is likely better for tha application to store aggregations in a summary item collection.
Problems with NoSQL is paginating - fix on application side - just get the most recent. Also hot keys - frequently queries data - tricky to find when designing db. Can scale to 3000 read capacity units (RCU) and 1000 write capacity units (WCU) per second.
OLAP - Online analytical processing OLTP - Online transaction processing
Moving from OLTP to OLAP is done with an ETL process - extracting, transforming, and loading data.
Hash table: A key value store - During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored. the hash function is O(1).
Eg. When a file is uploaded to s3 a lambda function is called
Is it for your use case?
- Are you passing around self contained transactions
- Are useful events available for free? Eg. S3 to lambda
- Do you strongly decouple your microservices? when one service gets a spike in traffic - do the others have to follow suit? Perhaps other services use different programming languages in different offices in different departments
What is an event?
- can have some generic fields at the top level - standard envelope
- then event-detail
- have a source field
- similar to awx - cloudevents.io
- difficult to change after shipping events
- Why should you even care what it is - the software doesn’t care.
- filtering, routing and storing is important
- Some events - the order matters (FIFO) requires a single source and single destination. Sometimes only the single user session needs order.
- duplication? suppose your service crashes - you didn’t get the message so the infrastructure sends it again. To detect this - backends all have transactional databases and detect. Another way is to make api idempotent. Eg. Create queue - if it exists it does nothing
- kafka - broadcast an event and everything gets to see it (pub/sub)
- Getting the events: push or pull. Pushing events is promising you can process them as fast as we send them. Polling is better - only process when you have the capacity.
- Serveless approach - over http, broker/cluster - rabbitmq or kafka - have to be managed and have a flat top scaling cruve, advantages - don’t use http - use tcp/ip with ampq - lower latency solved with http2, http3 (QUIC) - U is UDP
- Filters - allow you to only subscribe to certain events
- json vs binary format? in kakfa and grpc alot of people use binary formats: avro, thrift, protocol buffers…bottleneck is not usually here
- heterogenous vs uniform events - majority are lots of different events - allowing strongly typed schemas
Reporting - push that information down to the client (clients have enough cpu to figure it out)
- Entity relationship diagram - tables
- table defining access patterns - table, gsi to query, filter conditions
- Look at what data is changing and what is immutable
NoSQL workbench for DynamoDB