High durability (99.999999999%) of objects across multiple AZs
If you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years
99.99% Availability over a given year
Sustain 2 concurrent facility failures
Use Cases: Big Data analytics, mobile & gaming applications, content distribution…
S3 Standard – Infrequent Access (IA)
Suitable for data that is less frequently accessed, but requires rapid access when needed
High durability (99.999999999%) of objects across multiple AZs
99.9% Availability
Low cost compared to Amazon S3 Standard
Sustain 2 concurrent facility failures
Use Cases: As a data store for disaster recovery, backups…
S3 One Zone - Infrequent Access (IA)
Same as IA but data is stored in a single AZ
High durability (99.999999999%) of objects in a single AZ; data lost when AZ is destroyed
99.5% Availability
Low latency and high throughput performance
Supports SSL for data in transit and encryption at rest
Low cost compared to IA (by 20%)
Use Cases: Storing secondary backup copies of on-premises data, or storing data you can recreate
S3 Intelligent Tiering
Same low latency and high throughput performance of S3 Standard
Small monthly monitoring and auto-tiering fee
Automatically moves objects between two access tiers based on changing access patterns
Designed for durability of 99.999999999% of objects across multiple Availability Zones
Resilient against events that impact an entire Availability Zone
Designed for 99.9% availability over a given year
Amazon Glacier
Low cost object storage meant for archiving / backup
Data is retained for the long term (tens of years)
Alternative to on-premises magnetic tape storage
Average annual durability is 99.999999999%
Cost per storage per month ($0.004 / GB) + retrieval cost
Each item in Glacier is called “Archive” (up to 40TB)
Archives are stored in “Vaults”
Amazon Glacier & Glacier Deep Archive
Amazon Glacier – 3 retrieval options:
Expedited (1 to 5 minutes)
Standard (3 to 5 hours)
Bulk (5 to 12 hours)
Minimum storage duration of 90 days
Amazon Glacier Deep Archive – for long term storage – cheaper:
Standard (12 hours)
Bulk (48 hours)
Minimum storage duration of 180 days
S3 Storage Classes Comparison
S3 – Moving between storage classes
You can transition objects between storage classes
For infrequently accessed objects, move them to STANDARD_IA
For archive objects you don’t need in real-time, move them to GLACIER or DEEP_ARCHIVE
Moving objects can be automated using a lifecycle configuration
S3 Lifecycle Rules
Transition actions: define when objects are transitioned to another storage class.
Move objects to Standard IA class 60 days after creation
Move to Glacier for archiving after 6 months
Expiration actions: configure objects to expire (delete) after some time
Access log files can be set to delete after 365 days
Can be used to delete old versions of files (if versioning is enabled)
Can be used to delete incomplete multi-part uploads
Rules can be created for a certain prefix (ex - s3://mybucket/mp3/*)
Rules can be created for certain object tags (ex - Department: Finance)
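A minimal boto3 sketch of such a lifecycle configuration (bucket name, prefix and day counts are just examples):

```python
import boto3

s3 = boto3.client("s3")

# Transition mp3/ objects to Standard-IA after 60 days, Glacier after 180 days,
# and expire them after 365 days (all values are illustrative)
s3.put_bucket_lifecycle_configuration(
    Bucket="mybucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "mp3-archival",
                "Filter": {"Prefix": "mp3/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 60, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```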
AWS S3 - Versioning
You can version your files in AWS S3
It is enabled at the bucket level
Same key overwrite will increment the “version”: 1, 2, 3….
It is best practice to version your buckets
Protect against unintended deletes (ability to restore a version)
Easy roll back to previous version
Any file that is not versioned prior to enabling versioning will have version “null”
You can “suspend” versioning
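A one-call boto3 sketch for enabling (or later suspending) versioning on a bucket (bucket name is an example):

```python
import boto3

s3 = boto3.client("s3")

# Versioning is a bucket-level setting; pass "Suspended" instead to suspend it
s3.put_bucket_versioning(
    Bucket="mybucket",
    VersioningConfiguration={"Status": "Enabled"},
)
```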
S3 Cross Region Replication
Must enable versioning (source and destination)
Buckets must be in different AWS regions
Buckets can be in different AWS accounts
Copying is asynchronous
Must give proper IAM permissions to S3
Use cases: compliance, lower latency access, replication across accounts
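A hedged boto3 sketch of a cross region replication configuration (bucket names and the IAM role ARN are placeholders; both buckets must already have versioning enabled):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="source-bucket",  # e.g. in us-east-1
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",      # empty prefix = replicate the whole bucket
                "Status": "Enabled",
                # destination bucket in another region (or another account)
                "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"},
            }
        ],
    },
)
```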
AWS S3 – ETag (Entity Tag)
How do you verify if a file has already been uploaded to S3?
Names work, but how are you sure the file is exactly the same?
For this, you can use AWS ETags:
For simple uploads (less than 5GB), it’s the MD5 hash
For multi-part uploads, it’s more complicated, no need to know the algorithm
Using ETag, we can ensure integrity of files
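A small sketch comparing a local MD5 hash with the S3 ETag (only valid for single-part, non-KMS-encrypted uploads; bucket and file names are examples):

```python
import hashlib
import boto3

def local_md5(path):
    # Stream the file so large files don't need to fit in memory
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

s3 = boto3.client("s3")
etag = s3.head_object(Bucket="mybucket", Key="my_file1.txt")["ETag"].strip('"')

# For simple uploads the ETag is the MD5 hash, so equality means the upload is intact
print(etag == local_md5("my_file1.txt"))
```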
AWS S3 Performance – Key Names (a historic fact that still appears in the exam)
When you had > 100 TPS (transactions per second), S3 performance could degrade
Behind the scenes, each object goes to an S3 partition, and for the best performance we want the highest partition distribution
In the exam, and historically, it was recommended to put random characters in front of your key names to optimise performance: /5r4d_my_folder/my_file1.txt, /a91e_my_folder/my_file2.txt
It was recommended never to use dates to prefix keys: /2018_09_09_my_folder/my_file1.txt, /2018_09_10_my_folder/my_file2.txt
DynamoDB – Primary Keys
Option 1: Partition key only (HASH)
Partition key must be unique for each item
Partition key must be “diverse” so that the data is distributed
Example: user_id for a users table
Option 2: Partition key + Sort Key
The combination must be unique
Data is grouped by partition key
Sort key == range key
Example: users-games table
user_id for the partition key
game_id for the sort key
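A boto3 sketch of Option 2 for the users-games example (table name, attribute types and capacity values are assumptions):

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="users-games",
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "game_id", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "game_id", "AttributeType": "S"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
)
```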
DynamoDB – Partition Keys exercise
We’re building a movie database
What is the best partition key to maximize data distribution?
• movie_id
• producer_name
• leader_actor_name
• movie_language
• movie_id has the highest cardinality so it’s a good candidate
• movie_language doesn’t take many values and may be skewed towards English so it’s not a great partition key
DynamoDB in Big Data
Common use cases include:
• Mobile apps
• Gaming
• Digital ad serving
• Live voting
• Audience interaction for live events
• Sensor networks
• Log ingestion
• Access control for web-based content
• Metadata storage for Amazon S3 objects
• E-commerce shopping carts
• Web session management
DynamoDB – Anti Patterns
Prewritten application tied to a traditional relational database: use RDS instead
Joins or complex transactions
Binary Large Object (BLOB) data: store data in S3 & metadata in DynamoDB
Large data with low I/O rate: use S3 instead
DynamoDB – Provisioned Throughput
Table must have provisioned read and write capacity units
Read Capacity Units (RCU): throughput for reads
Write Capacity Units (WCU): throughput for writes
Option to setup auto-scaling of throughput to meet demand
Throughput can be exceeded temporarily using “burst credits”
If burst credits are empty, you’ll get a “ProvisionedThroughputExceededException”.
It’s then advised to do an exponential back-off retry
DynamoDB – Write Capacity Units
One write capacity unit represents one write per second for an item up to 1 KB in size.
If the items are larger than 1 KB, more WCU are consumed
Example 1: we write 10 objects per second of 2 KB each.
We need 2 * 10 = 20 WCU
Example 2: we write 6 objects per second of 4.5 KB each
We need 6 * 5 = 30 WCU (4.5 gets rounded to the upper KB)
Example 3: we write 120 objects per minute of 2 KB each
We need 120 / 60 * 2 = 4 WCU
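The WCU arithmetic above, as a small helper (a sketch of the rule, not an AWS API):

```python
import math

def wcu(writes_per_second, item_size_kb):
    # One WCU = one write per second for an item up to 1 KB; size rounds up to the next KB
    return writes_per_second * math.ceil(item_size_kb)

print(wcu(10, 2))        # 20
print(wcu(6, 4.5))       # 30
print(wcu(120 / 60, 2))  # 4.0
```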
Strongly Consistent Read vs Eventually Consistent Read
Eventually Consistent Read: If we read just after a write, it’s possible we’ll get an unexpected response because of replication
Strongly Consistent Read: If we read just after a write, we will get the correct data
By default, DynamoDB uses Eventually Consistent Reads, but GetItem, Query & Scan provide a “ConsistentRead” parameter you can set to True
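A small boto3 sketch of the ConsistentRead flag (table and key names reuse the earlier users-games example):

```python
import boto3

table = boto3.resource("dynamodb").Table("users-games")

# Strongly consistent read: returns the latest committed value, costs twice the RCU
response = table.get_item(
    Key={"user_id": "u1", "game_id": "g1"},
    ConsistentRead=True,
)
item = response.get("Item")
```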
DynamoDB – Read Capacity Units
One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size.
If the items are larger than 4 KB, more RCU are consumed
Example 1: 10 strongly consistent reads per second of 4 KB each
We need 10 * 4 KB / 4 KB = 10 RCU
Example 2: 16 eventually consistent reads per second of 12 KB each
We need (16 / 2) * ( 12 / 4 ) = 24 RCU
Example 3: 10 strongly consistent reads per second of 6 KB each
We need 10 * 8 KB / 4 = 20 RCU (we have to round up 6 KB to 8 KB)
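The matching RCU helper (again just a sketch of the arithmetic):

```python
import math

def rcu(reads_per_second, item_size_kb, strongly_consistent=True):
    # One RCU = one strongly consistent 4 KB read per second,
    # or two eventually consistent 4 KB reads per second; size rounds up to the next 4 KB
    units = reads_per_second * math.ceil(item_size_kb / 4)
    return units if strongly_consistent else units / 2

print(rcu(10, 4))                              # 10
print(rcu(16, 12, strongly_consistent=False))  # 24.0
print(rcu(10, 6))                              # 20
```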
DynamoDB - Throttling
If we exceed our RCU or WCU, we get ProvisionedThroughputExceededExceptions
Reasons:
• Hot keys / partitions: one partition key is being read too many times (popular item for ex)
• Very large items: remember RCU and WCU depend on the size of items
Solutions:
• Exponential back-off when exception is encountered (already in SDK)
• Distribute partition keys as much as possible
• If RCU issue, we can use DynamoDB Accelerator (DAX)
DynamoDB – Partitions Internal
DynamoDB – Writing Data
PutItem - Write data to DynamoDB (create data or full replace)
• Consumes WCU
UpdateItem – Update data in DynamoDB (partial update of attributes)
• Possibility to use Atomic Counters and increase them
Conditional Writes:
• Accept a write / update only if conditions are respected, otherwise reject
• Helps with concurrent access to items
No performance impact
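A boto3 sketch of an atomic counter and a conditional write (table, key and attribute names are hypothetical; expression attribute names avoid reserved-word clashes):

```python
import boto3

dynamodb = boto3.client("dynamodb")
key = {"user_id": {"S": "u1"}, "game_id": {"S": "g1"}}

# Atomic counter: increment without a read-modify-write cycle
dynamodb.update_item(
    TableName="users-games",
    Key=key,
    UpdateExpression="ADD #s :inc",
    ExpressionAttributeNames={"#s": "score"},
    ExpressionAttributeValues={":inc": {"N": "1"}},
)

# Conditional write: only succeeds if 'finished' has never been set on the item
dynamodb.update_item(
    TableName="users-games",
    Key=key,
    UpdateExpression="SET #f = :t",
    ConditionExpression="attribute_not_exists(#f)",
    ExpressionAttributeNames={"#f": "finished"},
    ExpressionAttributeValues={":t": {"BOOL": True}},
)
```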
DynamoDB – Deleting Data
DeleteItem
• Delete an individual row
• Ability to perform a conditional delete
DeleteTable
• Delete a whole table and all its items
Much quicker deletion than calling DeleteItem on all items
DynamoDB – Batching Writes
BatchWriteItem
Up to 25 PutItem and / or DeleteItem in one call
Up to 16 MB of data written
Up to 400 KB of data per item
Batching allows you to save in latency by reducing the number of API calls done against DynamoDB
Operations are done in parallel for better efficiency
It’s possible for part of a batch to fail, in which case we have to retry the failed items (using an exponential back-off algorithm)
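A boto3 sketch of BatchWriteItem with retry of unprocessed items using exponential back-off (table and key values reuse the users-games example):

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")

request_items = {
    "users-games": [
        {"PutRequest": {"Item": {"user_id": {"S": f"u{i}"}, "game_id": {"S": "g1"}}}}
        for i in range(25)  # max 25 PutItem / DeleteItem requests per call
    ]
}

backoff = 0.1
while request_items:
    response = dynamodb.batch_write_item(RequestItems=request_items)
    # Items throttled by DynamoDB come back as UnprocessedItems and must be retried
    request_items = response.get("UnprocessedItems", {})
    if request_items:
        time.sleep(backoff)
        backoff *= 2
```

The higher-level `table.batch_writer()` in the boto3 resource API handles this batching and retrying for you.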
DynamoDB – Reading Data
GetItem:
Read based on Primary key
Primary Key = HASH or HASH-RANGE
Eventually consistent read by default
Option to use strongly consistent reads (more RCU - might take longer)
ProjectionExpression can be specified to include only certain attributes
BatchGetItem:
Up to 100 items
Up to 16 MB of data
Items are retrieved in parallel to minimize latency
DynamoDB – Query
Query returns items based on the partition key value (with an optional sort key condition)
FilterExpression to further filter (client side filtering)
Returns:
Up to 1 MB of data
Or number of items specified in Limit
Able to do pagination on the results
Can query table, a local secondary index, or a global secondary index
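A boto3 sketch of a paginated Query (table and key names reuse the users-games example; the filter attribute is hypothetical):

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("users-games")

items, start_key = [], None
while True:
    kwargs = {
        "KeyConditionExpression": Key("user_id").eq("u1"),  # partition key is mandatory
        "FilterExpression": Attr("score").gt(100),          # filtering happens after the read (RCU still consumed)
        "Limit": 50,
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    response = table.query(**kwargs)
    items.extend(response["Items"])
    start_key = response.get("LastEvaluatedKey")            # pagination: keep reading until exhausted
    if not start_key:
        break
```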
DynamoDB - Scan
Scan the entire table and then filter out data (inefficient)
Returns up to 1 MB of data – use pagination to keep on reading
Consumes a lot of RCU
Limit the impact by using Limit, reducing the size of the results, and pausing between requests
For faster performance, use parallel scans:
Multiple instances scan multiple partitions at the same time
Increases the throughput and RCU consumed
Limit the impact of parallel scans just like you would for Scans
Can use a ProjectionExpression + FilterExpression (no change to RCU)
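A parallel scan sketch using Segment / TotalSegments (segment count and table name are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

table = boto3.resource("dynamodb").Table("users-games")
TOTAL_SEGMENTS = 4

def scan_segment(segment):
    # Each worker scans only its own slice of the table's partitions
    items, start_key = [], None
    while True:
        kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        response = table.scan(**kwargs)
        items.extend(response["Items"])
        start_key = response.get("LastEvaluatedKey")
        if not start_key:
            return items

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    all_items = [item for seg in pool.map(scan_segment, range(TOTAL_SEGMENTS)) for item in seg]
```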
DynamoDB – LSI (Local Secondary Index)
Alternate range key for your table, local to the hash key
Up to five local secondary indexes per table.
The sort key consists of exactly one scalar attribute.
The attribute that you choose must be a scalar String, Number, or Binary
LSI must be defined at table creation time
DynamoDB – GSI (Global Secondary Index)
To speed up queries on non-key attributes, use a Global Secondary Index
GSI = partition key + optional sort key
The index is a new “table” and we can project attributes on it
The partition key and sort key of the original table are always projected (KEYS_ONLY)
Can specify extra attributes to project (INCLUDE)
Can use all attributes from main table (ALL)
Must define RCU / WCU for the index
Possibility to add / modify GSI (not LSI)
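A hedged sketch of adding a GSI to an existing table with UpdateTable (index name, projection and capacity values are examples):

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_table(
    TableName="users-games",
    AttributeDefinitions=[{"AttributeName": "game_id", "AttributeType": "S"}],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "game_id-index",
                "KeySchema": [{"AttributeName": "game_id", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "KEYS_ONLY"},
                "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
            }
        }
    ],
)
```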
DynamoDB - DAX
DAX = DynamoDB Accelerator
Seamless cache for DynamoDB, no application re-write
Writes go through DAX to DynamoDB
Microsecond latency for cached reads & queries
Solves the Hot Key problem (too many reads)
5 minutes TTL for cache by default
Up to 10 nodes in the cluster
Multi AZ (3 nodes minimum recommended for production)
Secure (Encryption at rest with KMS, VPC, IAM, CloudTrail…)
DynamoDB Streams
Changes in DynamoDB (Create, Update, Delete) can end up in a DynamoDB Stream
This stream can be read by AWS Lambda, and we can then do:
React to changes in real time (welcome email to new users)
Create derivative tables / views
Insert into ElasticSearch
Could implement Cross Region Replication using Streams
Stream has 24 hours of data retention
Configurable batch size (up to 1,000 rows, 6 MB)
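A minimal sketch of a Lambda handler consuming a DynamoDB Stream (the email helper and attribute names are hypothetical):

```python
def handler(event, context):
    # Each record describes one item-level change (INSERT / MODIFY / REMOVE)
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            new_image = record["dynamodb"]["NewImage"]   # DynamoDB-typed attributes
            send_welcome_email(new_image["email"]["S"])  # e.g. welcome email to a new user

def send_welcome_email(address):
    print(f"Sending welcome email to {address}")         # placeholder for SES or similar
```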
DynamoDB Streams Kinesis Adapter
Use the KCL library to directly consume from DynamoDB Streams
You just need to add a “Kinesis Adapter” library
The interface and programming model are exactly the same as Kinesis Streams
That’s the alternative to using AWS Lambda
DynamoDB TTL (Time to Live)
TTL = automatically delete an item after an expiry date / time
TTL is provided at no extra cost, deletions do not use WCU / RCU
TTL is a background task operated by the DynamoDB service itself
Helps reduce storage and manage the table size over time
Helps adhere to regulatory norms
TTL is enabled per row (you define a TTL column, and add a date there)
DynamoDB typically deletes expired items within 48 hours of expiration
Deleted items due to TTL are also deleted in GSI / LSI
DynamoDB Streams can help recover expired items
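A boto3 sketch of enabling TTL (table name and the expire_at attribute are assumptions):

```python
import time
import boto3

client = boto3.client("dynamodb")

# Point TTL at a numeric attribute holding an epoch timestamp
client.update_time_to_live(
    TableName="sessions",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expire_at"},
)

# Items then carry their own expiry time (here: one hour from now)
client.put_item(
    TableName="sessions",
    Item={
        "session_id": {"S": "abc123"},
        "expire_at": {"N": str(int(time.time()) + 3600)},
    },
)
```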
DynamoDB – Security & Other Features
Security:
VPC Endpoints available to access DynamoDB without internet
Access fully controlled by IAM
Encryption at rest using KMS
Encryption in transit using SSL / TLS
Backup and Restore feature available
Point in time restore like RDS
No performance impact
Global Tables
Multi region, fully replicated, high performance
AWS Database Migration Service (DMS) can be used to migrate to DynamoDB (from Mongo, Oracle, MySQL, S3, etc…)
You can launch a local DynamoDB on your computer for development purposes
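A tiny sketch of pointing boto3 at DynamoDB Local for development (the port is the usual default; the dummy credentials are placeholders):

```python
import boto3

# Same API as the real service, but everything stays on your machine
dynamodb = boto3.client(
    "dynamodb",
    endpoint_url="http://localhost:8000",
    region_name="us-east-1",
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
)
print(dynamodb.list_tables())
```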
DynamoDB – Storing large objects
Max size of an item in DynamoDB = 400 KB
For large objects, store them in S3 and reference them in DynamoDB
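A sketch of the S3 + DynamoDB pattern for large objects (bucket, table and attribute names are made up):

```python
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.client("dynamodb")

# The large payload lives in S3...
with open("v1.mp4", "rb") as f:
    s3.put_object(Bucket="my-large-objects", Key="videos/v1.mp4", Body=f)

# ...and DynamoDB only stores a pointer plus metadata (well under the 400 KB item limit)
dynamodb.put_item(
    TableName="videos-metadata",
    Item={
        "video_id": {"S": "v1"},
        "s3_bucket": {"S": "my-large-objects"},
        "s3_key": {"S": "videos/v1.mp4"},
        "size_bytes": {"N": "734003200"},
    },
)
```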
AWS ElastiCache Overview
The same way RDS gives you managed relational databases…
…ElastiCache gives you managed Redis or Memcached
Caches are in-memory databases with really high performance, low latency
Helps reduce load off of databases for read intensive workloads
Helps make your application stateless
Write Scaling using sharding
Read Scaling using Read Replicas
Multi AZ with Failover Capability
AWS takes care of OS maintenance / patching, optimizations, setup, configuration, monitoring, failure recovery and backups
Redis Overview
Redis is an in-memory key-value store
Super low latency (sub ms)
Cache survives reboots by default (this is called persistence)
Great to host
User sessions
Leaderboard (for gaming)
Distributed states
Relieve pressure on databases (such as RDS)
Pub / Sub capability for messaging
Multi AZ with Automatic Failover for disaster recovery if you don’t want to lose your cache data
Support for Read Replicas
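A cache-aside sketch against an ElastiCache Redis endpoint using the redis-py client (endpoint, key scheme and the database helper are hypothetical):

```python
import json
import redis

r = redis.Redis(host="my-cluster.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)

def get_user(user_id):
    cached = r.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)                        # cache hit: no database call
    user = load_user_from_rds(user_id)                   # hypothetical RDS lookup
    r.setex(f"user:{user_id}", 3600, json.dumps(user))   # cache for one hour
    return user

def load_user_from_rds(user_id):
    return {"user_id": user_id, "name": "example"}       # placeholder
```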
Memcached Overview
Memcached is an in-memory object store
Cache doesn’t survive reboots
Use cases:
Quick retrieval of objects from memory
Cache often accessed objects
Overall, Redis has largely grown in popularity and has better feature sets than Memcached.
I would personally only use Redis for caching needs.
Questions
Your big data application is taking a lot of files from your local on-premise NFS storage and inserting them into S3.
As part of the data integrity verification process, the application downloads the files right after they’ve been uploaded. What will happen?: The application will receive a 200, as S3 is strongly consistent for new PUTs (read-after-write). S3 is eventually consistent only for DELETEs or overwrite PUTs
You are gathering various files from providers and plan on analyzing them once every month using Athena, which must return the query results immediately. You do not want to run a high risk of losing files and want to minimise costs. Which storage type do you recommend?:S3 Infrequent Access.
As part of your compliance as a bank, you must archive all logs created by all applications and ensure they cannot be modified or deleted for at least 7 years. Which solution should you use?:Glacier with a Vault Lock Policy
You are generating thumbnails in S3 from images. Images are in the images/ directory while thumbnails are in the thumbnails/ directory. After running some analytics, you realized that images are rarely read and you could optimise your costs by moving them to another S3 storage tier. What do you recommend that requires the least amount of changes?: Create a Lifecycle Rule for the images/ prefix
In order to perform fast big data analytics, it has been recommended by your analysts in Japan to continuously copy data from your S3 bucket in us-east-1. How do you recommend doing this at a minimal cost?: Enable cross region replication.
Your big data application is taking a lot of files from your local on-premise NFS storage and inserting them into S3. As part of the data integrity verification process, you would like to ensure the files have been properly uploaded at minimal cost. How do you proceed?:Compute the local ETag for each file and compare them with AWS S3’s ETag
Your application plans to have 15,000 reads and writes per second to S3 from thousands of device ids. Which naming convention do you recommend?: Use the device id in the key prefix: you get about 3k reads per second per prefix, so having many device-id prefixes parallelizes your reads and writes.
You are looking to have your files encrypted in S3 and do not want to manage the encryption yourself. You would like to have control over the encryption keys and ensure they’re securely stored in AWS. What encryption do you recommend?:SSE-KMS
Your website is deployed and sources its images from an S3 bucket. Everything works fine on the internet, but when you start the website locally to do some development, the images are not getting loaded. What’s the problem?:S3 CORS
What’s the maximum number of fields that can make a primary key in DynamoDB?:partition key + sort key. So 2.
What’s the maximum size of a row in DynamoDB?: 400 KB
You are writing items of 8 KB in size at a rate of 12 per second. What WCU do you need?: 12 * 8 = 96 WCU
You are doing strongly consistent reads of 10 KB items at a rate of 10 per second. What RCU do you need?: 10 KB gets rounded up to 12 KB, divided by 4 KB = 3, times 10 per second = 30 RCU
You are doing 12 eventually consistent reads per second, and each item has a size of 16 KB. What RCU do you need?: we can do 2 eventually consistent reads per second of a 4 KB item with 1 RCU, so (12 / 2) * (16 / 4) = 24 RCU
We are getting a ProvisionedThroughputExceededException but after checking the metrics, we see we haven’t exceeded the total RCU we had provisioned. What happened?: We have a hot partition / hot key; remember RCU and WCU are spread across all partitions.
You are about to enter the Christmas sale and you know a few items in your website are very popular and will be read often. Last year you had a ProvisionedThroughputExceededException. What should you do this year?:Create a DAX cluster.
You would like to react in real-time to users de-activating their account and send them an email to try to bring them back. The best way of doing it is to:Integrate Lambda with a DynamoDB stream.
You would like to have DynamoDB automatically delete old data for you. What should you use?:Use TTL.
You are looking to improve the performance of your RDS database by caching some of the most common rows and queries. Which technology do you recommend?:ElastiCache