MFA (multi-factor authentication) forces users to generate a code on a device (usually a mobile phone or hardware token) before doing important operations on S3
To use MFA-Delete, enable Versioning on the S3 bucket
You will need MFA to: permanently delete an object version, suspend versioning on the bucket
You won’t need MFA for: enabling versioning, listing deleted versions
Only the bucket owner (root account) can enable/disable MFA-Delete. MFA-Delete currently can only be enabled using the CLI (see the sketch below)
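A minimal sketch of enabling MFA-Delete; the bucket name, MFA device ARN, and 6-digit code are placeholders:
aws s3api put-bucket-versioning \
    --bucket my-bucket \
    --versioning-configuration Status=Enabled,MFADelete=Enabled \
    --mfa "arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456"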
S3 Default Encryption vs Bucket Policies
The old way to enable default encryption was to use a bucket policy that refuses any HTTP request without the proper encryption headers
The new way is to use the “default encryption” option in S3
Note: Bucket Policies are evaluated before “default encryption”
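For example, default encryption can be turned on with put-bucket-encryption; the bucket name is a placeholder, and SSE-S3 (AES256) is shown, though SSE-KMS works too:
aws s3api put-bucket-encryption \
    --bucket my-bucket \
    --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'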
S3 Access Logs
For audit purposes, you may want to log all access to S3 buckets. Any request made to S3, from any account, authorized or denied, will be logged into another S3 bucket.
That data can be analyzed using data analysis tools or Amazon Athena.
The log format is at: https://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html
S3 Access Logs: Warning: Do not set your logging bucket to be the monitored bucket. It will create a logging loop, and your bucket will grow in size exponentially.
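A sketch of enabling access logs with the CLI; the bucket names and prefix are placeholders, and the logging bucket must allow the S3 log delivery service to write to it:
aws s3api put-bucket-logging \
    --bucket my-monitored-bucket \
    --bucket-logging-status '{"LoggingEnabled":{"TargetBucket":"my-logging-bucket","TargetPrefix":"logs/"}}'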
S3 Replication (CRR & SRR)
Must enable versioning in both source and destination buckets
Cross Region Replication (CRR) - Use cases: compliance, lower latency access, replication across accounts
Same Region Replication (SRR) - Use cases: log aggregation, live replication between production and test accounts
Buckets can be in different accounts. Copying is asynchronous. Must give proper IAM permissions to S3
After activating, only new objects are replicated (not retroactive)
For DELETE operations:
If you delete without a version ID, S3 adds a delete marker in the source bucket; the delete marker is not replicated.
If you delete with a specific version ID, the version is deleted in the source bucket only; the deletion is not replicated.
There is no “chaining” of replication: if bucket 1 replicates into bucket 2, which replicates into bucket 3, objects created in bucket 1 are not replicated to bucket 3
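A sketch of a replication rule using the older (V1) rule schema; the bucket names and IAM role ARN are placeholders, and versioning must already be enabled on both buckets:
aws s3api put-bucket-replication \
    --bucket my-source-bucket \
    --replication-configuration '{
      "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
      "Rules": [{
        "Status": "Enabled",
        "Prefix": "",
        "Destination": {"Bucket": "arn:aws:s3:::my-destination-bucket"}
      }]
    }'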
S3 Pre-Signed URLs
Can generate pre-signed URLs using the SDK or CLI: downloads are easy (can use the CLI), uploads are harder (must use the SDK)
Valid for a default of 3600 seconds; change the timeout with the --expires-in [TIME_IN_SECONDS] argument
Users given a pre-signed URL inherit the permissions of the person who generated the URL for GET / PUT
Use cases: allow only logged-in users to download a premium video from your S3 bucket; allow an ever-changing list of users to download files by generating URLs dynamically; allow a user to temporarily upload a file to a precise location in your bucket.
aws configure set default.s3.signature_version s3v4 (generate pre-signed URLs with Signature Version 4, needed e.g. for KMS-encrypted objects)
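For example, a download URL valid for 5 minutes; the bucket and key are placeholders:
aws s3 presign s3://my-bucket/premium-video.mp4 --expires-in 300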
Amazon S3 Standard – General Purpose
High durability (99.999999999%) of objects across multiple AZs, 99.99% availability over a given year
If you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years
Sustain 2 concurrent facility failures
Use Cases: Big Data analytics, mobile & gaming applications, content distribution…
Amazon S3 Standard-Infrequent Access (IA)
Suitable for data that is less frequently accessed, but requires rapid access when needed
High durability (99.999999999%) of objects across multiple AZs, 99.9% Availability
Low cost compared to Amazon S3 Standard
Sustain 2 concurrent facility failures
Use Cases: As a data store for disaster recovery, backups…
Amazon S3 One Zone-Infrequent Access
Same as IA but data is stored in a single AZ
High durability (99.999999999%) of objects in a single AZ; data is lost if the AZ is destroyed, 99.5% availability
Low latency and high throughput performance
Supports SSL for data in transit and encryption at rest
Lower cost than S3 Standard-IA (by about 20%)
Use cases: storing secondary backup copies of on-premises data, or storing data you can recreate
Amazon S3 Intelligent Tiering
Same low latency and high throughput performance of S3 Standard
Small monthly monitoring and auto-tiering fee
Automatically moves objects between two access tiers based on changing access patterns
Designed for durability of 99.999999999% of objects across multiple Availability Zones
Resilient against events that impact an entire Availability Zone
Designed for 99.9% availability over a given year
Amazon Glacier
Low cost object storage meant for archiving / backup
Data is retained for the long term (tens of years)
Alternative to on-premises magnetic tape storage
Average annual durability is 99.999999999%
Cost per storage per month ($0.004 / GB) + retrieval cost
Each item in Glacier is called an “Archive” (up to 40 TB); Archives are stored in “Vaults”
Amazon Glacier – 3 retrieval options: Expedited (1 to 5 minutes), Standard (3 to 5 hours), Bulk (5 to 12 hours)
Minimum storage duration of 90 days
Amazon Glacier Deep Archive
For long-term storage, cheaper than Glacier – 2 retrieval options: Standard (12 hours), Bulk (48 hours)
Minimum storage duration of 180 days
S3 Lifecycle policies
You can transition objects between storage classes. Moving objects can be automated using a lifecycle configuration
For infrequently accessed objects, move them to STANDARD_IA
For archive objects you don’t need in real-time, GLACIER or DEEP_ARCHIVE
Transition actions: define when objects are transitioned to another storage class. Example: move objects to Standard-IA 60 days after creation, then to Glacier for archiving after 6 months
Expiration actions: configure objects to expire (delete) after some time
Access log files can be set to delete after 365 days
Can be used to delete old versions of files (if versioning is enabled). Can be used to delete incomplete multi-part uploads
Rules can be created for a certain prefix (ex: s3://mybucket/mp3/*). Rules can be created for certain object tags (ex: Department: Finance)
Example: S3 source images can be on STANDARD, with a lifecycle configuration to transition them to GLACIER after 45 days. S3 thumbnails can be on ONEZONE_IA, with a lifecycle configuration to expire them (delete them) after 45 days. (See the sketch below.)
Example: you need to enable S3 versioning in order to have object versions, so that “deleted objects” are in fact hidden by a “delete marker” and can be recovered. You can transition these “noncurrent versions” of the object to STANDARD_IA, and afterwards to DEEP_ARCHIVE
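As a sketch, the “source images” rule above could look like this; the bucket name and prefix are placeholders:
aws s3api put-bucket-lifecycle-configuration \
    --bucket my-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "archive-source-images",
        "Status": "Enabled",
        "Filter": {"Prefix": "images/"},
        "Transitions": [{"Days": 45, "StorageClass": "GLACIER"}]
      }]
    }'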
S3 baseline performance
Amazon S3 automatically scales to high request rates, with a latency of 100–200 ms
Your application can achieve at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket.
There are no limits to the number of prefixes in a bucket.
Example (object path => prefix):
bucket/folder1/sub1/file => /folder1/sub1/
bucket/folder1/sub2/file => /folder1/sub2/
bucket/1/file => /1/
bucket/2/file => /2/
If you spread reads across all four prefixes above evenly, you can achieve 22,000 GET/HEAD requests per second (4 × 5,500)
S3 KMS limitation
If you use SSE-KMS, you may be impacted by KMS limits: encryption and decryption requests count toward the KMS quota per second (5,500, 10,000, or 30,000 req/s, depending on the region)
When you upload, it calls the GenerateDataKey KMS API. When you download, it calls the Decrypt KMS API
As of today, you cannot request a quota increase for KMS
S3 Performance
Multi-Part upload: recommended for files > 100 MB, must be used for files > 5 GB. Can help parallelize uploads (speeds up transfers)
S3 Transfer Acceleration (upload only): Increase transfer speed by transferring file to an AWS edge location which will forward the data to the S3 bucket in the target region. Compatible with multi-part upload.
S3 Byte-Range Fetches: parallelize GETs by requesting specific byte ranges; better resilience in case of failures (see the sketch below)
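For example, fetching only the first megabyte of an object; the bucket and key are placeholders:
aws s3api get-object --bucket my-bucket --key large-file.bin --range bytes=0-1048575 part-0.bin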
S3 and Glacier Select
Retrieve less data using SQL by performing server side filtering
Can filter by rows & columns (simple SQL statements); like filtering CSV files
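A sketch of a server-side filter over a CSV object with select-object-content; the bucket, key, and column names are placeholders:
aws s3api select-object-content \
    --bucket my-bucket \
    --key data.csv \
    --expression "SELECT s.name FROM S3Object s WHERE s.department = 'Finance'" \
    --expression-type SQL \
    --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
    --output-serialization '{"CSV": {}}' \
    result.csv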
S3 Event Notifications
Use case: generate thumbnails of images uploaded to S3
Can create as many “S3 events” as desired
S3 event notifications typically deliver events in seconds but can sometimes take a minute or longer
If two writes are made to a single non-versioned object at the same time, it is possible that only a single event notification will be sent
If you want to ensure that an event notification is sent for every successful write, you can enable versioning on your bucket.
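A sketch of a notification configuration that invokes a hypothetical Lambda function on object creation; the bucket name and function ARN are placeholders, and the function must grant S3 permission to invoke it:
aws s3api put-bucket-notification-configuration \
    --bucket my-bucket \
    --notification-configuration '{
      "LambdaFunctionConfigurations": [{
        "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:generate-thumbnail",
        "Events": ["s3:ObjectCreated:*"]
      }]
    }'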
Amazon Athena
Serverless service to perform analytics directly against S3 files
Uses SQL language to query the files
Has a JDBC / ODBC driver
Charged per query and amount of data scanned
Supports CSV, JSON, ORC, Avro, and Parquet (built on Presto)
Use cases: Business intelligence / analytics / reporting, analyze & query VPC Flow Logs, ELB Logs, CloudTrail trails, etc…
Exam Tip: Analyze data directly on S3 => use Athena
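A sketch of running a query from the CLI; the database, table, and output location are placeholders:
aws athena start-query-execution \
    --query-string "SELECT status, COUNT(*) FROM logs.elb_logs GROUP BY status" \
    --query-execution-context Database=logs \
    --result-configuration OutputLocation=s3://my-athena-results/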
Questions
MFA Delete forces users to use MFA tokens before deleting objects. It’s an extra level of security to prevent accidental deletes
S3 CRR is used to replicate data from an S3 bucket to another one in a different region
Pre-signed URLs are temporary and grant time-limited access to some actions in your S3 bucket.
When a file is over 100 MB, multi-part upload is recommended: it uploads many parts in parallel, maximizing the throughput of your bandwidth, and only a smaller part needs to be retried if that part fails.
Athena: serverless data analysis service allowing you to query data in S3