Notes

AWS DAS Storage

AWS S3 Overview - Buckets

  • Amazon S3 allows people to store objects (files) in “buckets” (directories) (a bucket-creation sketch follows this list)
  • Buckets must have a globally unique name
  • Buckets are defined at the region level
  • Naming convention
  • No uppercase
  • No underscore
  • 3-63 characters long
  • Not an IP
  • Must start with lowercase letter or number
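
As a quick illustration, here is a minimal boto3 sketch of creating a bucket in a chosen region; the bucket name and region below are hypothetical placeholders:

```python
import boto3

# Assumes AWS credentials are already configured for the SDK.
s3 = boto3.client("s3", region_name="eu-west-1")

# Bucket names must be globally unique, lowercase, 3-63 characters, and not look like an IP.
s3.create_bucket(
    Bucket="my-globally-unique-bucket-name-123",                    # hypothetical name
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},  # the bucket lives in one region
)
```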

    AWS S3 Overview - Objects

  • Objects (files) have a Key. The key is the FULL path: /my_file.txt, /my_folder1/another_folder/my_file.txt (an upload sketch follows this list)
  • There’s no concept of “directories” within buckets (although the UI will trick you to think otherwise)
  • Just keys with very long names that contain slashes (“/”)
  • Object Values are the content of the body:
  • Max Size is 5TB
  • If uploading more than 5GB, must use “multi-part upload”
  • Metadata (list of text key / value pairs – system or user metadata)
  • Tags (Unicode key / value pair – up to 10) – useful for security / lifecycle
  • Version ID (if versioning is enabled)
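
A minimal boto3 sketch of uploading an object with a full-path key, user metadata and tags; the bucket name, key, metadata and tag values are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# The key is the full path inside the bucket; the "folders" are just part of the key name.
s3.put_object(
    Bucket="my-bucket",                              # hypothetical bucket
    Key="my_folder1/another_folder/my_file.txt",     # key = full path
    Body=b"hello world",                             # the object value (body)
    Metadata={"origin": "on-prem-nfs"},              # user metadata key / value pairs
    Tagging="Department=Finance&Project=analytics",  # up to 10 tags, URL-encoded string
)
```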

    AWS S3 - Consistency Model

  • Read after write consistency for PUTS of new objects
  • As soon as an object is written, we can retrieve it ex: (PUT 200 -> GET 200)
  • This is true, except if we did a GET before to see if the object existed ex: (GET 404 -> PUT 200 -> GET 404) – eventually consistent
  • Eventual Consistency for DELETES and PUTS of existing objects
  • If we read an object after updating, we might get the older version ex: (PUT 200 -> PUT 200 -> GET 200 (might be older version))
  • If we delete an object, we might still be able to retrieve it for a short time ex: (DELETE 200 -> GET 200)

    S3 Storage Classes

  • Amazon S3 Standard - General Purpose
  • Amazon S3 Standard-Infrequent Access (IA)
  • Amazon S3 One Zone-Infrequent Access
  • Amazon S3 Intelligent Tiering
  • Amazon Glacier
  • Amazon Glacier Deep Archive
  • Amazon S3 Reduced Redundancy Storage (deprecated - omitted)

    S3 Standard – General Purpose

  • High durability (99.999999999%) of objects across multiple AZs
  • If you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years
  • 99.99% Availability over a given year
  • Sustain 2 concurrent facility failures
  • Use Cases: Big Data analytics, mobile & gaming applications, content distribution…

    S3 Standard – Infrequent Access (IA)

  • Suitable for data that is less frequently accessed, but requires rapid access when needed
  • High durability (99.999999999%) of objects across multiple AZs
  • 99.9% Availability
  • Low cost compared to Amazon S3 Standard
  • Sustain 2 concurrent facility failures
  • Use Cases: As a data store for disaster recovery, backups…

    S3 One Zone - Infrequent Access (IA)

  • Same as IA but data is stored in a single AZ
  • High durability (99.999999999%) of objects in a single AZ; data is lost if the AZ is destroyed
  • 99.5% Availability
  • Low latency and high throughput performance
  • Supports SSL for data in transit and encryption at rest
  • Lower cost than S3 Standard-IA (by 20%)
  • Use Cases: Storing secondary backup copies of on-premise data, or storing data you can recreate

    S3 Intelligent Tiering

  • Same low latency and high throughput performance of S3 Standard
  • Small monthly monitoring and auto-tiering fee
  • Automatically moves objects between two access tiers based on changing access patterns
  • Designed for durability of 99.999999999% of objects across multiple Availability Zones
  • Resilient against events that impact an entire Availability Zone
  • Designed for 99.9% availability over a given year

    Amazon Glacier

  • Low cost object storage meant for archiving / backup
  • Data is retained for the longer term (10s of years)
  • Alternative to on-premise magnetic tape storage
  • Average annual durability is 99.999999999%
  • Storage cost per month ($0.004 / GB) + retrieval cost
  • Each item in Glacier is called “Archive” (up to 40TB)
  • Archives are stored in “Vaults”

    Amazon Glacier & Glacier Deep Archive

  • Amazon Glacier – 3 retrieval options:
  • Expedited (1 to 5 minutes)
  • Standard (3 to 5 hours)
  • Bulk (5 to 12 hours)
  • Minimum storage duration of 90 days
  • Amazon Glacier Deep Archive – for long term storage – cheaper:
  • Standard (12 hours)
  • Bulk (48 hours)
  • Minimum storage duration of 180 days

    S3 Storage Classes Comparison

    S3 – Moving between storage classes

  • You can transition objects between storage classes
  • For infrequently accessed objects, move them to STANDARD_IA
  • For archive objects you don’t need in real-time, GLACIER or DEEP_ARCHIVE
  • Moving objects can be automated using a lifecycle configuration

    S3 Lifecycle Rules

  • Transition actions: define when objects are transitioned to another storage class (a lifecycle-configuration sketch follows this list)
  • Move objects to Standard IA class 60 days after creation
  • Move to Glacier for archiving after 6 months
  • Expiration actions: configure objects to expire (delete) after some time
  • Access log files can be set to delete after 365 days
  • Can be used to delete old versions of files (if versioning is enabled)
  • Can be used to delete incomplete multi-part uploads
  • Rules can be created for a certain prefix (ex - s3://mybucket/mp3/*)
  • Rules can be created for certain object tags (ex - Department: Finance)
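
A hedged boto3 sketch of a lifecycle configuration combining the transition, expiration and multi-part clean-up ideas above; the bucket name, prefix and day counts are illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Transition "mp3/" objects to Standard-IA after 60 days, to Glacier after 180 days,
# expire them after 365 days, and clean up incomplete multi-part uploads.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-mp3",
                "Status": "Enabled",
                "Filter": {"Prefix": "mp3/"},
                "Transitions": [
                    {"Days": 60, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```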

    AWS S3 - Versioning

  • You can version your files in AWS S3
  • It is enabled at the bucket level
  • Same key overwrite will increment the “version”: 1, 2, 3….
  • It is best practice to version your buckets
  • Protect against unintended deletes (ability to restore a version)
  • Easy roll back to previous version
  • Any file that is not versioned prior to enabling versioning will have version “null”
  • You can “suspend” versioning
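
A minimal boto3 sketch of enabling (or suspending) versioning on a bucket; the bucket name is hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Versioning is enabled (or suspended) at the bucket level.
s3.put_bucket_versioning(
    Bucket="my-bucket",                             # hypothetical bucket
    VersioningConfiguration={"Status": "Enabled"},  # use "Suspended" to suspend versioning
)
```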

    S3 Cross Region Replication

  • Must enable versioning (source and destination)
  • Buckets must be in different AWS regions
  • Buckets can be in different accounts
  • Copying is asynchronous
  • Must give proper IAM permissions to S3
  • Use cases: compliance, lower latency access, replication across accounts

    AWS S3 – ETag (Entity Tag)

  • How do you verify if a file has already been uploaded to S3?
  • Names work, but how can you be sure the file is exactly the same?
  • For this, you can use AWS ETags (see the sketch after this list):
  • For simple uploads (less than 5GB), it’s the MD5 hash
  • For multi-part uploads, it’s more complicated, no need to know the algorithm
  • Using ETag, we can ensure integrity of files
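
A sketch of the ETag check for a simple (non multi-part) upload, where the ETag is the MD5 hash; the bucket, key and file path are hypothetical:

```python
import hashlib
import boto3

s3 = boto3.client("s3")

def local_md5(path: str) -> str:
    """Compute the MD5 hex digest of a local file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
            md5.update(chunk)
    return md5.hexdigest()

# For simple uploads (< 5 GB, single part), the ETag is the MD5 hash of the object.
head = s3.head_object(Bucket="my-bucket", Key="my_file.txt")  # hypothetical bucket / key
etag = head["ETag"].strip('"')                                # the ETag is returned wrapped in quotes

if etag == local_md5("my_file.txt"):
    print("File already uploaded and identical")
```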

    AWS S3 Performance – Key Names (historic fact and current exam)

  • When you had > 100 TPS (transaction per second), S3 performance could degrade
  • Behind the scenes, each object goes to an S3 partition and for the best performance, we want the highest partition distribution
  • In the exam, and historically, it was recommended to have random characters in front of your key name to optimise performance: /5r4d_my_folder/my_file1.txt, /a91e_my_folder/my_file2.txt
  • It was recommended never to use dates to prefix keys: /2018_09_09_my_folder/my_file1.txt, /2018_09_10_my_folder/my_file2.txt

    AWS S3 Performance – Key Names (current performance, not yet on the exam)

  • https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
  • As of July 17th 2018, we can scale up to 3500 RPS for PUT and 5500 RPS for GET for EACH PREFIX
  • “This S3 request rate performance increase removes any previous guidance to randomize object prefixes to achieve faster performance”
  • It’s a “good to know”, until the exam gets updated ☺

    AWS S3 Performance

  • For faster upload of large objects (required above 5GB), use multipart upload (a sketch follows this list):
  • parallelizes PUTs for greater throughput
  • maximize your network bandwidth
  • decrease time to retry in case a part fails
  • Use CloudFront to cache S3 objects around the world (improves reads)
  • S3 Transfer Acceleration (uses edge locations) – just need to change the endpoint you write to, not the code.
  • If using SSE-KMS encryption, you may be limited by your AWS KMS usage limits (~100s – 1000s downloads / uploads per second)
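
A sketch of a multi-part, parallel upload using boto3’s high-level transfer configuration; the file name, bucket, threshold and concurrency values are illustrative choices:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# upload_file switches to multi-part upload above the threshold and uploads parts in parallel.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # use multi-part above 100 MB
    multipart_chunksize=100 * 1024 * 1024,  # 100 MB parts
    max_concurrency=10,                     # parallel PUTs for more throughput
)

s3.upload_file("big_file.bin", "my-bucket", "backups/big_file.bin", Config=config)
```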

    S3 Encryption for Objects

  • There are 4 methods of encrypting objects in S3
  • SSE-S3: encrypts S3 objects using keys handled & managed by AWS
  • SSE-KMS: leverage AWS Key Management Service to manage encryption keys
  • SSE-C: when you want to manage your own encryption keys
  • Client Side Encryption
  • It’s important to understand which ones are adapted to which situation for the exam

    S3 Encryption for Objects: SSE-S3

  • SSE-S3: encryption using keys handled & managed by AWS S3
  • Object is encrypted server side
  • AES-256 encryption type
  • Must set header: “x-amz-server-side-encryption”: “AES256”

    S3 Encryption for Objects: SSE-KMS

  • SSE-KMS: encryption using keys handled & managed by KMS
  • KMS Advantages: user control + audit trail
  • Object is encrypted server side
  • Must set header: “x-amz-server-side-encryption”: “aws:kms” (a sketch covering both SSE-S3 and SSE-KMS follows this list)
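
A minimal boto3 sketch showing both server-side encryption options; the SDK sets the x-amz-server-side-encryption header from the ServerSideEncryption parameter (bucket, keys and the KMS key alias are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# SSE-S3: keys handled & managed by S3 (header value AES256).
s3.put_object(
    Bucket="my-bucket", Key="report.csv", Body=b"data",
    ServerSideEncryption="AES256",
)

# SSE-KMS: keys handled & managed by KMS (header value aws:kms),
# optionally pointing at a specific customer master key.
s3.put_object(
    Bucket="my-bucket", Key="report-kms.csv", Body=b"data",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-app-key",  # hypothetical key alias
)
```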

    S3 Encryption for Objects: SSE-C

  • SSE-C: server-side encryption using data keys fully managed by the customer outside of AWS
  • Amazon S3 does not store the encryption key you provide
  • HTTPS must be used
  • Encryption key must be provided in HTTP headers, for every HTTP request made

    S3 Encryption for Objects: Client Side Encryption

  • Client library such as the Amazon S3 Encryption Client
  • Clients must encrypt data themselves before sending to S3
  • Clients must decrypt data themselves when retrieving from S3
  • Customer fully manages the keys and encryption cycle

    S3 Encryption for Objects: Encryption in transit (SSL)

  • AWS S3 exposes:
  • HTTP endpoint: non encrypted
  • HTTPS endpoint: encryption in flight
  • You’re free to use the endpoint you want, but HTTPS is recommended
  • HTTPS is mandatory for SSE-C
  • Encryption in flight is also called SSL / TLS

    S3 CORS (Cross-Origin Resource Sharing)

  • If a website requests your S3 data from another origin, the bucket needs CORS enabled (a configuration sketch follows this list)
  • Cross Origin Resource Sharing allows you to limit which websites can request your files in S3 (and limit your costs)
  • It’s a popular exam question
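
A hedged boto3 sketch of a CORS configuration that only allows GET requests from one origin; the bucket name and origin are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Only allow GETs coming from the website that embeds the S3 files.
s3.put_bucket_cors(
    Bucket="my-bucket",  # hypothetical bucket
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedOrigins": ["https://www.example.com"],  # hypothetical origin
                "AllowedMethods": ["GET"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)
```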

    S3 Access Logs

  • For audit purpose, you may want to log all access to S3 buckets
  • Any request made to S3, from any account, authorized or denied, will be logged into another S3 bucket
  • That data can be analyzed using data analysis tools…
  • Or Amazon Athena as we’ll see later in this course!
  • The log format is at: https://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html

    S3 Security

  • User based
  • IAM policies - which API calls should be allowed for a specific user from IAM console
  • Resource Based
  • Bucket Policies - bucket wide rules from the S3 console - allows cross account
  • Object Access Control List (ACL) – finer grain
  • Bucket Access Control List (ACL) – less common

    S3 Bucket Policies

  • JSON based policies
  • Resources: buckets and objects
  • Actions: Set of API to Allow or Deny
  • Effect: Allow / Deny
  • Principal: The account or user to apply the policy to
  • Use S3 bucket policies to:
  • Grant public access to the bucket
  • Force objects to be encrypted at upload
  • Grant access to another account (Cross Account)
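
As an example of “force objects to be encrypted at upload”, here is a sketch of a bucket policy that denies any PutObject call arriving without the SSE header; the bucket name is hypothetical:

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny any PutObject request that does not carry the x-amz-server-side-encryption header.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-bucket/*",  # hypothetical bucket
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        }
    ],
}

s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```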

    S3 Default Encryption vs Bucket Policies

  • The old way to enable default encryption was to use a bucket policy and refuse any HTTP command without the proper headers (as in the policy sketch above)
  • The new way is to use the “default encryption” option in S3
  • Note: Bucket Policies are evaluated before “default encryption”

    S3 Security - Other

  • Networking: Supports VPC Endpoints (for EC2 instances in a VPC without internet access)
  • Logging and Audit: S3 access logs can be stored in another S3 bucket; API calls can be logged in AWS CloudTrail
  • User Security: MFA (multi factor authentication) can be required in versioned buckets to delete objects
  • Signed URLs: URLs that are valid only for a limited time (ex: premium video service for logged in users)

    Glacier

  • Low cost object storage meant for archiving / backup
  • Data is retained for the longer term (10s of years)
  • Alternative to on-premise magnetic tape storage
  • Average annual durability is 99.999999999%
  • Storage cost per month ($0.004 / GB) + retrieval cost
  • Each item in Glacier is called “Archive” (up to 40TB)
  • Archives are stored in “Vaults”
  • Exam tip: archival from S3 after XXX days => use Glacier

    Glacier Operations

  • Restore links have an expiry date
  • 3 retrieval options:
  • Expedited (1 to 5 minutes retrieval) – $0.03 per GB and $0.01 per request
  • Standard (3 to 5 hours) - $0.01 per GB and $0.05 per 1000 requests
  • Bulk (5 to 12 hours) - $0.0025 per GB and $0.025 per 1000 requests

    Glacier - Vault Policies & Vault Lock

  • Vault is a collection of archives
  • Each Vault has:
  • ONE vault access policy
  • ONE vault lock policy
  • Vault Policies are written in JSON
  • Vault Access Policy is similar to bucket policy (restrict user / account permissions)
  • Vault Lock Policy is a policy you lock, for regulatory and compliance requirements.
  • The policy is immutable, it can never be changed (that’s why it’s called LOCK)
  • Example 1: forbid deleting an archive if less than 1 year old
  • Example 2: implement WORM policy (write once read many)

    S3 Select & Glacier Select

  • Retrieve less data using SQL by performing server-side filtering (see the sketch after this list)
  • Can filter by rows & columns (simple SQL statements)
  • Less network transfer, less CPU cost client-side
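
A minimal boto3 sketch of S3 Select filtering a CSV server-side; the bucket, key and column names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to filter the CSV server-side and return only the matching rows / columns.
resp = s3.select_object_content(
    Bucket="my-bucket",            # hypothetical bucket
    Key="logs/2018/requests.csv",  # hypothetical key
    ExpressionType="SQL",
    Expression="SELECT s.ip, s.status FROM s3object s WHERE s.status = '404'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response payload is an event stream; Records events carry the filtered bytes.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```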

    S3 Select with Hadoop

  • Filter the data server-side in S3 before analyzing it with your cluster
  • Load less data into Hadoop, save network costs, transfer the data faster
  • [Diagram: the Hadoop cluster requests a CSV file from Amazon S3 using S3 Select; S3 performs the server-side filtering and sends back only the filtered dataset]

    DynamoDB

  • Fully Managed, Highly available with replication across 3 AZs
  • NoSQL database - not a relational database
  • Scales to massive workloads, distributed database
  • Millions of requests per second, trillions of rows, 100s of TB of storage
  • Fast and consistent in performance (low latency on retrieval)
  • Integrated with IAM for security, authorization and administration
  • Enables event driven programming with DynamoDB Streams
  • Low cost and auto scaling capabilities

    DynamoDB - Basics

  • DynamoDB is made of tables
  • Each table has a primary key (must be decided at creation time)
  • Each table can have an infinite number of items (= rows)
  • Each item has attributes (can be added over time – can be null)
  • Maximum size of an item is 400 KB
  • Data types supported are:
  • Scalar Types: String, Number, Binary, Boolean, Null
  • Document Types: List, Map
  • Set Types: String Set, Number Set, Binary Set

    DynamoDB – Primary Keys

  • Option 1: Partition key only (HASH)
  • The partition key must be unique for each item
  • The partition key must be “diverse” so that the data is distributed
  • Example: user_id for a users table
  • Option 2: Partition key + Sort Key
  • The combination must be unique
  • Data is grouped by partition key
  • Sort key == range key
  • Example: users-games table, user_id for the partition key, game_id for the sort key (a table-creation sketch follows this list)
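
A minimal boto3 sketch of creating the users-games table described in Option 2, with user_id as the partition (HASH) key and game_id as the sort (RANGE) key; the provisioned throughput values are illustrative:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="users-games",
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "game_id", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "game_id", "AttributeType": "S"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
)
```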

    DynamoDB – Partition Keys exercise

  • We’re building a movie database
  • What is the best partition key to maximize data distribution?
  • movie_id
  • producer_name
  • leader_actor_name
  • movie_language
  • movie_id has the highest cardinality so it’s a good candidate
  • movie_language doesn’t take many values and may be skewed towards English so it’s not a great partition key

    DynamoDB in Big Data

  • Common use cases include:
  • Mobile apps
  • Gaming
  • Digital ad serving
  • Live voting
  • Audience interaction for live events
  • Sensor networks
  • Log ingestion
  • Access control for web-based content
  • Metadata storage for Amazon S3 objects
  • E-commerce shopping carts
  • Web session management

    Anti Pattern

  • Prewritten application tied to a traditional relational database: use RDS instead
  • Joins or complex transactions
  • Binary Large Object (BLOB) data: store data in S3 & metadata in DynamoDB
  • Large data with low I/O rate: use S3 instead

    DynamoDB – Provisioned Throughput

  • Table must have provisioned read and write capacity units
  • Read Capacity Units (RCU): throughput for reads
  • Write Capacity Units (WCU): throughput for writes
  • Option to setup auto-scaling of throughput to meet demand
  • Throughput can be exceeded temporarily using “burst credits”
  • If burst credits are empty, you’ll get a “ProvisionedThroughputException”
  • It’s then advised to do an exponential back-off retry

    DynamoDB – Write Capacity Units

  • One write capacity unit represents one write per second for an item up to 1 KB in size.
  • If the items are larger than 1 KB, more WCU are consumed
  • Example 1: we write 10 objects per second of 2 KB each.
  • We need 2 * 10 = 20 WCU
  • Example 2: we write 6 objects per second of 4.5 KB each
  • We need 6 * 5 = 30 WCU (4.5 gets rounded to the upper KB)
  • Example 3: we write 120 objects per minute of 2 KB each
  • We need 120 / 60 * 2 = 4 WCU

    Strongly Consistent Read vs Eventually Consistent Read

  • Eventually Consistent Read: If we read just after a write, it’s possible we’ll get unexpected response because of replication
  • Strongly Consistent Read: If we read just after a write, we will get the correct data
  • By default: DynamoDB uses Eventually Consistent Reads, but GetItem, Query & Scan provide a “ConsistentRead” parameter you can set to True

    DynamoDB – Read Capacity Units

  • One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size.
  • If the items are larger than 4 KB, more RCU are consumed
  • Example 1: 10 strongly consistent reads per second of 4 KB each
  • We need 10 * 4 KB / 4 KB = 10 RCU
  • Example 2: 16 eventually consistent reads per second of 12 KB each
  • We need (16 / 2) * ( 12 / 4 ) = 24 RCU
  • Example 3: 10 strongly consistent reads per second of 6 KB each
  • We need 10 * 8 KB / 4 = 20 RCU (we have to round up 6 KB to 8 KB); a small calculation helper follows this list
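
A tiny helper that mirrors the rounding rules used in the WCU / RCU examples above (illustrative only, not an AWS API):

```python
import math

def wcu(items_per_second: float, item_size_kb: float) -> int:
    """One WCU = one write/second for an item up to 1 KB (size rounded up to the next KB)."""
    return math.ceil(item_size_kb) * math.ceil(items_per_second)

def rcu(reads_per_second: float, item_size_kb: float, strongly_consistent: bool = True) -> int:
    """One RCU = one strongly consistent read/second (or two eventually consistent reads/second)
    for an item up to 4 KB (size rounded up to the next 4 KB multiple)."""
    units = math.ceil(item_size_kb / 4) * reads_per_second
    return math.ceil(units if strongly_consistent else units / 2)

print(wcu(10, 2))          # WCU Example 1 above: 20
print(rcu(16, 12, False))  # RCU Example 2 above (eventually consistent): 24
print(rcu(10, 6, True))    # RCU Example 3 above (6 KB rounds up to 8 KB): 20
```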

    DynamoDB - Throttling

  • If we exceed our RCU or WCU, we get ProvisionedThroughputExceededExceptions
  • Reasons:
  • Hot keys / partitions: one partition key is being read too many times (popular item for example)
  • Very large items: remember RCU and WCU depend on the size of items
  • Solutions:
  • Exponential back-off when the exception is encountered (already in the SDK)
  • Distribute partition keys as much as possible
  • If it is an RCU issue, we can use DynamoDB Accelerator (DAX)

    DynamoDB – Partitions Internal

    DynamoDB – Writing Data

  • PutItem - Write data to DynamoDB (create data or full replace)
  • Consumes WCU
  • UpdateItem – Update data in DynamoDB (partial update of attributes)
  • Possibility to use Atomic Counters and increase them
  • Conditional Writes:
  • Accept a write / update only if conditions are respected, otherwise reject
  • Helps with concurrent access to items (a write sketch follows this list)
  • No performance impact
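
A hedged boto3 sketch of an atomic counter update and a conditional write; the table, key and attribute names are hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

# Atomic counter: increment "score" without reading the item first.
dynamodb.update_item(
    TableName="users-games",                                   # hypothetical table
    Key={"user_id": {"S": "u-123"}, "game_id": {"S": "g-7"}},
    UpdateExpression="ADD score :inc",
    ExpressionAttributeValues={":inc": {"N": "1"}},
)

# Conditional write: only create the item if it does not exist yet.
try:
    dynamodb.put_item(
        TableName="users-games",
        Item={"user_id": {"S": "u-123"}, "game_id": {"S": "g-7"}, "score": {"N": "0"}},
        ConditionExpression="attribute_not_exists(user_id)",
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Item already exists, write rejected")
```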

    DynamoDB – Deleting Data

  • DeleteItem
  • Delete an individual row
  • Ability to perform a conditional delete
  • DeleteTable
  • Delete a whole table and all its items
  • Much quicker deletion than calling DeleteItem on all items

    DynamoDB – Batching Writes

  • BatchWriteItem
  • Up to 25 PutItem and / or DeleteItem in one call
  • Up to 16 MB of data written
  • Up to 400 KB of data per item
  • Batching allows you to save in latency by reducing the number of API calls done against DynamoDB
  • Operations are done in parallel for better efficiency
  • It’s possible for part of a batch to fail, in which case we have to retry the failed items (using an exponential back-off algorithm)

    DynamoDB – Reading Data

  • GetItem:
  • Read based on Primary key
  • Primary Key = HASH or HASH-RANGE
  • Eventually consistent read by default
  • Option to use strongly consistent reads (more RCU - might take longer)
  • ProjectionExpression can be specified to include only certain attributes
  • BatchGetItem:
  • Up to 100 items
  • Up to 16 MB of data
  • Items are retrieved in parallel to minimize latency

    DynamoDB – Query

  • Query returns items based on:
  • PartitionKey value (must be = operator)
  • SortKey value (=, <, <=, >, >=, Between, Begins With) – optional
  • FilterExpression to further filter results (the filter is applied after the key condition, so it does not reduce the RCU consumed)
  • Returns:
  • Up to 1 MB of data
  • Or number of items specified in Limit
  • Able to do pagination on the results
  • Can query a table, a local secondary index, or a global secondary index (a query sketch follows this list)
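
A minimal boto3 sketch of a Query with a partition-key condition, a sort-key condition and pagination; the table, key and attribute names are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

# The resource-level API lets us write key conditions without low-level attribute types.
table = boto3.resource("dynamodb").Table("users-games")  # hypothetical table

kwargs = {
    # The partition key must use equality; the sort key condition is optional.
    "KeyConditionExpression": Key("user_id").eq("u-123") & Key("game_id").begins_with("g-"),
    "ProjectionExpression": "game_id, score",  # only return these attributes
    "Limit": 50,                               # cap the number of items per call
}

items = []
while True:
    resp = table.query(**kwargs)
    items.extend(resp["Items"])
    # Pagination: keep querying while DynamoDB returns a LastEvaluatedKey.
    if "LastEvaluatedKey" not in resp:
        break
    kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

print(len(items), "items")
```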

    DynamoDB - Scan

  • Scan the entire table and then filter out data (inefficient)
  • Returns up to 1 MB of data – use pagination to keep on reading
  • Consumes a lot of RCU
  • Limit the impact using Limit, or reduce the size of the result and pause between calls
  • For faster performance, use parallel scans:
  • Multiple instances scan multiple partitions at the same time
  • Increases the throughput and RCU consumed
  • Limit the impact of parallel scans just like you would for Scans
  • Can use a ProjectionExpression + FilterExpression (no change to RCU)

    DynamoDB – LSI (Local Secondary Index)

  • Alternate range key for your table, local to the hash key
  • Up to five local secondary indexes per table.
  • The sort key consists of exactly one scalar attribute.
  • The attribute that you choose must be a scalar String, Number, or Binary
  • LSI must be defined at table creation time

    DynamoDB – GSI (Global Secondary Index)

  • To speed up queries on non-key attributes, use a Global Secondary Index
  • GSI = partition key + optional sort key
  • The index is a new “table” and we can project attributes on it
  • The partition key and sort key of the original table are always projected (KEYS_ONLY)
  • Can specify extra attributes to project (INCLUDE)
  • Can use all attributes from main table (ALL)
  • Must define RCU / WCU for the index
  • Possibility to add / modify GSI (not LSI)

    DynamoDB - DAX

  • DAX = DynamoDB Accelerator
  • Seamless cache for DynamoDB, no application re-write
  • Writes go through DAX to DynamoDB
  • Microsecond latency for cached reads & queries
  • Solves the Hot Key problem (too many reads)
  • 5 minutes TTL for cache by default
  • Up to 10 nodes in the cluster
  • Multi AZ (3 nodes minimum recommended for production)
  • Secure (Encryption at rest with KMS, VPC, IAM, CloudTrail…)

    DynamoDB Streams

  • Changes in DynamoDB (Create, Update, Delete) can end up in a DynamoDB Stream
  • This stream can be read by AWS Lambda (a handler sketch follows this list), and we can then:
  • React to changes in real time (welcome email to new users)
  • Create derivative tables / views
  • Insert into ElasticSearch
  • Could implement Cross Region Replication using Streams
  • Stream has 24 hours of data retention
  • Configurable batch size (up to 1,000 rows, 6 MB)
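
A sketch of a hypothetical Lambda handler consuming the stream and reacting to new items; it assumes the DynamoDB Stream is configured as the function’s event source and that the stream view type includes new images (the email attribute is made up):

```python
# Hypothetical Lambda handler wired to a DynamoDB Stream event source.
def lambda_handler(event, context):
    for record in event["Records"]:
        # eventName is INSERT, MODIFY or REMOVE
        if record["eventName"] == "INSERT":
            new_image = record["dynamodb"]["NewImage"]
            user_email = new_image["email"]["S"]  # hypothetical attribute
            print(f"Send welcome email to {user_email}")
    return {"batchItemFailures": []}
```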

    DynamoDB Streams Kinesis Adapter

  • Use the KCL library to directly consume from DynamoDB Streams
  • You just need to add a “Kinesis Adapter” library
  • The interface and programming is exactly the same as Kinesis Streams
  • That’s the alternative to using AWS Lambda

    DynamoDB TTL (Time to Live)

  • TTL = automatically delete an item after an expiry date / time
  • TTL is provided at no extra cost, deletions do not use WCU / RCU
  • TTL is a background task operated by the DynamoDB service itself
  • Helps reduce storage and manage the table size over time
  • Helps adhere to regulatory norms
  • TTL is enabled per row (you define a TTL column, and add a date there)
  • DynamoDB typically deletes expired items within 48 hours of expiration
  • Deleted items due to TTL are also deleted in GSI / LSI
  • DynamoDB Streams can help recover expired items

    DynamoDB – Security & Other Features

  • Security:
  • VPC Endpoints available to access DynamoDB without internet
  • Access fully controlled by IAM
  • Encryption at rest using KMS
  • Encryption in transit using SSL / TLS
  • Backup and Restore feature available
  • Point in time restore like RDS
  • No performance impact
  • Global Tables
  • Multi region, fully replicated, high performance
  • AWS Database Migration Service (DMS) can be used to migrate to DynamoDB (from Mongo, Oracle, MySQL, S3, etc…)
  • You can launch a local DynamoDB on your computer for development purposes

    DynamoDB – Storing large objects

  • Max size of an item in DynamoDB = 400 KB
  • For large objects, store them in Amazon S3 and reference them in DynamoDB

    AWS ElastiCache Overview

  • The same way RDS is to get managed Relational Databases…
  • ElastiCache is to get managed Redis or Memcached
  • Caches are in-memory databases with really high performance, low latency
  • Helps reduce the load on databases for read-intensive workloads
  • Helps make your application stateless
  • Write Scaling using sharding
  • Read Scaling using Read Replicas
  • Multi AZ with Failover Capability
  • AWS takes care of OS maintenance / patching, optimizations, setup, configuration, monitoring, failure recovery and backups

    Redis Overview

  • Redis is an in-memory key-value store
  • Super low latency (sub ms)
  • Cache survives reboots by default (it’s called persistence)
  • Great to host
  • User sessions
  • Leaderboard (for gaming)
  • Distributed states
  • Relieve pressure on databases (such as RDS)
  • Pub / Sub capability for messaging
  • Multi AZ with Automatic Failover for disaster recovery if you don’t want to lose your cache data
  • Support for Read Replicas
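
A minimal redis-py sketch against a hypothetical ElastiCache Redis endpoint, showing the session-caching and leaderboard use cases above:

```python
import redis  # standard redis-py client; ElastiCache Redis speaks the Redis protocol

# Hypothetical primary endpoint of the ElastiCache replication group.
r = redis.Redis(host="my-cache.abc123.ng.0001.euw1.cache.amazonaws.com", port=6379)

# Cache a user session for 30 minutes and read it back.
r.setex("session:user-123", 1800, "logged-in")
print(r.get("session:user-123"))

# Leaderboard using a sorted set.
r.zincrby("leaderboard", 10, "player-1")
print(r.zrevrange("leaderboard", 0, 2, withscores=True))
```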

    Memcached Overview

  • Memcached is an in-memory object store
  • Cache doesn’t survive reboots
  • Use cases:
  • Quick retrieval of objects from memory
  • Cache often accessed objects
  • Overall, Redis has largely grown in popularity and has better feature sets than Memcached.
  • I would personally only use Redis for caching needs.

Questions

  • Your big data application is taking a lot of files from your local on-premise NFS storage and inserting them into S3. As part of the data integrity verification process, the application downloads the files right after they’ve been uploaded. What will happen? The application will receive a 200, as S3 is strongly consistent for PUTs of new objects. S3 is eventually consistent only for DELETEs or overwrite PUTs.
  • You are gathering various files from providers and plan on analyzing them once every month using Athena, which must return the query results immediately. You do not want to run a high risk of losing files and want to minimise costs. Which storage type do you recommend? S3 Standard-Infrequent Access.
  • As part of your compliance as a bank, you must archive all logs created by all applications and ensure they cannot be modified or deleted for at least 7 years. Which solution should you use? Glacier with a Vault Lock Policy.
  • You are generating thumbnails in S3 from images. Images are in the images/ directory while thumbnails are in the thumbnails/ directory. After running some analytics, you realized that images are rarely read and you could optimise your costs by moving them to another S3 storage tier. What do you recommend that requires the least amount of changes? Create a Lifecycle Rule for the images/ prefix.
  • In order to perform fast big data analytics, it has been recommended by your analysts in Japan to continuously copy data from your S3 bucket in us-east-1. How do you recommend doing this at a minimal cost? Enable cross region replication.
  • Your big data application is taking a lot of files from your local on-premise NFS storage and inserting them into S3. As part of the data integrity verification process, you would like to ensure the files have been properly uploaded at minimal cost. How do you proceed? Compute the local ETag for each file and compare them with AWS S3’s ETags.
  • Your application plans to have 15,000 reads and writes per second to S3 from thousands of device ids. Which naming convention do you recommend? Prefix keys with the device id (ex - /device-id/yyyy-mm-dd/...): you get thousands of requests per second per prefix, so having many device-id prefixes lets you parallelize your reads and writes.
  • You are looking to have your files encrypted in S3 and do not want to manage the encryption yourself. You would like to have control over the encryption keys and ensure they’re securely stored in AWS. What encryption do you recommend? SSE-KMS.
  • Your website is deployed and sources its images from an S3 bucket. Everything works fine on the internet, but when you start the website locally to do some development, the images are not getting loaded. What’s the problem? S3 CORS.
  • What’s the maximum number of fields that can make a primary key in DynamoDB? Partition key + sort key, so 2.
  • What’s the maximum size of a row in DynamoDB? 400 KB.
  • You are writing items of 8 KB in size at the rate of 12 per second. What WCU do you need? 8 * 12 = 96 WCU.
  • You are doing strongly consistent reads of 10 KB items at the rate of 10 per second. What RCU do you need? 10 KB gets rounded up to 12 KB, divided by 4 KB = 3, times 10 per second = 30 RCU.
  • You are doing 12 eventually consistent reads per second, and each item has a size of 16 KB. What RCU do you need? We can do 2 eventually consistent reads per second of a 4 KB item with 1 RCU, so (12 / 2) * (16 / 4) = 24 RCU.
  • We are getting ProvisionedThroughputExceededExceptions but after checking the metrics, we see we haven’t exceeded the total RCU we had provisioned. What happened? We have a hot partition / hot key. Remember RCU and WCU are spread across all partitions.
  • You are about to enter the Christmas sale and you know a few items in your website are very popular and will be read often. Last year you had a ProvisionedThroughputExceededException. What should you do this year? Create a DAX cluster.
  • You would like to react in real-time to users de-activating their account and send them an email to try to bring them back. The best way of doing it is to: Integrate Lambda with a DynamoDB stream.
  • You would like to have DynamoDB automatically delete old data for you. What should you use? Use TTL.
  • You are looking to improve the performance of your RDS database by caching some of the most common rows and queries. Which technology do you recommend? ElastiCache.