Notebook

AWS DAS Service integration

AWS Services Integration

  • IoT topic, IoT rules, IoT destinations -> Kinesis, DynamoDB, SQS, S3, Lambda…
  • Kinesis Data Streams
      • Producers: SDK, KPL, Kinesis Agent, Spark, Kafka Connect
      • Consumers: Spark, Firehose, Lambda, KCL, SDK, Kinesis Connector Library
  • Kinesis Data Firehose (delivers data; Lambda for transformation)
      • Producers: KPL, Kinesis Agent, CloudWatch Logs, IoT rules, Kinesis Data Streams
      • Consumers: S3, Redshift, Elasticsearch, Splunk
  • Kinesis Data Analytics (transform in Lambda)
      • Producers: Kinesis Data Streams, Firehose, reference data (CSV) in S3
      • Consumers: Kinesis Data Streams, Firehose, Lambda
  • SQS
      • Producers: SDK, IoT, S3
      • Consumers: SDK, Lambda
  • S3
      • Producers: a lot
      • Consumers: SQS, SNS, Lambda
  • DynamoDB (transform with AWS Data Pipeline)
      • Producers: SDK, DMS
      • Consumers: Glue, EMR (Hive), DynamoDB Streams (then Lambda, KCL)
  • Glue
      • Producers: DynamoDB, S3, JDBC
      • Consumers: Redshift, Athena, EMR + Hive
  • EMR (Hadoop, Hive, Spark, Presto, Jupyter, Flink)
      • Producers: DynamoDB, Apache Ranger, S3/EMRFS, Glue
  • Amazon Machine Learning (deprecated)
      • Producers: S3, Redshift
  • Amazon SageMaker: fancy machine learning
      • Producers: S3
      • Consumers: results in notebook
  • AWS Data Pipeline
      • Integrates S3, EMR, JDBC, DynamoDB
  • Amazon Elasticsearch Service
      • Producers: Firehose, IoT Core, CloudWatch Logs
  • Athena
      • Producers: S3, Glue
      • Consumers: S3, QuickSight
  • Redshift
      • Producers: S3
      • Consumers: QuickSight, PostgreSQL (dblink)
  • QuickSight
      • Producers: Redshift, Aurora, JDBC, Athena, S3
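
The producer/consumer pairs above can be read as a small directed graph, which makes it easy to check how data can get from one service to another. A minimal sketch in Python, covering only a subset of the services in these notes; the names `CONSUMERS` and `shortest_path` are made up for illustration and are not part of any AWS SDK.

```python
from collections import deque

# Subset of the notes: service -> services that can consume from it.
# This is a hand-written map for study purposes, not an AWS API.
CONSUMERS = {
    "IoT rules": ["Kinesis Data Streams", "Kinesis Data Firehose",
                  "SQS", "S3", "Lambda", "DynamoDB"],
    "Kinesis Data Streams": ["Kinesis Data Firehose",
                             "Kinesis Data Analytics", "Lambda"],
    "Kinesis Data Firehose": ["S3", "Redshift", "Elasticsearch", "Splunk"],
    "Kinesis Data Analytics": ["Kinesis Data Streams",
                               "Kinesis Data Firehose", "Lambda"],
    "S3": ["SQS", "SNS", "Lambda", "Athena", "Redshift"],
    "Athena": ["S3", "QuickSight"],
    "Redshift": ["QuickSight"],
}


def shortest_path(source, target):
    """Breadth-first search over CONSUMERS; returns a list of hops or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in CONSUMERS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

For example, `shortest_path("IoT rules", "QuickSight")` walks IoT rules, Firehose, Redshift, QuickSight, while a service with no listed consumers (such as Splunk) yields `None` for any other target.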

    AWS Instance Types

  • General Purpose: T2, T3, M4, M5
  • Compute Optimized: C4, C5
      • Batch processing, Distributed analytics, Machine / Deep Learning Inference
  • Memory Optimized: R4, R5, X1, Z1d
      • High performance database, In memory database, Real time big data analytics
  • Accelerated Computing: P2, P3, G3, F1
      • GPU instances, Machine or Deep Learning, High Performance Computing
  • Storage Optimized: H1, I3, D2
      • Distributed File System (HDFS), NFS, MapReduce, Apache Kafka, Redshift
  • E.g. Spark: memory optimized; machine learning: accelerated computing; batch: compute optimized
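
The rules of thumb above amount to a lookup from workload to instance family. A rough sketch, assuming a hand-picked set of workload keywords; `FAMILIES`, `WORKLOAD_FAMILY`, and `suggest_instance_types` are hypothetical names, and the fallback to general purpose for unknown workloads is my assumption, not something the notes state.

```python
# Instance families and example types, copied from the list above.
FAMILIES = {
    "general purpose": ["T2", "T3", "M4", "M5"],
    "compute optimized": ["C4", "C5"],
    "memory optimized": ["R4", "R5", "X1", "Z1d"],
    "accelerated computing": ["P2", "P3", "G3", "F1"],
    "storage optimized": ["H1", "I3", "D2"],
}

# Workload keyword -> family, following the notes (illustrative, not exhaustive).
WORKLOAD_FAMILY = {
    "batch processing": "compute optimized",
    "spark": "memory optimized",
    "in-memory database": "memory optimized",
    "machine learning": "accelerated computing",
    "hdfs": "storage optimized",
    "kafka": "storage optimized",
}


def suggest_instance_types(workload):
    """Return (family, example types) for a workload keyword.

    Unknown workloads fall back to general purpose (an assumption).
    """
    family = WORKLOAD_FAMILY.get(workload.lower(), "general purpose")
    return family, FAMILIES[family]
```

Calling `suggest_instance_types("Spark")` returns the memory optimized family with its example types.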

    EC2 in Big Data

  • On demand, Spot & Reserved instances:
      • Spot: for workloads that can tolerate loss; low cost => use checkpointing (ML, etc.)
      • Reserved: long running clusters, databases (over a year)
      • On demand: remaining workloads
  • Auto Scaling:
      • Leverage for EMR, etc.
      • Automated for DynamoDB, Auto Scaling Groups, etc…
  • EC2 is behind EMR
      • Master nodes
      • Core nodes (contain data) + Task nodes (do not contain data)
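
The Spot / Reserved / On demand guidance above can be written as a tiny decision function. This is only a sketch of the rule of thumb in these notes; the function name, its parameters, and the order of the checks (interruption tolerance first) are assumptions made for illustration.

```python
def purchase_option(tolerates_interruption, expected_runtime_days):
    """Pick an EC2 purchasing option following the notes' rule of thumb."""
    # Spot: cheap, but instances can be reclaimed, so the workload
    # must tolerate loss (for example by checkpointing).
    if tolerates_interruption:
        return "spot"
    # Reserved: long running clusters and databases (over a year).
    if expected_runtime_days > 365:
        return "reserved"
    # On demand: everything else.
    return "on-demand"
```

For instance, a checkpointed ML training job maps to `"spot"`, a database expected to run for two years maps to `"reserved"`, and a one-week ad hoc cluster maps to `"on-demand"`.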