Leveraging SQS and Kinesis Stream to build data-pipeline at scale

Fajri Abdillah
Solutions Architect, Serverless Indonesia, PT. Myedisi Interaktif Media
fajri@serverless.id

Pipe

Data Pipeline

A process that moves data from a source to a destination and can optionally transform the data along the way.

Data Producer

A system responsible for generating data and sending it to the pipeline.

Data Consumer

A system responsible for retrieving data from the pipeline and processing it.
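The three roles defined so far (pipeline, producer, consumer) can be sketched in memory with the Python standard library. This is only an illustration: `queue.Queue` stands in for a managed service such as SQS or Kinesis, and the names `producer`, `consumer`, `run_pipeline`, and `SENTINEL` are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # hypothetical marker that signals the end of the data


def producer(pipe, records):
    """Data producer: generates records and sends them to the pipeline."""
    for record in records:
        pipe.put(record)
    pipe.put(SENTINEL)


def consumer(pipe, transform, sink):
    """Data consumer: retrieves records from the pipeline and processes them."""
    while True:
        record = pipe.get()
        if record is SENTINEL:
            break
        sink.append(transform(record))  # transform the data along the way


def run_pipeline(records, transform):
    """Move records from a source list to a destination list through a queue."""
    pipe = queue.Queue()  # in-memory stand-in for the pipeline
    sink = []
    worker = threading.Thread(target=consumer, args=(pipe, transform, sink))
    worker.start()
    producer(pipe, records)
    worker.join()
    return sink
```

For example, `run_pipeline(["a", "b"], str.upper)` moves two records through the queue and upper-cases each one on the consumer side.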

SQS - Definition

Amazon Simple Queue Service (Amazon SQS) offers a secure, durable, and available hosted queue that lets you integrate and decouple distributed software systems and components.

SQS - Diagram

SQS - Power Features

  1. Serverless - No servers to manage; battle-tested since 2006
  2. Unlimited Throughput - Standard queues support a nearly unlimited number of transactions per second (TPS) per API action
  3. Security - Control who can send messages to and receive messages from an Amazon SQS queue
  4. Durability - Messages are retained for up to 14 days, and a dead-letter queue captures messages that repeatedly fail processing, so failed messages are not lost
  5. Event Source - Lambda can process messages from SQS

SQS - Limitations

  1. Single Consumer - Once a consumer has processed and deleted a message, it is gone; replaying messages is not possible
  2. Tricky Pricing - Billed per request, with each 64 KB chunk of payload counted as one request, but still cheap
  3. Limited Payload Size - 256 KB; larger payloads (up to 2 GB) can be stored in S3, at extra cost
  4. Delay Queue - Maximum delay of 15 minutes
  5. Maximum Batch Size - At most 10 messages or 256 KB in total per batch
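The batch limit above (10 messages or 256 KB) means a producer has to group messages before sending them. A minimal sketch of that grouping follows, assuming only the message body counts toward the size limit (in practice message attributes count as well). The function name `batch_messages` is hypothetical, and the actual send call is left out.

```python
MAX_BATCH_MESSAGES = 10          # limit from the list above
MAX_BATCH_BYTES = 256 * 1024     # limit from the list above


def batch_messages(bodies):
    """Group message bodies into batches that respect both batch limits."""
    batches, current, current_bytes = [], [], 0
    for body in bodies:
        size = len(body.encode("utf-8"))
        if size > MAX_BATCH_BYTES:
            # A single message this large has to go through S3 instead.
            raise ValueError("message body exceeds 256 KB")
        batch_full = len(current) == MAX_BATCH_MESSAGES
        too_big = current_bytes + size > MAX_BATCH_BYTES
        if batch_full or too_big:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(body)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch could then be passed to one batch send request.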

SQS - Pricing

Price per 1 Million Requests after Free Tier (Monthly)
Standard Queue $0.40 ($0.00000040 per request)
FIFO Queue $0.50 ($0.00000050 per request)
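To make the per-request pricing and the 64 KB chunking concrete, here is a rough estimate in Python. It uses the prices listed above, ignores the free tier, and assumes every API call carries the same payload size. The function names are hypothetical.

```python
import math

# Prices per request, taken from the table above (after the free tier).
PRICE_PER_REQUEST = {
    "standard": 0.40 / 1_000_000,
    "fifo": 0.50 / 1_000_000,
}
CHUNK_BYTES = 64 * 1024  # each 64 KB chunk of payload is billed as one request


def billed_requests(payload_bytes):
    """Number of billed requests for one API call with the given payload."""
    return max(1, math.ceil(payload_bytes / CHUNK_BYTES))


def sqs_monthly_cost(api_calls, payload_bytes, queue_type="standard"):
    """Estimated monthly cost in USD for api_calls calls of equal payload size."""
    return api_calls * billed_requests(payload_bytes) * PRICE_PER_REQUEST[queue_type]
```

Keep in mind that sending, receiving, and deleting a message are separate API calls, so one message usually costs more than one request end to end.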

SQS - Is it good?

SQS - Real World Implementation

Kinesis Stream - Definition

Amazon Kinesis Data Streams enables you to build custom applications that process or analyze streaming data for specialized needs.

Kinesis Stream - Diagram

Kinesis Stream - Power Features

  1. Fast & Real-time - Literally real-time, as the Disney case shows
  2. Serverless - No servers to manage, and it scales
  3. Multiple Consumers - One message can be consumed by multiple consumers
  4. Ordered & Replayable Messages - Messages are not deleted after a consumer processes them
  5. Data Retention - 24 hours by default, up to 7 days at extra cost

Kinesis Stream - Limitations

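  1. Shard Management - This is the hardest part
  2. Limited Read Throughput - Read throughput is capped per shard, so calculate capacity per shard
  3. Maintaining State - Requires a combination of KCL and DynamoDB
  4. Determining the Number of Shards - Take the larger of the write and read requirements (bandwidths in KiB per second; each shard supports 1 MiB/s of writes and 2 MiB/s of reads):

       number_of_shards = max(incoming_write_bandwidth_in_KiB/1024, outgoing_read_bandwidth_in_KiB/2048)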
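The shard formula can be turned into a small helper. The sketch below assumes the incoming bandwidth is the average record size times the records per second, and the outgoing bandwidth is the incoming bandwidth times the number of consumers. The parameter names are hypothetical, and the result is rounded up because shards come in whole numbers.

```python
import math


def number_of_shards(avg_record_kib, records_per_second, consumers):
    """Estimate the shard count from write and read bandwidth in KiB/s."""
    incoming_write_bandwidth_in_kib = avg_record_kib * records_per_second
    outgoing_read_bandwidth_in_kib = incoming_write_bandwidth_in_kib * consumers
    required = max(
        incoming_write_bandwidth_in_kib / 1024,   # 1 MiB/s write per shard
        outgoing_read_bandwidth_in_kib / 2048,    # 2 MiB/s read per shard
    )
    # A stream needs at least one shard, and shards are whole units.
    return max(1, math.ceil(required))
```

With 2 KiB records at 1,000 records per second, one consumer needs 2 shards (write-bound), while four consumers need 4 shards (read-bound).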
						

Kinesis Stream - Pricing

Pricing (Singapore)
Shard Hour (1MB/second ingress, 2MB/second egress) $0.0184 / hour
PUT Payload Units, per 1,000,000 units $0.0195
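A rough monthly estimate from these two prices can be sketched as follows. The code assumes a PUT payload unit is a 25 KB chunk of a record (the AWS pricing definition, not stated on the slide), a month of 730 hours, and records of uniform size. The function names are hypothetical.

```python
import math

SHARD_HOUR_USD = 0.0184              # Singapore price from the table above
PUT_UNITS_PER_MILLION_USD = 0.0195   # Singapore price from the table above
PUT_UNIT_BYTES = 25 * 1024           # assumption: one PUT payload unit = 25 KB


def put_payload_units(record_bytes):
    """Number of PUT payload units billed for one record."""
    return max(1, math.ceil(record_bytes / PUT_UNIT_BYTES))


def kinesis_monthly_cost(shards, records_per_month, record_bytes, hours=730):
    """Estimated monthly cost in USD: shard hours plus PUT payload units."""
    shard_cost = shards * hours * SHARD_HOUR_USD
    units = records_per_month * put_payload_units(record_bytes)
    put_cost = units / 1_000_000 * PUT_UNITS_PER_MILLION_USD
    return shard_cost + put_cost
```

The shard-hour term is paid whether or not data flows, so an idle stream still costs money, unlike an idle SQS queue.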

Kinesis Stream - Pricing - Optional

Pricing (Singapore)
Extended Data Retention (Up to 7 days), per Shard Hour $0.025
Enhanced fan-out data retrievals, per GB $0.0159
Enhanced fan-out, per consumer-shard hour $0.0184

Kinesis Stream - Is it good?

Kinesis Stream - Real World Implementation

Demo

SQS or Kinesis Stream?

Do you need multiple consumers or just one?

Do you need ordered messages with high throughput?

Do you need to replay your data in the consumer?

SQS or Kinesis Stream?

Do you need to track the ack/fail status of individual messages?

Do you need to delay messages individually?

Do you need dynamically increasing concurrency/throughput at read time?
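The two sets of questions can be read as a checklist. The sketch below encodes one interpretation: the first three questions (multiple consumers, ordering at high throughput, replay) point toward Kinesis Stream, and the last three (per-message ack/fail, per-message delay, dynamic read concurrency) point toward SQS. That mapping, the key names, and the simple vote count are assumptions made for illustration, not a rule from the talk.

```python
# Hypothetical keys, one per question on the two slides above.
KINESIS_QUESTIONS = ("multiple_consumers", "ordered_high_throughput", "replay")
SQS_QUESTIONS = ("per_message_ack", "per_message_delay", "dynamic_read_concurrency")


def suggest_service(needs):
    """Count the 'yes' answers on each side and suggest the stronger match."""
    kinesis_votes = sum(1 for q in KINESIS_QUESTIONS if needs.get(q))
    sqs_votes = sum(1 for q in SQS_QUESTIONS if needs.get(q))
    if kinesis_votes > sqs_votes:
        return "Kinesis Stream"
    if sqs_votes > kinesis_votes:
        return "SQS"
    return "no clear winner"
```

For example, `suggest_service({"replay": True, "multiple_consumers": True})` returns `"Kinesis Stream"`. A tie means the requirements need a closer look, or both services used together.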

Thank You

Join our Facebook Group : Serverless Indonesia
Our Medium Publications : Serverless Indonesia
Our u(n)pdated Blog : blog.serverless.id