Builder’s Diary Vol. 2: Serverless ETL with Airflow and Athena

by admin
June 30, 2025
in AWS


Get insights into the day-to-day challenges of builders. In this issue, Peter Reitz from our partner tecRacer talks about how to build serverless ETL (extract, transform, and load) pipelines with the help of Amazon Managed Workflows for Apache Airflow (MWAA) and Amazon Athena.

Builder's Diary

If you prefer a video or podcast instead of reading, here you go.

Do you prefer listening to a podcast episode over reading a blog post? Here you go!

What sparked your interest in cloud computing?

Computers have always held a great fascination for me. I taught myself how to program. That's how I ended up working as a web developer during my economics studies. When I first stumbled upon Amazon Web Services, I was intrigued by the technology and the wide variety of services.

How did you get into the role of a cloud consultant?

After completing my economics degree, I was looking for a job. By chance, a job ad drew my attention to a vacancy for a cloud consultant at tecRacer. To be honest, my skills didn't match the requirements very well. But because I found the topic exciting, I applied anyway. Right from the job interview, I felt right at home at tecRacer in Duisburg. Since I had no experience with AWS, there was a lot to learn during the first months. My first goal was to achieve the AWS Certified Solutions Architect – Associate certification. The entire team supported and motivated me during this intensive learning phase. After that, I joined a small team working on a project for one of our consulting clients. This allowed me to gain practical experience at a very early stage.

What does your day-to-day work as a cloud consultant at tecRacer look like?

As a cloud consultant, I work on projects for our clients. I focus on machine learning and data analytics. Since tecRacer has a 100% focus on AWS, I invest in my knowledge of related AWS services like S3, Athena, EMR, SageMaker, and more. I work remotely or at our office in Hamburg and am on site at the customer's from time to time, for example, to analyze the requirements for a project in workshops.

What project are you currently working on?

I'm currently working on building an ETL pipeline. Several data providers upload CSV files to an S3 bucket. My client's challenge is to extract and transform 3 billion data points and store them in a way that allows efficient data analytics. This process can be roughly described as follows.

  1. Fetch CSV files from S3.
  2. Parse CSV files.
  3. Filter, transform, and enrich columns.
  4. Partition data to enable efficient queries in the future.
  5. Transform to a file format optimized for data analytics.
  6. Upload data to S3.
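The steps above can be sketched in a few lines of plain Python. This is a minimal local illustration only; the column names and the partition scheme are made up for the example, and the real pipeline reads from and writes to S3.

```python
import csv
import io
from collections import defaultdict

# Step 2: parse CSV input (an in-memory sample stands in for the S3 download).
raw = io.StringIO(
    "year,month,day,product_code,estimated_revenue\n"
    "2022,08,01,prod-a,10.5\n"
    "2022,08,01,prod-b,0.0\n"
    "2022,09,01,prod-a,7.25\n"
)
rows = list(csv.DictReader(raw))

# Step 3: filter, transform, and enrich columns.
transformed = [
    {"product_code": r["product_code"],
     "revenue": float(r["estimated_revenue"]),
     "date": f'{r["year"]}-{r["month"]}-01'}
    for r in rows
    if float(r["estimated_revenue"]) > 0
]

# Step 4: group records by partition key so each partition
# can be written out (steps 5 and 6) as its own object.
partitions = defaultdict(list)
for r in transformed:
    partitions[r["date"]].append(r)

print(sorted(partitions))  # ['2022-08-01', '2022-09-01']
```

In the actual project, steps 2 through 6 are pushed down into Athena rather than run in Python, as described below in the interview.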

I have implemented similar data pipelines in the past. My preferred solution consists of the following building blocks:

  • S3 storing the input data.
  • Apache Airflow to orchestrate the ETL pipeline.
  • Athena to extract, transform, and load the data.
  • S3 storing the output data.

Architecture of a Serverless ETL pipeline on AWS: Airflow, Athena, and S3

How do you build an ETL pipeline based on Athena?

Amazon Athena allows me to query data stored on S3 on demand using SQL. The remarkable thing about Athena is that the service is serverless, which means we only have to pay for the processed data when running a query. There are no idle costs except the S3 storage costs.

As mentioned before, the challenge in my current project is to extract data from CSV files and store the data in a way that is optimized for data analytics. My approach is transforming the CSV files into more efficient formats such as Parquet. The Parquet file format is designed for efficient data analysis and organizes data in columns, not in rows as CSV does. Therefore, Athena skips fetching and processing all other columns when querying only a subset of the available columns. Also, Parquet compresses the data to minimize storage and network consumption.
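The difference between the two layouts can be illustrated with a toy model in Python (this is not the actual Parquet encoding, just the idea): in a row layout, answering a single-column question still walks every full record, while a columnar layout keeps each column together so the query touches only one array.

```python
# Row-oriented layout, as in CSV: one full record per entry.
rows = [
    {"product_code": "prod-a", "revenue": 10.5, "date": "2022-08-01"},
    {"product_code": "prod-b", "revenue": 3.0, "date": "2022-08-01"},
]

# Column-oriented layout, as in Parquet: one array per column.
columns = {
    "product_code": ["prod-a", "prod-b"],
    "revenue": [10.5, 3.0],
    "date": ["2022-08-01", "2022-08-01"],
}

# Summing revenue over rows visits every record in full ...
total_rows = sum(r["revenue"] for r in rows)
# ... while the columnar layout reads just the revenue array.
total_columns = sum(columns["revenue"])

print(total_rows, total_columns)  # 13.5 13.5
```

On disk this translates into fewer bytes fetched from S3, which is exactly what Athena bills for.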

I like using Athena for ETL jobs because of its simplicity and pay-per-use pricing model. The CREATE TABLE AS SELECT (CTAS) statement implements ETL as described in the following:

  1. Extract: Load data from CSV files stored on S3 (SELECT FROM "awsmp"."cas_daily_business_usage_by_instance_type").
  2. Transform: Filter and enrich columns (SELECT product_code, SUM(estimated_revenue) AS revenue, concat(year, '-', month, '-01') AS date).
  3. Load: Store results in Parquet file format on S3 (CREATE TABLE monthly_recurring_revenue).
CREATE TABLE monthly_recurring_revenue WITH (
  format = 'Parquet',
  external_location = 's3://demo-datalake/monthly_recurring_revenue/',
  partitioned_by = ARRAY['date']
) AS
SELECT product_code, SUM(estimated_revenue) AS revenue, concat(year, '-', month, '-01') AS date
FROM (
  SELECT year, month, day, product_code, estimated_revenue
  FROM "awsmp"."cas_daily_business_usage_by_instance_type"
  ORDER BY year, month, day
)
GROUP BY year, month, product_code
ORDER BY year, month, product_code

Besides converting the data into the Parquet file format, the statement also partitions the data. This means the keys of the objects start with something like date=2022-08-01, which allows Athena to fetch only the relevant files from S3 when querying by date.
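With that layout in place, a query that filters on the partition column scans only the objects under the matching date= prefix instead of the whole table. A hypothetical example against the table created above:

```sql
SELECT product_code
FROM monthly_recurring_revenue
WHERE date = '2022-08-01'
```

Since Athena bills per byte scanned, partition pruning like this reduces both query latency and cost.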

Why did you choose Athena instead of Amazon EMR to build an ETL pipeline?

I have used EMR for some projects in the past. But nowadays, I prefer adding Athena to the mix wherever feasible. That's because, compared to EMR, Athena is a lightweight solution. Using Athena is less complex than running jobs on EMR. For example, I prefer using SQL to transform data instead of writing Python code. It takes me less time to build an ETL pipeline with Athena compared to EMR.

Also, accessing Athena is much more convenient, as all functionality is available via the AWS Management Console and API. In contrast, it requires a VPN connection to interact efficiently with EMR when developing a pipeline.

I prefer a serverless solution due to its cost implications. With Athena, our customer only pays for the processed data. There are no idling costs. For example, I migrated a workload from EMR to Athena, which decreased costs from $3,000 to $100.
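The arithmetic behind that kind of saving is simple. Assuming Athena's list price of roughly $5 per TB scanned (the exact rate varies by region and may change), the bill is just data scanned times price; the numbers below are illustrative, not the project's actual figures.

```python
PRICE_PER_TB_USD = 5.0  # assumed Athena list price per TB scanned

def athena_query_cost(tb_scanned: float) -> float:
    """Rough Athena cost: you pay only for the data your queries scan."""
    return tb_scanned * PRICE_PER_TB_USD

# Scanning 20 TB over a month:
monthly = athena_query_cost(20)
print(monthly)  # 100.0
```

Compare that to an EMR cluster, which accrues instance costs for every hour it runs, whether or not a job is executing.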

What is Apache Airflow?

Apache Airflow is a popular open-source project providing a workflow management platform for data engineering pipelines. As a data engineer, I describe an ETL pipeline in Python as a directed acyclic graph (DAG).

Here is a simple directed acyclic graph (DAG). The workflow consists of two steps:

  1. Creating an Athena query.
  2. Awaiting results from the Athena query.
from airflow.models import DAG
from airflow.providers.amazon.aws.operators.athena import AWSAthenaOperator
from airflow.providers.amazon.aws.sensors.athena import AthenaSensor

with DAG(dag_id='demo') as dag:

    read_table = AWSAthenaOperator(
        task_id='read_table',
        query='SELECT * FROM "sampledb"."elb_logs" limit 10;',
        output_location='s3://aws-athena-query-results-486555357186-eu-west-1/airflow/',
        database='sampledb'
    )

    await_query = AthenaSensor(
        task_id='await_query',
        query_execution_id=read_table.output,
    )

Airflow lets you run a DAG manually, via an API, or based on a schedule.

Airflow in Action

Airflow consists of several components:

  • Scheduler
  • Worker
  • Web Server
  • PostgreSQL database
  • Redis in-memory database

Operating such a distributed system is complex. Luckily, AWS provides a managed service called Amazon Managed Workflows for Apache Airflow (MWAA), which we use in my current project.

What does your development workflow for Airflow DAGs look like?

We built a deployment pipeline for the project I'm currently involved in. So you can think of developing the ETL pipeline like any other software delivery process.

  1. The engineer pushes changes to DAGs to a Git repository.
  2. The deployment pipeline validates the Python code.
  3. The deployment pipeline spins up a container based on aws-mwaa-local-runner and verifies whether all dependencies are working as expected.
  4. The deployment pipeline runs an integration test.
  5. The deployment pipeline uploads the DAGs to S3.
  6. Airflow refreshes the DAGs.

The deployment pipeline significantly speeds up the ETL pipeline's development process, as many issues are noticed before deploying to AWS.
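Step 2 of that pipeline, validating the Python code, can be as simple as byte-compiling every DAG file before upload. A minimal sketch (the flat dags/ directory layout and file naming are assumptions for the example):

```python
import pathlib
import py_compile
import tempfile

def validate_dags(dag_dir: str) -> list[str]:
    """Byte-compile each DAG file in dag_dir; return the ones that fail."""
    failures = []
    for path in sorted(pathlib.Path(dag_dir).glob("*.py")):
        try:
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as err:
            failures.append(f"{path.name}: {err.msg}")
    return failures

# Demo: one valid and one broken DAG file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "good_dag.py").write_text("x = 1\n")
    pathlib.Path(tmp, "bad_dag.py").write_text("def broken(:\n")
    errors = validate_dags(tmp)

print(len(errors))  # 1
```

A real pipeline would go further, for example by importing each DAG inside aws-mwaa-local-runner to catch missing dependencies, but even this cheap syntax check rejects broken DAGs before they ever reach S3.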

Why do you use Airflow instead of AWS Step Functions?

In general, Airflow is similar to AWS Step Functions. However, there are two important differences.

First, Airflow is a popular choice for building ETL pipelines. Therefore, many engineers in the field of data analytics have already gained experience with the tool. Additionally, the open-source community creates many integrations that help build ETL pipelines.

Second, unlike Step Functions, Airflow is not only available on AWS. Being able to move ETL pipelines to another cloud vendor or on-premises is a plus.

Why do you focus on machine learning and data analytics?

I enjoy working with data. Being able to answer questions by analyzing huge amounts of data, and enabling better decisions backed by data, motivates me. Also, I'm a big fan of Athena. It's one of the most powerful services offered by AWS. On top of that, machine learning in general, and reinforcement learning in particular, fascinates me, as it allows us to recognize correlations that weren't visible before.

Would you like to join Peter's team to implement solutions with the help of machine learning and data analytics? tecRacer is hiring Cloud Consultants specializing in machine learning and data analytics. Apply now!

Tags: Airflow, Athena, Builders Diary, ETL, Serverless

© 2025- https://multicloud365.com/ - All Rights Reserved
