Deploying Airflow 2 on EKS using Terraform, Helm and ArgoCD — Part 1/2

This is part 1 of 2 of this article. If you are looking for the second part, you can find it here.

There are plenty of tools available for Data Engineering tasks. Thanks to the internet, we can find several tutorials on how to install those tools and create some deployments by ourselves.

Nevertheless, integrating such tools into a complete deployment is sometimes not straightforward. The idea of this article is to show you how to deploy Apache Airflow 2.x on an AWS EKS Cluster using Helm charts and ArgoCD.

In this article I hope to help you understand how to integrate these amazing tools, how to deploy Helm charts on EKS using Terraform, and how to deploy other Helm charts using ArgoCD with GitSync.

Prerequisites

Important

If you follow this tutorial, you may be billed for the AWS resources you use.

The Stack

Since there are many tutorials on the internet about each of the tools we are going to use here, I will only briefly describe each of them:

  • Terraform: it is an open-source infrastructure as code software tool that provides a consistent CLI workflow to manage hundreds of cloud services. Terraform codifies cloud APIs into declarative configuration files.
  • AWS EKS: Amazon Elastic Kubernetes Service (Amazon EKS) is a managed service that you can use to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes. Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.
  • Helm Chart: A chart is a collection of files that describe a related set of Kubernetes resources. A single chart might be used to deploy something simple, like a memcached pod, or something complex, like a full web app stack with HTTP servers, databases, caches, and so on.
  • Apache Airflow: Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.
  • ArgoCD: Argo CD automates the deployment of the desired application states in the specified target environments. Application deployments can track updates to branches, tags, or pinned to a specific version of manifests at a Git commit.

Project Structure

Root folder structure
  • airflow: contains the DAGs we want to deploy in our Apache Airflow environment;
  • infrastructure: contains all the infrastructure configuration files for Terraform and Kubernetes.
infrastructure folder
  • kubernetes: contains all the Helm configuration we want to deploy using ArgoCD;
  • terraform: contains the Terraform configuration files to bring the cloud infra up.

Terraform Setup

Inside the terraform folder we will create two more folders: infra and applications.

  • infra: contains general infrastructure configuration like VPC, EKS and RDS
  • applications: contains terraform configuration files for Helm Provider

Since we have a lot of files to configure, I will mention only the most important parts of the code here. You can find the full project at the link below:

  • infrastructure/terraform/versions.tf

We need to set the Terraform version we are going to use and which providers we want installed.

terraform {
  required_version = ">= 0.13.1"

  required_providers {
    aws        = ">= 3.22.0"
    local      = ">= 1.4"
    random     = ">= 2.1"
    kubernetes = ">= 1.13"
  }
}
  • infrastructure/terraform/main.tf
provider "aws" {
  region  = var.aws_region
  profile = var.aws_profile
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.7.0"

  name = "data_plataform"
  cidr = "192.168.0.0/16"

  azs             = var.azs
  private_subnets = var.private_subnets
  public_subnets  = var.public_subnets

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  tags = {
    Terraform   = "true"
    Environment = "dev"
  }
}

module "eks" {
  source = "./infra/eks"

  vpc                = module.vpc
  airflowdb_host     = module.rds.rds_host
  airflowdb_username = module.rds.rds_username
  airflowdb_password = module.rds.rds_password
  airflowdb_dbname   = module.rds.rds_dbname
}

module "ec2" {
  source = "./infra/ec2"
  vpc    = module.vpc
}

module "rds" {
  source                 = "./infra/rds"
  vpc                    = module.vpc
  vpc_security_group_ids = [module.ec2.rds_security_group.id]
}

module "kubernetes-dashboard" {
  source     = "./applications/kubernetes/kubedashboard"
  eks_module = module.eks.eks_all
}

Modules:

  • vpc: creates a VPC, which we use to deploy all the resources;
  • ec2: creates a security group we need in order to enable communication between EKS and RDS Postgres;
  • rds: Airflow needs a database to handle metadata and other configurations, and RDS is a good choice for this purpose;
  • eks: contains the cluster configuration and the application deployments.
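The snippets above reference several input variables (var.aws_region, var.aws_profile, var.azs, var.private_subnets, var.public_subnets, plus var.eks_cluster_name used inside the eks module). As a sketch only, with placeholder values that you should adapt to your own account and region, a terraform.tfvars could look like this:

```hcl
# terraform.tfvars — example values only, not the project's actual settings
aws_region  = "us-east-2"
aws_profile = "default"

# Three AZs so the third private subnet (used for Fargate) exists
azs             = ["us-east-2a", "us-east-2b", "us-east-2c"]
private_subnets = ["192.168.1.0/24", "192.168.2.0/24", "192.168.3.0/24"]
public_subnets  = ["192.168.101.0/24", "192.168.102.0/24", "192.168.103.0/24"]

# Assumed to be wired through to the eks module
eks_cluster_name = "eks-dep-cluster"
```

The subnet CIDRs above were chosen to fit inside the 192.168.0.0/16 VPC CIDR declared in main.tf.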

EKS Module

Let’s talk about the EKS Cluster configuration.

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = var.eks_cluster_name
  cluster_version = "1.21"

  vpc_id          = local.vpc.vpc_id
  subnets         = [local.vpc.private_subnets[0], local.vpc.public_subnets[1]]
  fargate_subnets = [local.vpc.private_subnets[2]]

  cluster_log_retention_in_days        = 3
  cluster_endpoint_public_access       = true
  worker_additional_security_group_ids = [aws_security_group.all_worker_mgmt.id]

  worker_groups = [
    {
      name                 = "worker-group-1"
      instance_type        = "t3.small"
      asg_desired_capacity = 2
      asg_max_size         = 5
      platform             = "linux"
    }
  ]

  tags = {
    Environment = "dev"
    Name        = "eks-dep-cluster"
  }
}

resource "kubernetes_service_account" "eksadmin" {
  metadata {
    name      = "eks-admin"
    namespace = "kube-system"
  }
}

resource "kubernetes_cluster_role_binding" "eksadmin" {
  metadata {
    name = "eks-admin"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "cluster-admin"
  }

  subject {
    kind      = "ServiceAccount"
    name      = "eks-admin"
    namespace = "kube-system"
  }
}

resource "kubernetes_namespace" "airflow" {
  metadata {
    name = "airflow"
  }
}

resource "kubernetes_secret" "airflow_db_credentials" {
  metadata {
    name      = "airflow-db-auth"
    namespace = kubernetes_namespace.airflow.metadata[0].name
  }

  data = {
    "postgresql-password" = var.airflowdb_password
  }
}

#############
# Kubernetes
#############
data "aws_eks_cluster" "cluster" {
  name = module.eks.cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.cluster.token
}

locals {
  vpc = var.vpc
}

resource "aws_security_group" "all_worker_mgmt" {
  name_prefix = "all_worker_management"
  vpc_id      = local.vpc.vpc_id

  ingress {
    from_port = 22
    to_port   = 22
    protocol  = "tcp"

    cidr_blocks = [
      "10.0.0.0/8",
      "172.16.0.0/12",
      "192.168.0.0/16",
    ]
  }
}

We first need to set up the cluster. The Terraform community provides some well-implemented open-source modules to make complex deployments easier. Whether you use them depends on your environment policies and the level of customization you need.

For this tutorial we can go ahead with them, since we only need the most common configurations.

Cluster Configuration

  • cluster_name: the name of your cluster;
  • cluster_version: I am using 1.21;
  • vpc_id: the VPC we created using the VPC module;
  • subnets: the subnets we created using the VPC module;
  • cluster_log_retention_in_days: how many days you want to keep the logs;
  • cluster_endpoint_public_access: we are using true in this tutorial, but this is not recommended in production;
  • worker_additional_security_group_ids: the security group shared among the worker nodes;
  • worker_groups: the configuration of the worker nodes.

This file also sets up a security group to enable communication between the worker nodes, and a Kubernetes namespace for Airflow.

The code snippet below shows this configuration:

resource "kubernetes_namespace" "airflow" {
  metadata {
    name = "airflow"
  }
}

resource "kubernetes_secret" "airflow_db_credentials" {
  metadata {
    name      = "airflow-db-auth"
    namespace = kubernetes_namespace.airflow.metadata[0].name
  }

  data = {
    "postgresql-password" = var.airflowdb_password
  }
}

Also, a Kubernetes Secret is created to handle sensitive database connection settings.
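For this to work, the eks module has to declare the database variables it receives in the module call shown earlier. The sketch below uses the exact variable names from main.tf; marking the password as sensitive is my own addition (it keeps the value out of plan/apply output and needs Terraform 0.14+):

```hcl
# infrastructure/terraform/infra/eks/variables.tf — a possible sketch
variable "vpc" {
  description = "The whole vpc module output, passed in from the root module"
}

variable "airflowdb_host" {
  type = string
}

variable "airflowdb_username" {
  type = string
}

variable "airflowdb_password" {
  type      = string
  sensitive = true # my addition; requires Terraform >= 0.14
}

variable "airflowdb_dbname" {
  type = string
}
```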

terraform folder structure

Now it’s time to deploy an application into our Kubernetes Cluster.

As you can see in the picture above, we have the applications folder. Let’s take a look into it.

You will find a folder called kubernetes with the following content:

Applications to be deployed on Kubernetes

Both folders contain terraform files, and we are going to use the Helm provider in order to deploy the applications we want.

Let’s take argocd as an example:

locals {
  eks = var.eks_module
}

data "aws_eks_cluster" "cluster" {
  name = local.eks.cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = local.eks.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
  token                  = data.aws_eks_cluster_auth.cluster.token
}

resource "kubernetes_namespace" "argocd" {
  metadata {
    name = "argocd"
  }
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.cluster.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)

    exec {
      api_version = "client.authentication.k8s.io/v1alpha1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.cluster.name]
    }
  }
}

resource "helm_release" "argocd" {
  name       = "argocd"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  namespace  = kubernetes_namespace.argocd.metadata.0.name
}
  • We create a kubernetes_namespace called argocd and use it to deploy ArgoCD with Helm;
  • We configure the Helm provider and set which cluster we are going to use;
  • Last but not least, we create a “helm_release” resource and set the Helm repository (https://argoproj.github.io/argo-helm) from which we want to deploy the ArgoCD application;
  • There are more detailed configurations that could be changed at this point, but we are going with the default installation for ArgoCD.

As a reminder, the two application folders are:
  • argocd: the application we will use to deploy other projects into our Kubernetes Cluster;
  • kubedashboard: an application that helps us visualize what is happening in our cluster.
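If you later want to customize the installation instead of going with the defaults, the helm_release resource accepts inline value overrides. A hedged sketch: server.service.type is a real value of the argo-cd chart, but exposing the server through a LoadBalancer (and the pinned chart version) are only illustrations, not part of this tutorial:

```hcl
resource "helm_release" "argocd" {
  name       = "argocd"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  namespace  = kubernetes_namespace.argocd.metadata.0.name

  # Example pin so re-applies are reproducible; check the argo-helm
  # repository for the versions actually available.
  version = "3.26.0"

  # Override a single chart value. Exposing the server through a
  # LoadBalancer is only an illustration of the mechanism.
  set {
    name  = "server.service.type"
    value = "LoadBalancer"
  }
}
```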

Deploy Terraform

In order to deploy the terraform project, run the commands below inside your terraform folder (airflow-kubernetes-iac/infrastructure/terraform):

  1. terraform init
  2. terraform validate
  3. terraform apply -var-file terraform.tfvars

After some time, you should see a message like the one below:

This is how your EKS cluster should look after the deployment:

How to access the Kubernetes Dashboard

  1. First you need to configure your aws cli to access the EKS cluster
aws eks update-kubeconfig \
  --region us-east-2 \
  --name <cluster_name>

2. Get a temporary token to access the Kubernetes Dashboard:

kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep eks-admin | awk '{print $1}')

3. Open another terminal and run the following script:

kubectl proxy

Leave this terminal running. Don’t close it; otherwise, you will interrupt the connection between your local machine and the EKS cluster.

kubectl proxy configuration

4. Open the Kubernetes Dashboard on your web browser:

http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:443/proxy/#/login

5. Type the token you got in Step 2 into the “Enter token *” field

6. Click on Sign In

7. Great! Now you can see what is going on in your Kubernetes cluster running on EKS!

8. In the combobox you will see the “default” namespace selected. Change it to “argocd”. Now you can see the ArgoCD deployment we created with Terraform, just like the Dashboard itself.

How to access ArgoCD UI

  1. Open the following URL in your web browser:

http://localhost:8001/api/v1/namespaces/argocd/services/https:argocd-server:443/proxy/

ArgoCD Login Page

2. Username: admin

3. Password:

Run the following command in your terminal to get the password:

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

4. Click on Sign In

5. Great! You have just accessed the ArgoCD UI, which we deployed as IaC with Terraform!

Summary

Great! Now you know how to start deploying EKS and Helm charts using Terraform. The same approach can be applied to any other Helm chart you want to deploy.

There are a lot of Helm charts available, and you can find some of them on this website:

In part 2/2 of this article we are going to configure ArgoCD to access our repository and deploy Apache Airflow 2.x using its Helm chart. With GitOps, any change we commit to our repo will automatically trigger a deployment.

Attention

If you leave this infrastructure up and running, you will be billed for the time it is kept up. So please make sure you destroy everything after your study:

terraform destroy -var-file terraform.tfvars

Vitor Carra
I’m a Data Engineer and guitar player.

