<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[~/Blog/Nikhil]]></title><description><![CDATA[~/Blog/Nikhil]]></description><link>https://nikhilmishra.xyz</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1740581872593/84b7710b-23bc-429f-bfff-58dd82c86b01.png</url><title>~/Blog/Nikhil</title><link>https://nikhilmishra.xyz</link></image><generator>RSS for Node</generator><lastBuildDate>Sat, 11 Apr 2026 16:41:20 GMT</lastBuildDate><atom:link href="https://nikhilmishra.xyz/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Code, Vibes and Nostalgia]]></title><description><![CDATA[My Journey with Vibe Coding 🚀
My journey with vibe coding started when I built aushadiai.nikhilmishra.live at a hackathon, and I got hooked on it.
Next, I wanted to build games.
So I recreated two games I used to play on a Nokia handset as a child:

...]]></description><link>https://nikhilmishra.xyz/vibecode</link><guid isPermaLink="true">https://nikhilmishra.xyz/vibecode</guid><category><![CDATA[Game Development]]></category><category><![CDATA[vibe coding]]></category><category><![CDATA[Games]]></category><category><![CDATA[development]]></category><category><![CDATA[windsurf]]></category><category><![CDATA[cursor]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Fri, 09 May 2025 09:30:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746782832114/34f71fb2-7791-4bc2-9c5b-2be9c5d015ea.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-my-journey-with-vibe-coding">My Journey with Vibe Coding 🚀</h2>
<p>My journey with vibe coding started when I built <a target="_blank" href="https://aushadiai.nikhilmishra.live">aushadiai.nikhilmishra.live</a> at a hackathon,<br />and I got hooked on it.</p>
<p>Next, I wanted to build games.</p>
<p>So I recreated two games I used to play on a Nokia handset as a child:</p>
<ul>
<li><p><a target="_blank" href="https://bounce.nikhilmishra.live">bounce.nikhilmishra.live</a><br />This didn’t turn out quite as I planned, but I tried.</p>
</li>
<li><p><a target="_blank" href="https://isitcricket.nikhilmishra.live">isitcricket.nikhilmishra.live</a><br />This one was a success for me; I got praise from a lot of cool people.</p>
</li>
</ul>
<p>My brother wanted some Pokémon cards.</p>
<p>So I vibe coded <a target="_blank" href="https://pokemon.nikhilmishra.live">pokemon.nikhilmishra.live</a>,<br />where he gets unlimited cards.</p>
<hr />
<h2 id="heading-explore-amp-share">Explore &amp; Share 🎮✨</h2>
<p><strong>Visit all of them and share what you think about them with me!</strong></p>
<p>Every click and comment means a lot.</p>
<hr />
<h3 id="heading-thanks-for-reading">Thanks for Reading!</h3>
<p>If you enjoyed my journey or have any feedback, let me know in the comments below.<br />Your thoughts motivate me to build more fun projects!</p>
<hr />
]]></content:encoded></item><item><title><![CDATA[From First Principles to Production: Automating GitHub Pages Deployments with GitHub Actions]]></title><description><![CDATA[Introduction
In the world of modern software development, automation is not just a convenience but a necessity. Continuous Integration and Continuous Deployment (CI/CD) have become fundamental practices that enable developers to deliver software upda...]]></description><link>https://nikhilmishra.xyz/github-actions-deployment-workflow</link><guid isPermaLink="true">https://nikhilmishra.xyz/github-actions-deployment-workflow</guid><category><![CDATA[GitHub]]></category><category><![CDATA[github-actions]]></category><category><![CDATA[deployment]]></category><category><![CDATA[workflow]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[Devops]]></category><category><![CDATA[GitHubPages]]></category><category><![CDATA[static site generation]]></category><category><![CDATA[IaC (Infrastructure as Code)]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Fri, 14 Mar 2025 12:01:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741953582609/b2a62917-d988-44cc-a9a4-867362a0d824.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In the world of modern software development, automation is not just a convenience but a necessity. Continuous Integration and Continuous Deployment (CI/CD) have become fundamental practices that enable developers to deliver software updates rapidly and reliably. At the heart of this revolution is GitHub Actions, a powerful automation platform integrated directly into the GitHub ecosystem.</p>
<p>This blog post explores GitHub Actions from first principles, breaking down the concepts and components that make automated deployments possible. Using a real-world example of deploying a static website to GitHub Pages, we'll examine the underlying mechanics and best practices of GitHub Actions workflows.</p>
<h2 id="heading-understanding-cicd-from-first-principles">Understanding CI/CD from First Principles</h2>
<p>Before diving into the technical details, let's understand the fundamental concepts behind CI/CD:</p>
<ol>
<li><strong>Continuous Integration (CI)</strong>: The practice of frequently integrating code changes into a shared repository, followed by automated builds and tests.</li>
<li><strong>Continuous Deployment (CD)</strong>: The practice of automatically deploying every change that passes all test stages to production.</li>
</ol>
<p>At their core, these practices address a fundamental challenge in software development: <strong>how to safely and efficiently move code from development to production</strong>.</p>
<p>The fundamental components of any CI/CD system include:</p>
<ul>
<li><strong>Triggers</strong>: Events that initiate the automation workflow</li>
<li><strong>Runners</strong>: Environments where the automation tasks are executed</li>
<li><strong>Steps</strong>: Individual tasks to be performed</li>
<li><strong>Artifacts</strong>: Files produced during the workflow execution</li>
<li><strong>Environments</strong>: Deployment targets with specific configurations</li>
</ul>
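<p>These components map almost one-to-one onto GitHub Actions workflow syntax. Here is a minimal, annotated sketch of that mapping (the workflow and job names are illustrative, not from the example project):</p>
<pre><code class="lang-yaml">name: minimal-ci           # hypothetical workflow name
on: push                   # trigger: the event that starts the workflow

jobs:
  build:
    runs-on: ubuntu-latest # runner: the environment where tasks execute
    steps:                 # steps: individual tasks, run in order
      - uses: actions/checkout@v4
      - run: echo "build the site here"
      - uses: actions/upload-artifact@v4   # artifact: files produced by the run
        with:
          name: site
          path: .
</code></pre>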
<h2 id="heading-github-actions-architecture-and-components">GitHub Actions: Architecture and Components</h2>
<p>GitHub Actions is built on a simple yet powerful model that follows these first principles. Let's visualize its architecture:</p>
<pre><code class="lang-mermaid">graph TD
    A[Repository] --&gt;|Event Trigger| B[Workflow]
    B --&gt; C[Jobs]
    C --&gt; D[Steps]
    D --&gt; E[Actions]
    E --&gt; F[Outputs]
    E --&gt; G[Artifacts]
    G --&gt; H[Deployment]
</code></pre>
<p>The key components are:</p>
<ol>
<li><strong>Workflows</strong>: YAML files that define the automation process</li>
<li><strong>Events</strong>: Triggers like push, pull request, or scheduled events</li>
<li><strong>Jobs</strong>: Groups of steps that execute on the same runner</li>
<li><strong>Steps</strong>: Individual tasks that run commands or actions</li>
<li><strong>Actions</strong>: Reusable units of code that can be shared and consumed</li>
<li><strong>Runners</strong>: The compute infrastructure where workflows run</li>
</ol>
<h2 id="heading-anatomy-of-a-github-pages-deployment-workflow">Anatomy of a GitHub Pages Deployment Workflow</h2>
<p>Let's examine our example project's workflow file from first principles. The workflow is defined in <code>.github/workflows/deploy.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">name:</span> <span class="hljs-string">Deploy</span> <span class="hljs-string">to</span> <span class="hljs-string">GitHub</span> <span class="hljs-string">Pages</span>

<span class="hljs-attr">on:</span>
  <span class="hljs-attr">push:</span>
    <span class="hljs-attr">branches:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">main</span>

<span class="hljs-attr">permissions:</span>
  <span class="hljs-attr">contents:</span> <span class="hljs-string">read</span>
  <span class="hljs-attr">pages:</span> <span class="hljs-string">write</span>
  <span class="hljs-attr">id-token:</span> <span class="hljs-string">write</span>

<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">deploy:</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">github-pages</span>
      <span class="hljs-attr">url:</span> <span class="hljs-string">${{</span> <span class="hljs-string">steps.deployment.outputs.page_url</span> <span class="hljs-string">}}</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v4</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Setup</span> <span class="hljs-string">Pages</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/configure-pages@v4</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Upload</span> <span class="hljs-string">artifact</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/upload-pages-artifact@v3</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">path:</span> <span class="hljs-string">'.'</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Deploy</span> <span class="hljs-string">to</span> <span class="hljs-string">GitHub</span> <span class="hljs-string">Pages</span>
        <span class="hljs-attr">id:</span> <span class="hljs-string">deployment</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/deploy-pages@v4</span>
</code></pre>
<p>This seemingly simple file encapsulates several fundamental principles:</p>
<h3 id="heading-1-event-driven-architecture">1. <strong>Event-Driven Architecture</strong></h3>
<p>The workflow is triggered by a specific event: a push to the main branch.</p>
<pre><code class="lang-mermaid">graph LR
    A[Developer] --&gt;|Pushes to| B[Main Branch]
    B --&gt;|Triggers| C[Workflow]
</code></pre>
<h3 id="heading-2-declarative-configuration">2. <strong>Declarative Configuration</strong></h3>
<p>The entire workflow is defined declaratively, stating what should happen rather than how. This follows the principle of <strong>infrastructure as code</strong>.</p>
<h3 id="heading-3-security-first-design">3. <strong>Security-First Design</strong></h3>
<p>The <code>permissions</code> section explicitly defines the minimal set of permissions needed, following the principle of least privilege:</p>
<pre><code class="lang-mermaid">flowchart TD
    A[Workflow Permissions] --&gt; B[contents: read]
    A --&gt; C[pages: write]
    A --&gt; D[id-token: write]

    B --&gt; E[Read repository content]
    C --&gt; F[Modify GitHub Pages]
    D --&gt; G[Create &amp; use OIDC tokens]
</code></pre>
<h3 id="heading-4-sequential-pipeline-architecture">4. <strong>Sequential Pipeline Architecture</strong></h3>
<p>The steps form a sequential pipeline, where each step builds on the previous one:</p>
<pre><code class="lang-mermaid">sequenceDiagram
    participant R as Repository
    participant W as Workflow
    participant GH as GitHub Pages

    W-&gt;&gt;R: Checkout code
    W-&gt;&gt;W: Setup Pages configuration
    W-&gt;&gt;W: Create artifact
    W-&gt;&gt;GH: Deploy artifact
    GH-&gt;&gt;GH: Publish website
</code></pre>
<h2 id="heading-breaking-down-the-workflow-step-by-step-analysis">Breaking Down the Workflow: Step-by-Step Analysis</h2>
<p>Let's examine each step of the workflow from first principles:</p>
<h3 id="heading-1-checkout">1. Checkout</h3>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span>
  <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v4</span>
</code></pre>
<p>This step fetches the repository content. It's fundamental because:</p>
<ul>
<li>It provides the workflow with access to the source code</li>
<li>It allows the workflow to operate on the latest version of the code</li>
<li>Without it, the workflow would have nothing to deploy</li>
</ul>
<h3 id="heading-2-setup-pages">2. Setup Pages</h3>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Setup</span> <span class="hljs-string">Pages</span>
  <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/configure-pages@v4</span>
</code></pre>
<p>This step configures the GitHub Pages environment. It's necessary because:</p>
<ul>
<li>It initializes the required environment variables</li>
<li>It sets up the underlying GitHub Pages infrastructure</li>
<li>It prepares the system for artifact upload and deployment</li>
</ul>
<h3 id="heading-3-upload-artifact">3. Upload Artifact</h3>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Upload</span> <span class="hljs-string">artifact</span>
  <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/upload-pages-artifact@v3</span>
  <span class="hljs-attr">with:</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">'.'</span>
</code></pre>
<p>This step packages the site content. It follows the principle of <strong>immutable artifacts</strong>:</p>
<ul>
<li>It creates a snapshot of the site at a specific point in time</li>
<li>It provides a consistent package that can be deployed</li>
<li>It enables potential rollbacks by preserving the artifact</li>
</ul>
<h3 id="heading-4-deploy">4. Deploy</h3>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Deploy</span> <span class="hljs-string">to</span> <span class="hljs-string">GitHub</span> <span class="hljs-string">Pages</span>
  <span class="hljs-attr">id:</span> <span class="hljs-string">deployment</span>
  <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/deploy-pages@v4</span>
</code></pre>
<p>The final step actually publishes the site. It embodies the principle of <strong>separation of concerns</strong>:</p>
<ul>
<li>Building the artifact and deploying it are separate operations</li>
<li>This allows for different permissions and controls at each stage</li>
<li>It supports a more secure deployment pipeline</li>
</ul>
<h2 id="heading-the-principle-of-least-privilege-in-action">The Principle of Least Privilege in Action</h2>
<p>One of the most important security principles is providing only the minimum permissions necessary. Our workflow demonstrates this with:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">permissions:</span>
  <span class="hljs-attr">contents:</span> <span class="hljs-string">read</span>
  <span class="hljs-attr">pages:</span> <span class="hljs-string">write</span>
  <span class="hljs-attr">id-token:</span> <span class="hljs-string">write</span>
</code></pre>
<p>This explicit permission model:</p>
<ul>
<li>Prevents the workflow from modifying repository content</li>
<li>Allows writing only to GitHub Pages</li>
<li>Provides identity token access for secure deployments</li>
</ul>
<p>Let's visualize how these permissions integrate with the workflow:</p>
<pre><code class="lang-mermaid">graph TD
    subgraph "Repository Boundary"
        A[Repository Content] --- B[Read-Only Access]
    end

    subgraph "GitHub Pages Boundary"
        C[GitHub Pages] --- D[Write Access]
    end

    subgraph "Authentication"
        E[OIDC Tokens] --- F[Write Access]
    end

    B --&gt; G[Checkout Step]
    D --&gt; H[Deploy Step]
    F --&gt; I[Authentication for Deployment]
</code></pre>
<h2 id="heading-environment-based-deployment">Environment-Based Deployment</h2>
<p>The workflow uses a specific environment for deployment:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">environment:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">github-pages</span>
  <span class="hljs-attr">url:</span> <span class="hljs-string">${{</span> <span class="hljs-string">steps.deployment.outputs.page_url</span> <span class="hljs-string">}}</span>
</code></pre>
<p>This follows the principle of <strong>environment segregation</strong>:</p>
<ul>
<li>Deployment targets are explicitly defined</li>
<li>Each environment can have its own protection rules</li>
<li>Deployment URLs are tracked and linked to the workflow</li>
</ul>
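<p>As a sketch of what environment segregation can look like beyond a single <code>github-pages</code> target, a repository could define separate staging and production environments, each with its own protection rules configured in the repository settings. The environment names and URL below are assumptions for illustration:</p>
<pre><code class="lang-yaml">jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging          # can require reviewers, wait timers, etc.
    steps:
      - run: echo "deploy to staging"

  deploy-production:
    needs: deploy-staging         # only runs after staging succeeds
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://example.com    # placeholder URL shown on the run summary
    steps:
      - run: echo "deploy to production"
</code></pre>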
<h2 id="heading-first-principles-of-static-site-deployment">First Principles of Static Site Deployment</h2>
<p>Our example deploys a simple static HTML site:</p>
<pre><code class="lang-html"><span class="hljs-meta">&lt;!DOCTYPE <span class="hljs-meta-keyword">html</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">html</span> <span class="hljs-attr">lang</span>=<span class="hljs-string">"en"</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">head</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">meta</span> <span class="hljs-attr">charset</span>=<span class="hljs-string">"UTF-8"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">meta</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"viewport"</span> <span class="hljs-attr">content</span>=<span class="hljs-string">"width=device-width, initial-scale=1.0"</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">title</span>&gt;</span>My GitHub Pages Site<span class="hljs-tag">&lt;/<span class="hljs-name">title</span>&gt;</span>
    <span class="hljs-comment">&lt;!-- CSS styling omitted for brevity --&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">head</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">body</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"container"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Hello, GitHub Actions!<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">p</span>&gt;</span>This site is deployed using GitHub Actions and GitHub Pages.<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">p</span>&gt;</span>Last updated: February 4, 2025 at 10:10 AM IST<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">body</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">html</span>&gt;</span>
</code></pre>
<p>From first principles, deployment of static content requires:</p>
<ol>
<li><strong>Content Storage</strong>: A place to store the HTML, CSS, and JavaScript files</li>
<li><strong>Content Serving</strong>: A web server to deliver the files to browsers</li>
<li><strong>Content Delivery</strong>: A way to efficiently distribute the content</li>
</ol>
<p>GitHub Pages handles all three aspects:</p>
<ul>
<li>It stores the content in GitHub's infrastructure</li>
<li>It serves the content via GitHub's servers</li>
<li>It delivers the content via GitHub's CDN</li>
</ul>
<h2 id="heading-the-complete-deployment-flow">The Complete Deployment Flow</h2>
<p>Let's visualize the entire process from code push to website delivery:</p>
<pre><code class="lang-mermaid">graph TB
    A[Developer] --&gt;|1. Push to main| B[GitHub Repository]
    B --&gt;|2. Trigger workflow| C[GitHub Actions]

    subgraph "GitHub Actions Workflow"
        C --&gt;|3. Checkout code| D[Runner]
        D --&gt;|4. Setup Pages| E[Configure Environment]
        E --&gt;|5. Create artifact| F[Artifact]
        F --&gt;|6. Deploy| G[GitHub Pages Service]
    end

    G --&gt;|7. Publish| H[Live Website]
    H --&gt;|8. Serve content| I[End Users]

    classDef blue fill:#2b88d8,stroke:#000,stroke-width:1px,color:white;
    classDef green fill:#25a767,stroke:#000,stroke-width:1px,color:white;
    classDef orange fill:#ff9900,stroke:#000,stroke-width:1px,color:white;

    class A,B blue
    class C,D,E,F green
    class G,H,I orange
</code></pre>
<h2 id="heading-implementation-considerations">Implementation Considerations</h2>
<p>When implementing GitHub Actions workflows from first principles, consider:</p>
<h3 id="heading-1-workflow-isolation">1. Workflow Isolation</h3>
<p>Each workflow should have a single responsibility. For our static site, deployment is the sole responsibility. For more complex applications, you might have separate workflows for:</p>
<ul>
<li>Building and testing</li>
<li>Security scanning</li>
<li>Deployment to staging</li>
<li>Deployment to production</li>
</ul>
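<p>One way to keep such workflows isolated while still chaining them is the <code>workflow_run</code> trigger, which starts one workflow when another finishes. A sketch, with hypothetical workflow names:</p>
<pre><code class="lang-yaml"># .github/workflows/deploy-staging.yml (hypothetical file)
name: Deploy to Staging
on:
  workflow_run:
    workflows: ["Build and Test"]  # runs after this workflow finishes
    types: [completed]

jobs:
  deploy:
    # only deploy if the upstream workflow succeeded
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploy to staging"
</code></pre>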
<h3 id="heading-2-artifact-immutability">2. Artifact Immutability</h3>
<p>Once created, artifacts should not be modified. This ensures consistency across environments and enables reliable rollbacks.</p>
<h3 id="heading-3-idempotent-deployments">3. Idempotent Deployments</h3>
<p>Deployments should be idempotent: running the same deployment multiple times should result in the same final state.</p>
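<p>GitHub Actions does not make a deployment idempotent by itself, but a related safeguard is to serialize deployments so two runs never interleave and race toward different final states. The <code>concurrency</code> key expresses this (the group name here is an assumption):</p>
<pre><code class="lang-yaml">concurrency:
  group: "pages-deploy"      # hypothetical group name shared by deploy runs
  cancel-in-progress: false  # queue new runs instead of overlapping them
</code></pre>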
<h3 id="heading-4-failure-handling">4. Failure Handling</h3>
<p>Workflows should fail fast and provide clear error messages. This reduces troubleshooting time and improves developer experience.</p>
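<p>A small pattern for surfacing failures is a step guarded by the <code>failure()</code> status function, which runs only when an earlier step in the job has failed:</p>
<pre><code class="lang-yaml">steps:
  - name: Deploy
    id: deployment
    uses: actions/deploy-pages@v4
  - name: Report failure
    if: failure()   # runs only if a previous step failed
    run: echo "Deployment of job ${{ github.job }} failed; check the step logs"
</code></pre>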
<h2 id="heading-extending-the-workflow">Extending the Workflow</h2>
<p>From first principles, we can extend our basic workflow to include additional steps:</p>
<h3 id="heading-1-testing">1. Testing</h3>
<p>Add automated tests to verify site functionality:</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Test</span>
  <span class="hljs-attr">run:</span> <span class="hljs-string">|
    npm install -g htmlhint
    htmlhint index.html</span>
</code></pre>
<h3 id="heading-2-performance-optimization">2. Performance Optimization</h3>
<p>Optimize assets before deployment:</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Optimize</span>
  <span class="hljs-attr">run:</span> <span class="hljs-string">|
    npm install -g html-minifier
    html-minifier --collapse-whitespace index.html -o index.html</span>
</code></pre>
<h3 id="heading-3-security-scanning">3. Security Scanning</h3>
<p>Add security checks to prevent vulnerable code from being deployed:</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Security</span> <span class="hljs-string">scan</span>
  <span class="hljs-attr">uses:</span> <span class="hljs-string">aquasecurity/trivy-action@master</span>
  <span class="hljs-attr">with:</span>
    <span class="hljs-attr">scan-type:</span> <span class="hljs-string">'fs'</span>
    <span class="hljs-attr">format:</span> <span class="hljs-string">'table'</span>
</code></pre>
<h2 id="heading-conclusion-from-principles-to-practice">Conclusion: From Principles to Practice</h2>
<p>By understanding GitHub Actions from first principles, we gain insights beyond simply following tutorials:</p>
<ol>
<li><p><strong>We understand why each component exists</strong>: Each part of the workflow serves a specific purpose in the deployment pipeline.</p>
</li>
<li><p><strong>We can troubleshoot effectively</strong>: Knowledge of the underlying principles helps identify and fix issues when they arise.</p>
</li>
<li><p><strong>We can extend and customize</strong>: Instead of blindly copying examples, we can adapt workflows to our specific needs.</p>
</li>
<li><p><strong>We make better security decisions</strong>: Understanding the permission model allows us to implement the principle of least privilege.</p>
</li>
</ol>
<p>GitHub Actions workflows embody fundamental software engineering principles: separation of concerns, infrastructure as code, principle of least privilege, and immutable artifacts. By applying these principles to our deployments, we create robust, secure, and maintainable automation pipelines.</p>
<p>The next time you set up a GitHub Actions workflow, consider the principles behind each configuration option. This first-principles approach will lead to more thoughtful and effective automation strategies.</p>
<hr />
<p><em>Would you like to learn more about CI/CD principles or explore more advanced GitHub Actions workflows? Let me know in the comments below!</em></p>
]]></content:encoded></item><item><title><![CDATA[Building a Cloud-Native DevOps Pipeline from First Principles]]></title><description><![CDATA[Introduction
In today's rapidly evolving technological landscape, understanding how to build and deploy applications using cloud-native methodologies is essential for any software engineer. This blog post details the implementation of a complete DevO...]]></description><link>https://nikhilmishra.xyz/gitops-majorproject</link><guid isPermaLink="true">https://nikhilmishra.xyz/gitops-majorproject</guid><category><![CDATA[Devops]]></category><category><![CDATA[cloud native]]></category><category><![CDATA[cicd]]></category><category><![CDATA[gitops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[AWS]]></category><category><![CDATA[GitHub Actions]]></category><category><![CDATA[#IaC]]></category><category><![CDATA[containerization]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Sun, 09 Mar 2025 12:12:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741522249065/762323d6-e320-4d67-b3fc-13c0e0ce78de.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In today's rapidly evolving technological landscape, understanding how to build and deploy applications using cloud-native methodologies is essential for any software engineer. This blog post details the implementation of a complete DevOps pipeline for the vProfile application, a multi-tier Java web application, using Infrastructure as Code (IaC), containerization, Kubernetes orchestration, and CI/CD practices.</p>
<p>This project applies modern DevOps practices from first principles, creating a robust, scalable, and automated deployment pipeline on AWS cloud infrastructure. We'll explore each component of the system, understand the underlying principles, and see how they work together to create a seamless deployment experience.</p>
<h2 id="heading-project-architecture-overview">Project Architecture Overview</h2>
<p>The vProfile project utilizes a microservices architecture deployed on AWS using Kubernetes. Before diving into the implementation details, let's understand the high-level architecture:</p>
<pre><code class="lang-mermaid">flowchart TD
    subgraph "CI/CD Pipeline"
        GH[GitHub Repositories]
        GA[GitHub Actions]
        TS[Test &amp; SonarQube]
        DB[Docker Build]
        ECR[Amazon ECR]
    end

    subgraph "AWS Infrastructure"
        TF[Terraform]
        VPC[AWS VPC]
        EKS[Amazon EKS]
        S3[S3 State Backend]
    end

    subgraph "Kubernetes Deployment"
        HC[Helm Charts]
        ING[NGINX Ingress]
        APP[Vprofile App]
        DB2[MySQL]
        MC[Memcached]
        RMQ[RabbitMQ]
    end

    GH --&gt; GA
    GA --&gt; TS
    TS --&gt; DB
    DB --&gt; ECR

    GH --&gt; TF
    TF --&gt; VPC
    TF --&gt; EKS
    TF --&gt; S3
    TF --&gt; ING

    ECR --&gt; HC
    HC --&gt; APP
    HC --&gt; DB2
    HC --&gt; MC
    HC --&gt; RMQ

    ING --&gt; APP
</code></pre>
<p>The project is divided into two main repositories:</p>
<ol>
<li><strong>iac-vprofile</strong>: Responsible for infrastructure provisioning using Terraform</li>
<li><strong>vprofile-action</strong>: Contains the application code and deployment configurations</li>
</ol>
<p>This separation of concerns ensures that infrastructure and application code can evolve independently while maintaining a cohesive deployment strategy.</p>
<h2 id="heading-first-principles-understanding-the-core-concepts">First Principles: Understanding the Core Concepts</h2>
<h3 id="heading-infrastructure-as-code-iac">Infrastructure as Code (IaC)</h3>
<p>At its core, Infrastructure as Code is about managing infrastructure through machine-readable definition files rather than manual processes. This approach offers several key benefits:</p>
<ol>
<li><strong>Reproducibility</strong>: Infrastructure can be consistently reproduced across different environments</li>
<li><strong>Version Control</strong>: Infrastructure changes can be tracked, reviewed, and rolled back</li>
<li><strong>Automation</strong>: Reduces manual errors and increases deployment speed</li>
<li><strong>Documentation</strong>: The code itself documents the infrastructure</li>
</ol>
<p>In our project, we use Terraform to define our AWS infrastructure, including VPC, subnets, and the EKS cluster.</p>
<h3 id="heading-containerization">Containerization</h3>
<p>Containers encapsulate an application and its dependencies into a self-contained unit that can run anywhere. The key principles include:</p>
<ol>
<li><strong>Isolation</strong>: Applications run in isolated environments</li>
<li><strong>Portability</strong>: Containers run consistently across different environments</li>
<li><strong>Efficiency</strong>: Lighter weight than virtual machines</li>
<li><strong>Scalability</strong>: Containers can be easily scaled horizontally</li>
</ol>
<p>We use Docker to containerize our vProfile application, creating a multi-stage build process that optimizes the final image size.</p>
<h3 id="heading-orchestration">Orchestration</h3>
<p>Container orchestration automates the deployment, scaling, and management of containerized applications. Core principles include:</p>
<ol>
<li><strong>Service Discovery</strong>: Containers can find and communicate with each other</li>
<li><strong>Load Balancing</strong>: Traffic is distributed across containers</li>
<li><strong>Self-healing</strong>: Failed containers are automatically replaced</li>
<li><strong>Scaling</strong>: Applications can scale up or down based on demand</li>
</ol>
<p>Amazon EKS (Elastic Kubernetes Service) serves as our orchestration platform, providing a managed Kubernetes environment.</p>
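<p>These principles surface directly in Kubernetes manifests. Below is a minimal, hypothetical Deployment and Service for the application tier; the image name, labels, and replica count are illustrative, and the real project templates these resources through Helm charts:</p>
<pre><code class="lang-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: vproapp
spec:
  replicas: 3                    # scaling: three identical pods
  selector:
    matchLabels:
      app: vproapp
  template:
    metadata:
      labels:
        app: vproapp
    spec:
      containers:
        - name: vproapp
          image: example/vprofile-app:latest  # placeholder image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: vproapp-service          # service discovery: stable in-cluster DNS name
spec:
  selector:
    app: vproapp                 # load balancing across matching pods
  ports:
    - port: 8080
      targetPort: 8080
</code></pre>
<p>Self-healing follows from the Deployment's replica count: if a pod dies, the controller replaces it to restore the declared state.</p>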
<h3 id="heading-continuous-integration-and-continuous-deployment-cicd">Continuous Integration and Continuous Deployment (CI/CD)</h3>
<p>CI/CD bridges the gap between development and operations by automating the building, testing, and deployment processes:</p>
<ol>
<li><strong>Continuous Integration</strong>: Code changes are regularly built and tested</li>
<li><strong>Continuous Delivery</strong>: Code is always in a deployable state</li>
<li><strong>Continuous Deployment</strong>: Code changes are automatically deployed to production</li>
<li><strong>Feedback Loops</strong>: Developers get quick feedback on changes</li>
</ol>
<p>GitHub Actions powers our CI/CD pipeline, automating everything from code testing to Kubernetes deployment.</p>
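<p>A condensed sketch of what such a pipeline job can look like, from checkout to pushing an image to Amazon ECR. The secrets, region, and repository name below are assumptions for illustration, not taken from the project's actual workflow:</p>
<pre><code class="lang-yaml">jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
      - name: Login to Amazon ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push image
        run: |
          docker build -t ${{ steps.ecr.outputs.registry }}/vprofileapp:${{ github.sha }} .
          docker push ${{ steps.ecr.outputs.registry }}/vprofileapp:${{ github.sha }}
</code></pre>
<p>Tagging images with the commit SHA rather than <code>latest</code> keeps each build traceable back to the exact code that produced it.</p>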
<h2 id="heading-implementing-the-infrastructure-with-terraform">Implementing the Infrastructure with Terraform</h2>
<h3 id="heading-vpc-configuration">VPC Configuration</h3>
<p>The foundation of our AWS infrastructure is a well-architected Virtual Private Cloud (VPC):</p>
<pre><code class="lang-mermaid">flowchart TD
    subgraph "AWS Region us-east-2"
        VPC["VPC: 172.20.0.0/16"]

        subgraph "Availability Zones"
            AZ1["AZ 1"]
            AZ2["AZ 2"]
            AZ3["AZ 3"]
        end

        subgraph "Private Subnets"
            PS1["172.20.1.0/24"]
            PS2["172.20.2.0/24"]
            PS3["172.20.3.0/24"]
        end

        subgraph "Public Subnets"
            PUS1["172.20.4.0/24"]
            PUS2["172.20.5.0/24"]
            PUS3["172.20.6.0/24"]
        end

        NG["NAT Gateway"]
        IGW["Internet Gateway"]
    end

    VPC --&gt; AZ1
    VPC --&gt; AZ2
    VPC --&gt; AZ3

    AZ1 --&gt; PS1
    AZ2 --&gt; PS2
    AZ3 --&gt; PS3

    AZ1 --&gt; PUS1
    AZ2 --&gt; PUS2
    AZ3 --&gt; PUS3

    PUS1 --&gt; IGW
    PUS2 --&gt; IGW
    PUS3 --&gt; IGW

    PS1 --&gt; NG
    PS2 --&gt; NG
    PS3 --&gt; NG

    NG --&gt; IGW
</code></pre>
<p>Our Terraform configuration creates a VPC with a CIDR block of 172.20.0.0/16, spanning three availability zones for high availability. It includes:</p>
<ul>
<li>Three private subnets for EKS worker nodes</li>
<li>Three public subnets for the load balancer</li>
<li>NAT gateway for outbound internet access from private subnets</li>
<li>Appropriate tags for Kubernetes integration</li>
</ul>
<p>Here's a key excerpt from our <code>vpc.tf</code>:</p>
<pre><code class="lang-hcl">module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.2"

  name = "vprofile-eks"
  cidr = "172.20.0.0/16"
  azs  = slice(data.aws_availability_zones.available.names, 0, 3)

  private_subnets = ["172.20.1.0/24", "172.20.2.0/24", "172.20.3.0/24"]
  public_subnets  = ["172.20.4.0/24", "172.20.5.0/24", "172.20.6.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  public_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                      = 1
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"             = 1
  }
}
</code></pre>
<h3 id="heading-eks-cluster-configuration">EKS Cluster Configuration</h3>
<p>Amazon EKS provides a managed Kubernetes control plane, while our worker nodes run in the private subnets of our VPC:</p>
<pre><code class="lang-mermaid">flowchart TD
    subgraph "Amazon EKS"
        CP["Control Plane"]

        subgraph "Node Group 1"
            NG1N1["t3.small"]
            NG1N2["t3.small"]
        end

        subgraph "Node Group 2"
            NG2N1["t3.small"]
        end
    end

    subgraph "VPC"
        PS["Private Subnets"]
    end

    subgraph "Autoscaling"
        ASG["ASG Config:
        Group 1: 1-3 nodes
        Group 2: 1-2 nodes"]
    end

    CP --&gt; NG1N1
    CP --&gt; NG1N2
    CP --&gt; NG2N1

    NG1N1 --&gt; PS
    NG1N2 --&gt; PS
    NG2N1 --&gt; PS

    ASG --&gt; NG1N1
    ASG --&gt; NG1N2
    ASG --&gt; NG2N1
</code></pre>
<p>Our EKS configuration creates a cluster with version 1.27 and two managed node groups running on t3.small instances:</p>
<pre><code class="lang-hcl">module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.19.1"

  cluster_name    = local.cluster_name
  cluster_version = "1.27"

  vpc_id                         = module.vpc.vpc_id
  subnet_ids                     = module.vpc.private_subnets
  cluster_endpoint_public_access = true

  eks_managed_node_group_defaults = {
    ami_type = "AL2_x86_64"
  }

  eks_managed_node_groups = {
    one = {
      name = "node-group-1"
      instance_types = ["t3.small"]
      min_size     = 1
      max_size     = 3
      desired_size = 2
    }

    two = {
      name = "node-group-2"
      instance_types = ["t3.small"]
      min_size     = 1
      max_size     = 2
      desired_size = 1
    }
  }
}
</code></pre>
<h3 id="heading-terraform-workflow-automation">Terraform Workflow Automation</h3>
<p>We use GitHub Actions to automate the Terraform workflow, ensuring consistent infrastructure deployments:</p>
<pre><code class="lang-mermaid">sequenceDiagram
    actor Developer
    participant GitHub as GitHub Repository
    participant Actions as GitHub Actions
    participant S3 as S3 Backend
    participant AWS as AWS Services

    Developer-&gt;&gt;GitHub: Push code to main branch
    GitHub-&gt;&gt;Actions: Trigger workflow

    Actions-&gt;&gt;Actions: terraform init
    Actions-&gt;&gt;S3: Retrieve state
    S3-&gt;&gt;Actions: Return state

    Actions-&gt;&gt;Actions: terraform fmt check
    Actions-&gt;&gt;Actions: terraform validate
    Actions-&gt;&gt;Actions: terraform plan

    alt if main branch
        Actions-&gt;&gt;Actions: terraform apply
        Actions-&gt;&gt;AWS: Create/update resources
        AWS--&gt;&gt;Actions: Resources created/updated

        Actions-&gt;&gt;AWS: Configure kubectl
        Actions-&gt;&gt;AWS: Install NGINX Ingress
    end
</code></pre>
<p>The workflow includes:</p>
<ol>
<li>Terraform initialization with S3 backend</li>
<li>Format checking and validation</li>
<li>Planning the infrastructure changes</li>
<li>Applying changes only on the main branch</li>
<li>Configuring kubectl and installing the NGINX ingress controller</li>
</ol>
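<p>Run by hand, the same sequence the pipeline automates looks roughly like the following (the state bucket name is a placeholder, and the ingress-nginx manifest URL is illustrative; pin a release tag in practice):</p>
<pre><code class="lang-bash"># Initialize against the remote S3 state backend
terraform init -backend-config="bucket=YOUR_STATE_BUCKET" -backend-config="key=terraform.tfstate"

# Static checks before any changes are planned
terraform fmt -check
terraform validate

# Preview, then apply (CI applies only on the main branch)
terraform plan -out=planfile
terraform apply -auto-approve planfile

# Point kubectl at the new cluster and install the NGINX ingress controller
aws eks update-kubeconfig --region us-east-2 --name vprofile-eks
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/aws/deploy.yaml
</code></pre>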
<h2 id="heading-application-architecture-and-containerization">Application Architecture and Containerization</h2>
<h3 id="heading-vprofile-application-components">vProfile Application Components</h3>
<p>The vProfile application consists of multiple microservices:</p>
<pre><code class="lang-mermaid">flowchart TD
    subgraph "vProfile Application"
        WA["Web Application (Tomcat)"]
        DB["MySQL Database"]
        MC["Memcached"]
        RMQ["RabbitMQ"]
    end

    User["User"] --&gt; WA
    WA --&gt; DB
    WA --&gt; MC
    WA --&gt; RMQ
</code></pre>
<p>Each component is containerized and deployed as a separate service in Kubernetes.</p>
<h3 id="heading-multi-stage-docker-build">Multi-stage Docker Build</h3>
<p>We use a multi-stage Docker build to optimize our application container:</p>
<pre><code class="lang-mermaid">flowchart LR
    subgraph "Build Stage"
        JDK["OpenJDK 11"]
        MVN["Maven Build"]
        WAR["vprofile-v2.war"]
    end

    subgraph "Final Stage"
        TC["Tomcat 9"]
        DEPLOY["Deploy WAR"]
    end

    JDK --&gt; MVN
    MVN --&gt; WAR
    WAR --&gt; DEPLOY
    TC --&gt; DEPLOY
</code></pre>
<p>The Dockerfile efficiently builds the application and creates a lean production image:</p>
<pre><code class="lang-dockerfile"><span class="hljs-keyword">FROM</span> openjdk:<span class="hljs-number">11</span> AS BUILD_IMAGE
<span class="hljs-keyword">RUN</span><span class="bash"> apt update &amp;&amp; apt install maven -y</span>
<span class="hljs-keyword">COPY</span><span class="bash"> ./ vprofile-project</span>
<span class="hljs-keyword">RUN</span><span class="bash"> <span class="hljs-built_in">cd</span> vprofile-project &amp;&amp;  mvn install </span>

<span class="hljs-keyword">FROM</span> tomcat:<span class="hljs-number">9</span>-jre11
<span class="hljs-keyword">LABEL</span><span class="bash"> <span class="hljs-string">"Project"</span>=<span class="hljs-string">"Vprofile"</span></span>
<span class="hljs-keyword">LABEL</span><span class="bash"> <span class="hljs-string">"Author"</span>=<span class="hljs-string">"Imran"</span></span>
<span class="hljs-keyword">RUN</span><span class="bash"> rm -rf /usr/<span class="hljs-built_in">local</span>/tomcat/webapps/*</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=BUILD_IMAGE vprofile-project/target/vprofile-v2.war /usr/<span class="hljs-built_in">local</span>/tomcat/webapps/ROOT.war</span>

<span class="hljs-keyword">EXPOSE</span> <span class="hljs-number">8080</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"catalina.sh"</span>, <span class="hljs-string">"run"</span>]</span>
</code></pre>
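<p>The image can be built and smoke-tested locally before the pipeline ever runs; the tag below is illustrative:</p>
<pre><code class="lang-bash"># Build the multi-stage image; only the Tomcat stage ends up in the final tag
docker build -t vprofileapp:local .

# Run it and hit the port the Dockerfile exposes
docker run -d -p 8080:8080 --name vproapp vprofileapp:local
curl -I http://localhost:8080/
</code></pre>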
<h2 id="heading-kubernetes-deployment-with-helm">Kubernetes Deployment with Helm</h2>
<h3 id="heading-helm-charts-structure">Helm Charts Structure</h3>
<p>Helm is used to template and parameterize our Kubernetes manifests:</p>
<pre><code class="lang-mermaid">flowchart TD
    subgraph "Helm Chart Structure"
        CH["vprofilecharts/"]
        VAL["values.yaml"]
        TPL["templates/"]

        subgraph "Templates"
            APP["vproappdep.yml"]
            SVC["Service definitions"]
            ING["vproingress.yaml"]
            DB["Database templates"]
            MC["Memcached templates"]
            RMQ["RabbitMQ templates"]
        end
    end

    CH --&gt; VAL
    CH --&gt; TPL
    TPL --&gt; APP
    TPL --&gt; SVC
    TPL --&gt; ING
    TPL --&gt; DB
    TPL --&gt; MC
    TPL --&gt; RMQ
</code></pre>
<h3 id="heading-application-deployment">Application Deployment</h3>
<p>The application deployment uses init containers to block startup until its backing services (the database and cache) are resolvable in cluster DNS:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">vproapp</span>
  <span class="hljs-attr">labels:</span> 
    <span class="hljs-attr">app:</span> <span class="hljs-string">vproapp</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">vproapp</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">vproapp</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">vproapp</span>
        <span class="hljs-attr">image:</span> {{ <span class="hljs-string">.Values.appimage</span>}}<span class="hljs-string">:{{</span> <span class="hljs-string">.Values.apptag}}</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">vproapp-port</span>
          <span class="hljs-attr">containerPort:</span> <span class="hljs-number">8080</span>
      <span class="hljs-attr">initContainers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">init-mydb</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">busybox</span>
        <span class="hljs-attr">command:</span> [<span class="hljs-string">'sh'</span>, <span class="hljs-string">'-c'</span>, <span class="hljs-string">'until nslookup vprodb.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for mydb; sleep 2; done;'</span>]
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">init-memcache</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">busybox</span>
        <span class="hljs-attr">command:</span> [<span class="hljs-string">'sh'</span>, <span class="hljs-string">'-c'</span>, <span class="hljs-string">'until nslookup vprocache01.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for mydb; sleep 2; done;'</span>]
</code></pre>
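<p>The <code>appimage</code> and <code>apptag</code> placeholders in the template above are resolved at install time. A hedged local equivalent of what the pipeline runs (the registry URL and tag value are placeholders; CI uses the ECR registry and <code>github.run_number</code>):</p>
<pre><code class="lang-bash">helm upgrade --install vprofile-stack helm/vprofilecharts \
  --namespace default \
  --set appimage=ACCOUNT_ID.dkr.ecr.us-east-2.amazonaws.com/vprofileapp \
  --set apptag=42
</code></pre>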
<h3 id="heading-ingress-configuration">Ingress Configuration</h3>
<p>The NGINX ingress controller routes external traffic to our application:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Ingress</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">vpro-ingress</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">nginx.ingress.kubernetes.io/use-regex:</span> <span class="hljs-string">"true"</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">ingressClassName:</span> <span class="hljs-string">nginx</span>
  <span class="hljs-attr">rules:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">host:</span> <span class="hljs-string">majorproject.nikhilmishra.live</span>
    <span class="hljs-attr">http:</span>
      <span class="hljs-attr">paths:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">/</span>
        <span class="hljs-attr">pathType:</span> <span class="hljs-string">Prefix</span>
        <span class="hljs-attr">backend:</span>
          <span class="hljs-attr">service:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">my-app</span>
            <span class="hljs-attr">port:</span>
              <span class="hljs-attr">number:</span> <span class="hljs-number">8080</span>
</code></pre>
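<p>Once the ingress is applied and DNS for the host points at the controller's load balancer, routing can be checked without waiting for DNS propagation by sending the Host header directly (the load balancer address below is illustrative):</p>
<pre><code class="lang-bash"># Confirm the ingress has been assigned an address
kubectl get ingress vpro-ingress

# Exercise the host-based rule against the load balancer directly
curl -H "Host: majorproject.nikhilmishra.live" http://ELB_DNS_NAME/
</code></pre>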
<h2 id="heading-cicd-pipeline-with-github-actions">CI/CD Pipeline with GitHub Actions</h2>
<h3 id="heading-application-cicd-workflow">Application CI/CD Workflow</h3>
<p>Our GitHub Actions workflow for the application pipeline includes testing, building, and deploying:</p>
<pre><code class="lang-mermaid">sequenceDiagram
    actor Developer
    participant GitHub as GitHub
    participant GHCI as GitHub Actions
    participant Sonar as SonarQube
    participant Docker as Docker Build
    participant ECR as Amazon ECR
    participant EKS as Amazon EKS

    Developer-&gt;&gt;GitHub: Push code
    GitHub-&gt;&gt;GHCI: Trigger workflow

    GHCI-&gt;&gt;GHCI: Maven test
    GHCI-&gt;&gt;GHCI: Checkstyle
    GHCI-&gt;&gt;Sonar: SonarQube scan

    GHCI-&gt;&gt;Docker: Build image
    Docker-&gt;&gt;ECR: Push image

    GHCI-&gt;&gt;EKS: Configure kubectl
    GHCI-&gt;&gt;EKS: Create Docker registry secret
    GHCI-&gt;&gt;EKS: Deploy with Helm
</code></pre>
<p>The workflow includes:</p>
<ol>
<li><p><strong>Testing Phase</strong>:</p>
<ul>
<li>Maven tests</li>
<li>Code style checks</li>
<li>SonarQube analysis for code quality</li>
</ul>
</li>
<li><p><strong>Build and Publish Phase</strong>:</p>
<ul>
<li>Docker image building</li>
<li>Push to Amazon ECR</li>
</ul>
</li>
<li><p><strong>Deployment Phase</strong>:</p>
<ul>
<li>Configure kubectl</li>
<li>Create registry credentials</li>
<li>Deploy using Helm</li>
</ul>
</li>
</ol>
<p>Here's a key excerpt from the workflow file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">name:</span> <span class="hljs-string">vprofile</span> <span class="hljs-string">actions</span>
<span class="hljs-attr">on:</span> <span class="hljs-string">workflow_dispatch</span>
<span class="hljs-attr">env:</span>
  <span class="hljs-attr">AWS_REGION:</span> <span class="hljs-string">us-east-2</span>
  <span class="hljs-attr">ECR_REPOSITORY:</span> <span class="hljs-string">vprofileapp</span>
  <span class="hljs-attr">EKS_CLUSTER:</span> <span class="hljs-string">vprofile-eks</span>

<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">Testing:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Code</span> <span class="hljs-string">checkout</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v4</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Maven</span> <span class="hljs-string">test</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">mvn</span> <span class="hljs-string">test</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkstyle</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">mvn</span> <span class="hljs-string">checkstyle:checkstyle</span>

      <span class="hljs-comment"># More testing steps...</span>

  <span class="hljs-attr">BUILD_AND_PUBLISH:</span>   
    <span class="hljs-attr">needs:</span> <span class="hljs-string">Testing</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Code</span> <span class="hljs-string">checkout</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v4</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">&amp;</span> <span class="hljs-string">Upload</span> <span class="hljs-string">image</span> <span class="hljs-string">to</span> <span class="hljs-string">ECR</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">appleboy/docker-ecr-action@master</span>
        <span class="hljs-attr">with:</span>
         <span class="hljs-attr">access_key:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.AWS_ACCESS_KEY_ID</span> <span class="hljs-string">}}</span>
         <span class="hljs-attr">secret_key:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.AWS_SECRET_ACCESS_KEY</span> <span class="hljs-string">}}</span>
         <span class="hljs-attr">registry:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.REGISTRY</span> <span class="hljs-string">}}</span>
         <span class="hljs-attr">repo:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.ECR_REPOSITORY</span> <span class="hljs-string">}}</span>
         <span class="hljs-attr">region:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.AWS_REGION</span> <span class="hljs-string">}}</span>
         <span class="hljs-attr">tags:</span> <span class="hljs-string">latest,${{</span> <span class="hljs-string">github.run_number</span> <span class="hljs-string">}}</span>
         <span class="hljs-attr">daemon_off:</span> <span class="hljs-literal">false</span>
         <span class="hljs-attr">dockerfile:</span> <span class="hljs-string">./Dockerfile</span>
         <span class="hljs-attr">context:</span> <span class="hljs-string">./</span>

  <span class="hljs-attr">DeployToEKS:</span>
    <span class="hljs-attr">needs:</span> <span class="hljs-string">BUILD_AND_PUBLISH</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-comment"># Deployment steps...</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Deploy</span> <span class="hljs-string">Helm</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">bitovi/github-actions-deploy-eks-helm@v1.2.8</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">aws-access-key-id:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.AWS_ACCESS_KEY_ID</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">aws-secret-access-key:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.AWS_SECRET_ACCESS_KEY</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">aws-region:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.AWS_REGION</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">cluster-name:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.EKS_CLUSTER</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">chart-path:</span> <span class="hljs-string">helm/vprofilecharts</span>
          <span class="hljs-attr">namespace:</span> <span class="hljs-string">default</span>
          <span class="hljs-attr">values:</span> <span class="hljs-string">appimage=${{</span> <span class="hljs-string">secrets.REGISTRY</span> <span class="hljs-string">}}/${{</span> <span class="hljs-string">env.ECR_REPOSITORY</span> <span class="hljs-string">}},apptag=${{</span> <span class="hljs-string">github.run_number</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">vprofile-stack</span>
</code></pre>
<h2 id="heading-the-complete-system-integration-and-flow">The Complete System: Integration and Flow</h2>
<p>Now that we've examined each component individually, let's see how they work together in a complete CI/CD pipeline:</p>
<pre><code class="lang-mermaid">flowchart TD
    subgraph "Developer Workflow"
        IC["Infrastructure Code Changes"]
        AC["Application Code Changes"]

        subgraph "GitHub Repositories"
            IRep["iac-vprofile"]
            ARep["vprofile-action"]
        end
    end

    subgraph "Infrastructure Pipeline"
        IGH["GitHub Actions"]
        TInit["Terraform Init"]
        TPlan["Terraform Plan"]
        TApply["Terraform Apply"]
        KConfig["kubectl Config"]
        NGinst["NGINX Ingress Install"]
    end

    subgraph "Application Pipeline"
        AGH["GitHub Actions"]
        Test["Maven Tests"]
        CS["Checkstyle"]
        SQ["SonarQube Analysis"]
        DocBuild["Docker Build"]
        DocPush["Push to ECR"]
        HDepl["Helm Deployment"]
    end

    subgraph "AWS Infrastructure"
        VPC["AWS VPC"]
        EKS["Amazon EKS"]
        ECR["Amazon ECR"]
        S3["S3 State Bucket"]
    end

    subgraph "Kubernetes Resources"
        ING["Ingress"]
        APP["vProfile App"]
        DB["MySQL"]
        MC["Memcached"]
        RMQ["RabbitMQ"]
    end

    subgraph "End Users"
        User["Users"]
    end

    IC --&gt; IRep
    AC --&gt; ARep

    IRep --&gt; IGH
    IGH --&gt; TInit
    TInit --&gt; TPlan
    TPlan --&gt; TApply
    TApply --&gt; KConfig
    KConfig --&gt; NGinst

    ARep --&gt; AGH
    AGH --&gt; Test
    Test --&gt; CS
    CS --&gt; SQ
    SQ --&gt; DocBuild
    DocBuild --&gt; DocPush
    DocPush --&gt; HDepl

    TApply --&gt; VPC
    TApply --&gt; EKS
    TApply --&gt; S3

    DocPush --&gt; ECR

    HDepl --&gt; ING
    HDepl --&gt; APP
    HDepl --&gt; DB
    HDepl --&gt; MC
    HDepl --&gt; RMQ

    NGinst --&gt; ING
    ING --&gt; APP

    User --&gt; ING
</code></pre>
<p>The workflow proceeds as follows:</p>
<ol>
<li>Infrastructure changes trigger the Terraform workflow to create or update AWS resources</li>
<li>Application changes trigger the application workflow for testing, building, and deployment</li>
<li>The application is deployed to the EKS cluster created by the infrastructure pipeline</li>
<li>Users access the application through the NGINX ingress controller</li>
</ol>
<h2 id="heading-security-and-best-practices">Security and Best Practices</h2>
<p>Throughout this project, we've implemented several security best practices:</p>
<ol>
<li><strong>Least Privilege</strong>: Using IAM roles with minimal permissions</li>
<li><strong>Infrastructure Segregation</strong>: Separating public and private subnets</li>
<li><strong>Secrets Management</strong>: Storing sensitive information in GitHub Secrets</li>
<li><strong>Image Security</strong>: Using multi-stage builds to minimize attack surface</li>
<li><strong>Code Quality</strong>: Implementing automated testing and code analysis</li>
</ol>
<h2 id="heading-challenges-and-learning-outcomes">Challenges and Learning Outcomes</h2>
<p>Building this project presented several interesting challenges:</p>
<ol>
<li><strong>Terraform State Management</strong>: Learning to manage state files securely using S3 backends</li>
<li><strong>Kubernetes Networking</strong>: Understanding the intricacies of Kubernetes ingress and service discovery</li>
<li><strong>CI/CD Integration</strong>: Connecting multiple pipelines with appropriate dependencies</li>
<li><strong>Container Optimization</strong>: Creating efficient Docker images using multi-stage builds</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The vProfile project demonstrates a comprehensive implementation of modern DevOps principles and practices. By leveraging Infrastructure as Code, containerization, Kubernetes orchestration, and CI/CD pipelines, we've created a robust, scalable, and easily maintainable deployment pipeline.</p>
<p>This approach offers several key benefits:</p>
<ol>
<li><strong>Speed</strong>: Automated deployments reduce time-to-market</li>
<li><strong>Consistency</strong>: Infrastructure and application deployments are reproducible</li>
<li><strong>Scalability</strong>: Kubernetes allows for easy scaling of application components</li>
<li><strong>Maintainability</strong>: Code-based infrastructure and pipelines simplify maintenance</li>
<li><strong>Resilience</strong>: Multi-AZ deployment ensures high availability</li>
</ol>
<p>The knowledge and skills gained from building this project provide a solid foundation for implementing similar architectures in other enterprise contexts. Understanding these DevOps principles from first principles enables you to adapt these patterns to various cloud platforms and application architectures.</p>
<h2 id="heading-future-enhancements">Future Enhancements</h2>
<p>While the current implementation is robust, several enhancements could further improve the system:</p>
<ol>
<li><strong>Multiple Environments</strong>: Extend the infrastructure to support development, staging, and production</li>
<li><strong>Advanced Monitoring</strong>: Implement comprehensive monitoring with Prometheus and Grafana</li>
<li><strong>Service Mesh</strong>: Add Istio or Linkerd for advanced traffic management and security</li>
<li><strong>GitOps</strong>: Implement ArgoCD or Flux for GitOps-based continuous deployment</li>
<li><strong>Automated Testing</strong>: Add more comprehensive integration and end-to-end tests</li>
</ol>
<p>By continuing to evolve this architecture, we can create an even more powerful and flexible DevOps platform.</p>
<hr />
<p><em>This project was developed as a major project for college by Nikhil Mishra. The source code is available in the <a target="_blank" href="https://github.com/kaalpanikh/iac-vprofile">iac-vprofile</a> and <a target="_blank" href="https://github.com/kaalpanikh/vprofile-action">vprofile-action</a> repositories.</em></p>
]]></content:encoded></item><item><title><![CDATA[AushadhiAI: AI-Powered Prescription Analysis System]]></title><description><![CDATA[Introduction
AushadhiAI is an innovative solution that leverages artificial intelligence and Azure Computer Vision to decode doctors' handwritten prescriptions, making medication information accessible and understandable for patients. This technical ...]]></description><link>https://nikhilmishra.xyz/aushadhiai</link><guid isPermaLink="true">https://nikhilmishra.xyz/aushadhiai</guid><category><![CDATA[Azure]]></category><category><![CDATA[AI]]></category><category><![CDATA[Computer Vision]]></category><category><![CDATA[healthcare]]></category><category><![CDATA[OCR ]]></category><category><![CDATA[Python]]></category><category><![CDATA[FastAPI]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[ML]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Sat, 08 Mar 2025 04:04:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741406449779/41091c67-bc20-42be-ad85-45178e6bd1a0.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>AushadhiAI is an innovative solution that leverages artificial intelligence and Azure Computer Vision to decode doctors' handwritten prescriptions, making medication information accessible and understandable for patients. This technical deep dive explores the architecture, implementation details, and key features of the AushadhiAI system.</p>
<h2 id="heading-system-architecture">System Architecture</h2>
<p>AushadhiAI employs a modern, scalable architecture that separates frontend and backend concerns while leveraging cloud services for advanced AI capabilities.</p>
<pre><code class="lang-mermaid">flowchart TD
    User[User] --&gt;|Uploads Prescription| Frontend[Frontend UI]
    Frontend --&gt;|HTTP Request| Backend[Backend API]
    Backend --&gt;|Image Analysis| AzureCV[Azure Computer Vision]
    Backend --&gt;|Medication Lookup| MedDB[(Medication Database)]
    AzureCV --&gt;|OCR Results| Backend
    Backend --&gt;|JSON Response| Frontend
    Frontend --&gt;|Display Results| User
</code></pre>
<h3 id="heading-key-components">Key Components</h3>
<ol>
<li><strong>Frontend</strong>: HTML, CSS, and JavaScript providing a responsive user interface</li>
<li><strong>Backend API</strong>: FastAPI application providing RESTful endpoints for image analysis  </li>
<li><strong>Azure Vision Service</strong>: Cloud-based OCR through Azure Computer Vision API</li>
<li><strong>Medication Service</strong>: Logic for identifying medications from extracted text</li>
<li><strong>Medication Database</strong>: JSON-based storage of medication information</li>
</ol>
<h2 id="heading-technical-implementation-details">Technical Implementation Details</h2>
<h3 id="heading-backend-system">Backend System</h3>
<p>The backend is built with FastAPI, a modern, high-performance web framework for building APIs with Python. It provides several key endpoints:</p>
<pre><code class="lang-mermaid">classDiagram
    class FastAPIApp {
        +read_root()
        +health_check()
        +get_medications()
        +get_medication_details(name)
        +analyze_prescription(file)
        +get_sample_medications()
        +check_azure()
    }

    class OCRService {
        +extract_text(image_bytes)
    }

    class MedicationService {
        +get_all_medication_names()
        +get_medication_details(name)
    }

    class AzureVisionService {
        +is_available
        +extract_text(image_bytes)
        -_fallback_extract_text(image_bytes)
    }

    class RxNormService {
        +lookup_medication(name)
    }

    FastAPIApp --&gt; OCRService
    FastAPIApp --&gt; MedicationService
    FastAPIApp --&gt; AzureVisionService
    FastAPIApp --&gt; RxNormService
</code></pre>
<h4 id="heading-key-backend-components">Key Backend Components:</h4>
<ol>
<li><strong>app.py</strong>: Main FastAPI application that defines all endpoints</li>
<li><strong>services/azure_vision_service.py</strong>: Handles communication with Azure Computer Vision API</li>
<li><strong>services/ocr_service.py</strong>: Manages text extraction from images with fallback mechanisms</li>
<li><strong>services/med_service.py</strong>: Identifies medications from extracted text</li>
<li><strong>services/rxnorm_service.py</strong>: Integrates with RxNorm for standardized medication information</li>
</ol>
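<p>The medication-matching step in <code>med_service.py</code> reduces to a simple idea: scan the OCR output for known medication names, tolerating case differences. A minimal, self-contained sketch of that logic (the function name and the inline "database" are illustrative, not the project's actual API or JSON schema):</p>
<pre><code class="lang-python"># Minimal sketch of medication matching over OCR text.
# MED_DB stands in for the project's JSON medication database;
# the entries and fields here are illustrative.
MED_DB = {
    "paracetamol": {"dosage": "500 mg", "use": "pain/fever"},
    "amoxicillin": {"dosage": "250 mg", "use": "antibiotic"},
}

def find_medications(ocr_text: str) -> list[dict]:
    """Return details for every known medication mentioned in the text."""
    text = ocr_text.lower()
    matches = []
    for name, details in MED_DB.items():
        if name in text:
            matches.append({"name": name, **details})
    return matches

results = find_medications("Rx: Paracetamol 500mg twice daily")
print([m["name"] for m in results])
</code></pre>
<p>A real implementation would add fuzzy matching to tolerate OCR errors in handwritten text, but the case-insensitive containment check captures the core lookup.</p>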
<h3 id="heading-prescription-analysis-process">Prescription Analysis Process</h3>
<p>The prescription analysis workflow involves several steps:</p>
<pre><code class="lang-mermaid">sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    participant AzureVision
    participant MedService

    User-&gt;&gt;Frontend: Upload prescription image
    Frontend-&gt;&gt;Backend: POST /api/analyze
    Backend-&gt;&gt;AzureVision: extract_text(image)
    alt Azure available
        AzureVision--&gt;&gt;Backend: OCR text results
    else Azure unavailable
        AzureVision--&gt;&gt;Backend: Use fallback OCR
    end
    Backend-&gt;&gt;MedService: Find medications in text
    MedService--&gt;&gt;Backend: Medication matches
    Backend--&gt;&gt;Frontend: Analysis results (JSON)
    Frontend--&gt;&gt;User: Display medication information
</code></pre>
<h3 id="heading-frontend-implementation">Frontend Implementation</h3>
<p>The frontend provides an intuitive interface for users to upload and analyze prescriptions:</p>
<pre><code class="lang-mermaid">flowchart LR
    subgraph Frontend
        UI[User Interface] --&gt; Upload
        Upload --&gt; Analysis
        Analysis --&gt; Results
    end

    subgraph Components
        Upload[Upload Component]
        Analysis[Analysis Process]
        Results[Results Display]
    end

    UI --&gt;|User Interaction| Components
</code></pre>
<p>The interface includes:</p>
<ol>
<li><strong>Upload Section</strong>: For prescription image upload</li>
<li><strong>Processing Visualization</strong>: Shows analysis progress</li>
<li><strong>Results Display</strong>: Presents identified medications and details</li>
<li><strong>Responsive Design</strong>: Works across desktop and mobile devices</li>
</ol>
<h2 id="heading-deployment-architecture">Deployment Architecture</h2>
<p>AushadhiAI is deployed using a modern cloud-based infrastructure:</p>
<pre><code class="lang-mermaid">flowchart TD
    subgraph "Frontend Deployment"
        GitHubPages[GitHub Pages]
    end

    subgraph "CI/CD Pipeline"
        GitHubActions[GitHub Actions]
    end

    subgraph "Backend Services"
        ElasticBeanstalk[AWS Elastic Beanstalk]
        ECR[Amazon ECR]
        CloudWatch[AWS CloudWatch]
    end

    subgraph "External Services"
        Azure[Azure Computer Vision]
    end

    GitHubActions --&gt;|Deploy Frontend| GitHubPages
    GitHubActions --&gt;|Deploy Backend| ECR
    ECR --&gt;|Container Image| ElasticBeanstalk
    ElasticBeanstalk --&gt;|Monitoring| CloudWatch
    ElasticBeanstalk --&gt;|API Calls| Azure
    GitHubPages --&gt;|API Requests| ElasticBeanstalk
</code></pre>
<h3 id="heading-deployment-components">Deployment Components:</h3>
<ol>
<li><strong>Frontend</strong>: Hosted on GitHub Pages (static hosting)</li>
<li><strong>Backend</strong>: Containerized with Docker and deployed on AWS Elastic Beanstalk</li>
<li><strong>CI/CD</strong>: Automated deployment using GitHub Actions</li>
<li><strong>Monitoring</strong>: AWS CloudWatch for performance and error tracking</li>
</ol>
<h2 id="heading-system-features">System Features</h2>
<h3 id="heading-1-robust-ocr-capabilities">1. Robust OCR Capabilities</h3>
<p>The system uses Azure Computer Vision API for high-quality OCR, with a fallback mechanism for offline operation:</p>
<pre><code class="lang-mermaid">flowchart TD
    Start[Receive Image] --&gt;|Process| AzureCheck{Azure Available?}
    AzureCheck --&gt;|Yes| AzureOCR[Use Azure Vision API]
    AzureCheck --&gt;|No| LocalOCR[Use Local OCR Fallback]
    AzureOCR --&gt; TextExtraction[Extract Text]
    LocalOCR --&gt; TextExtraction
    TextExtraction --&gt; MedicationIdentification[Identify Medications]
</code></pre>
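<p>The availability check and fallback branch above can be sketched in a few lines. This is an illustrative skeleton: the cloud call and the local OCR are stubbed with placeholder strings so the control flow is the focus, and the real service would check connectivity, not just configuration:</p>

```python
from typing import Optional

class AzureVisionService:
    """Sketch of the Azure-or-fallback decision shown in the flowchart."""

    def __init__(self, api_key: Optional[str]):
        self.api_key = api_key

    @property
    def is_available(self) -> bool:
        # The real service would also verify connectivity, not just config
        return self.api_key is not None

    def extract_text(self, image_bytes: bytes) -> str:
        if self.is_available:
            return self._azure_extract_text(image_bytes)
        return self._fallback_extract_text(image_bytes)

    def _azure_extract_text(self, image_bytes: bytes) -> str:
        # Placeholder for the Azure Computer Vision Read API round-trip
        return "azure-ocr-result"

    def _fallback_extract_text(self, image_bytes: bytes) -> str:
        # Local best-effort OCR used when the cloud service is unreachable
        return "fallback-ocr-result"
```

<p>Because callers only ever invoke <code>extract_text</code>, the degraded path is invisible to the rest of the pipeline, which is what lets the system keep working offline.</p>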
<h3 id="heading-2-medication-identification">2. Medication Identification</h3>
<p>The system identifies medications using a combination of techniques:</p>
<pre><code class="lang-mermaid">flowchart LR
    OCRText[OCR Text] --&gt; Preprocessing[Text Preprocessing]
    Preprocessing --&gt; NameMatching[Medication Name Matching]
    NameMatching --&gt; Validation[Validation]
    Validation --&gt; DosageExtraction[Dosage Information Extraction]
    DosageExtraction --&gt; Results[Medication Results]
</code></pre>
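<p>A toy version of the matching stage, assuming a known-medication list and a simple substring-plus-regex pass. The production matcher likely normalises text more aggressively and scores fuzzy matches against the RxNorm data; the medication names here are sample data:</p>

```python
import re

# Sample data; the real list comes from the medication database
KNOWN_MEDICATIONS = {"amoxicillin", "paracetamol", "ibuprofen"}

def find_medications(ocr_text: str):
    """Return (name, dosage) pairs found in free-form OCR text."""
    results = []
    for line in ocr_text.lower().splitlines():
        for med in KNOWN_MEDICATIONS:
            if med in line:
                # Grab an adjacent dosage like "500 mg" if one is present
                m = re.search(r"(\d+\s*(?:mg|ml|mcg))", line)
                results.append((med, m.group(1) if m else None))
    return results
```

<p>Even this naive pass illustrates why preprocessing matters: lowercasing and line splitting happen before any matching, so OCR case noise never reaches the dosage extractor.</p>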
<h3 id="heading-3-error-handling-and-resilience">3. Error Handling and Resilience</h3>
<p>The system is designed with robust error handling:</p>
<pre><code class="lang-mermaid">flowchart TD
    Request[API Request] --&gt; Validation{Input Valid?}
    Validation --&gt;|Yes| Processing[Process Request]
    Validation --&gt;|No| Error400[Return 400 Error]
    Processing --&gt; ServiceCheck{Services Available?}
    ServiceCheck --&gt;|Yes| SuccessfulResponse[Return Response]
    ServiceCheck --&gt;|No| FallbackMechanism[Use Fallback]
    FallbackMechanism --&gt; LimitedResponse[Return Limited Response]
</code></pre>
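<p>The validate-then-degrade path in the flowchart condenses to a small decision function. Status codes and field names below are illustrative, not the API's actual response schema:</p>

```python
ALLOWED_TYPES = {"image/jpeg", "image/png"}

def handle_analyze_request(content_type: str, image_bytes: bytes,
                           azure_available: bool) -> dict:
    # Input validation: reject bad requests before doing any work
    if content_type not in ALLOWED_TYPES or not image_bytes:
        return {"status": 400, "error": "unsupported or empty upload"}
    # Service check: degrade gracefully instead of failing outright
    if not azure_available:
        return {"status": 200, "source": "fallback",
                "note": "limited results: cloud OCR unavailable"}
    return {"status": 200, "source": "azure"}
```

<p>Ordering matters here: validation runs before the (cheaper to skip) service check, so invalid uploads never consume OCR capacity.</p>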
<h2 id="heading-performance-considerations">Performance Considerations</h2>
<pre><code class="lang-mermaid">graph LR
    A[Image Upload] --&gt; B[Image Preprocessing]
    B --&gt; C[OCR Processing]
    C --&gt; D[Medication Identification]
    D --&gt; E[Response Generation]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style C fill:#bbf,stroke:#333,stroke-width:4px
    style D fill:#bbf,stroke:#333,stroke-width:4px
</code></pre>
<p>The most computationally intensive parts of the system are:</p>
<ol>
<li><strong>OCR Processing</strong>: Handled by Azure Computer Vision to offload processing</li>
<li><strong>Medication Identification</strong>: Optimized with efficient text matching algorithms</li>
<li><strong>Image Preprocessing</strong>: Used to enhance OCR accuracy</li>
</ol>
<h2 id="heading-security-implementation">Security Implementation</h2>
<pre><code class="lang-mermaid">flowchart TD
    subgraph "Security Measures"
        CORS[CORS Policy]
        InputValidation[Input Validation]
        APIKeys[API Key Management]
        ErrorHandling[Error Handling]
    end

    Request[User Request] --&gt; CORS
    CORS --&gt; InputValidation
    InputValidation --&gt; Processing[Request Processing]
    Processing --&gt; APIKeys
    APIKeys --&gt; ExternalService[External Services]
    Processing --&gt; ErrorHandling
    ErrorHandling --&gt; Response[Secure Response]
</code></pre>
<p>Key security considerations:</p>
<ol>
<li><strong>CORS Configuration</strong>: Prevents unauthorized cross-origin requests</li>
<li><strong>Input Validation</strong>: Sanitizes and validates all user input</li>
<li><strong>API Key Management</strong>: Securely stores and manages Azure API keys</li>
<li><strong>Error Handling</strong>: Prevents information leakage in error responses</li>
</ol>
<h2 id="heading-future-enhancements">Future Enhancements</h2>
<p>The system is designed for extensibility, with planned enhancements:</p>
<pre><code class="lang-mermaid">timeline
    title Development Roadmap
    Phase 1 : Basic OCR and Medication Identification
    Phase 2 : Detailed Medication Information
    Phase 3 : User Accounts and Prescription History
    Phase 4 : Mobile Application Development
    Phase 5 : Pharmacy System Integration
    Phase 6 : Multi-language Support
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>AushadhiAI represents a powerful application of AI technology to solve real-world healthcare challenges. By combining Azure Computer Vision's advanced OCR capabilities with custom medication identification algorithms, the system effectively bridges the gap between handwritten prescriptions and patient understanding.</p>
<p>The architecture balances performance, reliability, and user experience, with careful consideration given to fallback mechanisms that ensure the system remains functional even when cloud services are unavailable.</p>
<p>Through its modern deployment architecture and thoughtful technical implementation, AushadhiAI demonstrates how cloud-native applications can deliver meaningful solutions to everyday problems. </p>
]]></content:encoded></item><item><title><![CDATA[From Logs to Insights: Unleashing the Power of Nginx Log Analysis]]></title><description><![CDATA[Introduction
In the world of web server operations, log files represent the ground truth of what's happening on your servers. Every request, response, error, and interaction is meticulously recorded, creating a treasure trove of operational intellige...]]></description><link>https://nikhilmishra.xyz/nginx-log-analyzer</link><guid isPermaLink="true">https://nikhilmishra.xyz/nginx-log-analyzer</guid><category><![CDATA[Devops]]></category><category><![CDATA[Logs]]></category><category><![CDATA[nginx]]></category><category><![CDATA[log analysis]]></category><category><![CDATA[infrastructure]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Thu, 06 Mar 2025 06:34:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741242816174/3f03dcfc-1887-4959-a518-19a1eb42e87e.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In the world of web server operations, log files represent the ground truth of what's happening on your servers. Every request, response, error, and interaction is meticulously recorded, creating a treasure trove of operational intelligence waiting to be unlocked. Yet, the sheer volume and cryptic format of these logs make them inaccessible without proper analysis techniques.</p>
<p>This blog post explores the fundamental principles behind log analysis, specifically focusing on Nginx access logs. By taking a first principles approach, we'll deconstruct not just how to analyze logs, but why certain patterns and methodologies yield valuable insights that can transform raw data into actionable information.</p>
<h2 id="heading-understanding-web-server-logs-from-first-principles">Understanding Web Server Logs from First Principles</h2>
<h3 id="heading-the-anatomy-of-a-log-entry">The Anatomy of a Log Entry</h3>
<p>At its most fundamental level, a web server log entry is an event record that captures a specific interaction between a client and a server. Let's break down the standard Nginx combined log format from first principles:</p>
<pre><code><span class="hljs-number">127.0</span><span class="hljs-number">.0</span><span class="hljs-number">.1</span> - frank [<span class="hljs-number">10</span>/Oct/<span class="hljs-number">2023</span>:<span class="hljs-number">13</span>:<span class="hljs-number">55</span>:<span class="hljs-number">36</span> +<span class="hljs-number">0000</span>] <span class="hljs-string">"GET /index.html HTTP/1.1"</span> <span class="hljs-number">200</span> <span class="hljs-number">2326</span> <span class="hljs-string">"http://example.com/start.html"</span> <span class="hljs-string">"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"</span>
</code></pre><p>Each component of this log entry serves a specific purpose:</p>
<pre><code class="lang-mermaid">graph TD
    A[Complete Log Entry] --&gt; B[Client Identity]
    A --&gt; C[Authentication]
    A --&gt; D[Timestamp]
    A --&gt; E[HTTP Request]
    A --&gt; F[Response Status]
    A --&gt; G[Response Size]
    A --&gt; H[Referrer]
    A --&gt; I[User Agent]

    B --&gt; B1[IP Address]
    C --&gt; C1[Basic Auth User]
    D --&gt; D1[Request Time]
    E --&gt; E1[HTTP Method]
    E --&gt; E2[Resource Path]
    E --&gt; E3[Protocol]
    F --&gt; F1[HTTP Status Code]
    G --&gt; G1[Bytes Transferred]
    H --&gt; H1[Referring URL]
    I --&gt; I1[Browser/Client Info]

    style A fill:#f96,stroke:#333,stroke-width:4px
</code></pre>
<p>This structured format represents a deliberate design choice: each field is positioned to capture a specific aspect of the HTTP transaction, creating a comprehensive record that can be parsed and analyzed systematically.</p>
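<p>Because the format is positional and quote-delimited, a single regular expression can recover every field in the diagram. A sketch in Python (one possible pattern; edge cases like quotes inside user agents are ignored here):</p>

```python
import re

# Each named group maps to one node in the diagram above
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

entry = ('127.0.0.1 - frank [10/Oct/2023:13:55:36 +0000] '
         '"GET /index.html HTTP/1.1" 200 2326 '
         '"http://example.com/start.html" '
         '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"')

fields = LOG_PATTERN.match(entry).groupdict()
# fields["ip"] == "127.0.0.1", fields["status"] == "200"
```

<p>Once parsed into named fields like this, every analysis in the rest of the post reduces to counting values of one field.</p>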
<h2 id="heading-the-information-theory-of-logs">The Information Theory of Logs</h2>
<p>From an information theory perspective, logs represent a form of data compression. Each log entry encodes a complex event (an HTTP transaction with multiple dimensions) into a single line of text. This compression is lossy—not every aspect of the transaction is recorded—but it preserves the most critical information needed for operational analysis.</p>
<p>The challenge lies in extracting and aggregating this information efficiently. This is where our Nginx Log Analyser comes in.</p>
<h2 id="heading-architectural-design-from-first-principles">Architectural Design from First Principles</h2>
<p>The architecture of our log analysis tool follows a pipeline pattern, where data flows through a series of transformations:</p>
<pre><code class="lang-mermaid">flowchart LR
    A[Raw Log File] --&gt; B[Extraction]
    B --&gt; C[Aggregation]
    C --&gt; D[Sorting]
    D --&gt; E[Filtering]
    E --&gt; F[Presentation]

    style A fill:#bbf,stroke:#333,stroke-width:2px
    style F fill:#bfb,stroke:#333,stroke-width:2px
</code></pre>
<p>This pipeline architecture has several inherent advantages:</p>
<ol>
<li>Each stage has a single responsibility</li>
<li>Stages can be optimized independently</li>
<li>The process can be parallelized if needed</li>
<li>New transformations can be added without changing others</li>
</ol>
<p>Let's explore each component of this architecture in detail.</p>
<h2 id="heading-data-extraction-the-foundation-of-analysis">Data Extraction: The Foundation of Analysis</h2>
<p>The first challenge in log analysis is extracting structured data from semi-structured text. Our tool uses the <code>awk</code> command to parse the log format:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Extract IP addresses</span>
awk <span class="hljs-string">'{print $1}'</span> <span class="hljs-string">"<span class="hljs-variable">$LOG_FILE</span>"</span>

<span class="hljs-comment"># Extract request paths</span>
awk -F<span class="hljs-string">'"'</span> <span class="hljs-string">'{print $2}'</span> <span class="hljs-string">"<span class="hljs-variable">$LOG_FILE</span>"</span> | awk <span class="hljs-string">'{print $2}'</span>

<span class="hljs-comment"># Extract status codes</span>
awk <span class="hljs-string">'{print $9}'</span> <span class="hljs-string">"<span class="hljs-variable">$LOG_FILE</span>"</span>

<span class="hljs-comment"># Extract user agents</span>
awk -F<span class="hljs-string">'"'</span> <span class="hljs-string">'{print $6}'</span> <span class="hljs-string">"<span class="hljs-variable">$LOG_FILE</span>"</span>
</code></pre>
<p>This approach demonstrates an important principle: effective data extraction requires understanding the structure of your data. By using field separators (<code>-F'"'</code>) and positional arguments, we can precisely target specific components of the log entry.</p>
<h2 id="heading-aggregation-and-frequency-analysis">Aggregation and Frequency Analysis</h2>
<p>Once we've extracted the raw fields, we need to count occurrences to understand patterns. The <code>sort | uniq -c</code> pipeline is a powerful pattern for frequency analysis:</p>
<pre><code class="lang-bash">awk <span class="hljs-string">'{print $1}'</span> <span class="hljs-string">"<span class="hljs-variable">$LOG_FILE</span>"</span> | sort | uniq -c
</code></pre>
<p>This pattern demonstrates a key principle in data analysis: transforming individual observations into aggregate statistics reveals patterns that are invisible at the individual level.</p>
<p>From a statistical perspective, this is a form of frequency distribution analysis—we're creating a histogram of occurrences to identify the most common values.</p>
<h2 id="heading-data-sorting-and-selection">Data Sorting and Selection</h2>
<p>After aggregation, we need to prioritize the most significant findings:</p>
<pre><code class="lang-bash">sort -rn | head -5
</code></pre>
<p>This simple command pair embodies an important analytical principle: ranking and selection help manage information overload. By sorting numerically in reverse order (<code>-rn</code>) and limiting results (<code>head -5</code>), we focus attention on the most statistically significant patterns.</p>
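<p>For comparison, the whole <code>sort | uniq -c | sort -rn | head -5</code> chain collapses to one call in Python, since <code>collections.Counter</code> performs the aggregation and ranking in a single structure (sample IPs are made up):</p>

```python
from collections import Counter

# Toy extracted field values, as awk '{print $1}' would emit them
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1", "10.0.0.2"]

# Aggregate frequencies and rank, equivalent to sort|uniq -c|sort -rn|head -5
top = Counter(ips).most_common(5)
# top == [("10.0.0.1", 3), ("10.0.0.2", 2), ("10.0.0.3", 1)]
```

<p>The shell version wins on memory for huge inputs; the Python version wins once you want to post-process the counts programmatically.</p>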
<h2 id="heading-the-execution-flow-sequence-and-processing">The Execution Flow: Sequence and Processing</h2>
<p>The complete processing flow of our tool follows this sequence:</p>
<pre><code class="lang-mermaid">sequenceDiagram
    participant User
    participant Script
    participant LogFile
    participant AWK
    participant Sort
    participant Uniq
    participant Head

    User-&gt;&gt;Script: Execute with log file
    Script-&gt;&gt;Script: Validate arguments
    Script-&gt;&gt;LogFile: Read log file

    Note over Script,Head: IP Address Analysis
    Script-&gt;&gt;AWK: Extract IP addresses
    AWK-&gt;&gt;Sort: Pipe extracted IPs
    Sort-&gt;&gt;Uniq: Count unique IPs
    Uniq-&gt;&gt;Sort: Sort by frequency
    Sort-&gt;&gt;Head: Select top 5
    Head-&gt;&gt;Script: Return results
    Script-&gt;&gt;User: Display top IPs

    Note over Script,Head: Request Path Analysis
    Script-&gt;&gt;AWK: Extract request paths
    AWK-&gt;&gt;AWK: Process field 2
    AWK-&gt;&gt;Sort: Pipe extracted paths
    Sort-&gt;&gt;Uniq: Count unique paths
    Uniq-&gt;&gt;Sort: Sort by frequency
    Sort-&gt;&gt;Head: Select top 5
    Head-&gt;&gt;Script: Return results
    Script-&gt;&gt;User: Display top paths

    Note over Script,Head: Status Code Analysis
    Script-&gt;&gt;AWK: Extract status codes
    AWK-&gt;&gt;Sort: Pipe extracted codes
    Sort-&gt;&gt;Uniq: Count unique codes
    Uniq-&gt;&gt;Sort: Sort by frequency
    Sort-&gt;&gt;Head: Select top 5
    Head-&gt;&gt;Script: Return results
    Script-&gt;&gt;User: Display top status codes

    Note over Script,Head: User Agent Analysis
    Script-&gt;&gt;AWK: Extract user agents
    AWK-&gt;&gt;Sort: Pipe extracted agents
    Sort-&gt;&gt;Uniq: Count unique agents
    Uniq-&gt;&gt;Sort: Sort by frequency
    Sort-&gt;&gt;Head: Select top 5
    Head-&gt;&gt;Script: Return results
    Script-&gt;&gt;User: Display top user agents
</code></pre>
<p>This sequence diagram reveals an important architectural pattern: the same data transformation pipeline is applied to different aspects of the log data, creating a consistent analytical approach across dimensions.</p>
<h2 id="heading-statistical-insights-from-log-analysis">Statistical Insights from Log Analysis</h2>
<p>The output of our analysis provides four distinct views into server activity:</p>
<ol>
<li><strong>Top IP Addresses</strong>: Identifies potential heavy users, bots, or attackers</li>
<li><strong>Top Requested Paths</strong>: Reveals the most popular content or potential targets</li>
<li><strong>Top Response Codes</strong>: Indicates the overall health and common issues</li>
<li><strong>Top User Agents</strong>: Shows which clients/browsers are most common</li>
</ol>
<p>These four dimensions create a multi-faceted view of server activity:</p>
<pre><code class="lang-mermaid">graph TD
    A[Nginx Log Analysis] --&gt; B[Traffic Sources]
    A --&gt; C[Content Popularity]
    A --&gt; D[Server Health]
    A --&gt; E[Client Demographics]

    B --&gt; B1[Top IP Addresses]
    C --&gt; C1[Top Requested Paths]
    D --&gt; D1[Top Response Codes]
    E --&gt; E1[Top User Agents]

    style A fill:#f96,stroke:#333,stroke-width:4px
</code></pre>
<p>From a data science perspective, this approach demonstrates the power of dimensional analysis—examining the same dataset through different lenses to reveal complementary insights.</p>
<h2 id="heading-the-mathematics-of-log-analysis">The Mathematics of Log Analysis</h2>
<p>The underlying mathematical principles of our analysis are based on frequency counting and ranking. If we consider the set of all log entries L, and a function f that extracts a specific field (such as IP address) from each entry, the counting process can be expressed as:</p>
<p>For any value v in the range of f, the count C(v) is:</p>
<p>C(v) = |{l ∈ L : f(l) = v}|</p>
<p>We then sort these counts in descending order and select the top k values:</p>
<p>TopK(f, L, k) = First k elements of Sort({(v, C(v)) : v in range of f(L)})</p>
<p>This mathematical formulation reveals that our seemingly simple shell commands are implementing a sophisticated statistical aggregation and ranking algorithm.</p>
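<p>The TopK formulation translates almost verbatim into code: <code>f</code> is the field extractor, <code>Counter</code> computes C(v), and <code>most_common</code> performs the sort-and-select. A sketch with made-up log entries:</p>

```python
from collections import Counter

def top_k(f, entries, k):
    """TopK(f, L, k): the k most frequent values of f over entries."""
    return Counter(f(entry) for entry in entries).most_common(k)

logs = [
    {"ip": "1.1.1.1", "status": 200},
    {"ip": "2.2.2.2", "status": 404},
    {"ip": "1.1.1.1", "status": 200},
]
top_k(lambda e: e["ip"], logs, 1)  # [("1.1.1.1", 2)]
```

<p>Note that the same <code>top_k</code> works for every dimension of the analysis; only the extractor <code>f</code> changes, which mirrors how the script reuses one pipeline with different <code>awk</code> expressions.</p>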
<h2 id="heading-performance-considerations-the-time-space-tradeoff">Performance Considerations: The Time-Space Tradeoff</h2>
<p>The Unix pipeline architecture of our tool makes an important tradeoff: data streams through each stage, and when the input exceeds available memory, <code>sort</code> spills to temporary files on disk. This keeps memory usage bounded at the cost of extra I/O and CPU time, which is an appropriate tradeoff for most log files, as it allows analysis of logs far too large to fit in memory.</p>
<p>The time complexity of our approach is approximately:</p>
<ul>
<li>Extraction: O(n) where n is the number of log entries</li>
<li>Sorting: O(n log n)</li>
<li>Counting: O(n)</li>
<li>Final sorting: O(k log k) where k is the number of unique values</li>
<li>Selection: O(k)</li>
</ul>
<p>The dominant factor is the O(n log n) sorting step, which is necessary for the frequency analysis.</p>
<h2 id="heading-beyond-basic-analysis-a-path-forward">Beyond Basic Analysis: A Path Forward</h2>
<p>From our first principles analysis, several natural extensions emerge:</p>
<pre><code class="lang-mermaid">mindmap
  root((Log Analysis))
    Temporal Analysis
      Request patterns by hour
      Daily traffic trends
      Session duration
    Geographic Insights
      IP geolocation
      Regional traffic patterns
    Performance Metrics
      Response time analysis
      Bandwidth consumption
      Cache effectiveness
    Security Analysis
      Attack pattern detection
      Anomaly identification
      Bot traffic filtering
    Content Analysis
      Path structure mapping
      Content popularity by type
      Error frequency by section
    User Behavior
      Session tracking
      Navigation paths
      Conversion funnels
</code></pre>
<p>Each of these extensions builds upon the fundamental principles established in our base tool but adds new dimensions of analysis that can provide deeper insights.</p>
<h2 id="heading-shell-scripting-as-a-data-science-tool">Shell Scripting as a Data Science Tool</h2>
<p>It's worth noting that our approach uses basic Unix shell commands to perform what would typically be considered data science tasks:</p>
<ul>
<li>Data extraction (awk)</li>
<li>Aggregation (sort, uniq)</li>
<li>Sorting and ranking (sort -rn)</li>
<li>Selection (head)</li>
</ul>
<p>This demonstrates an important principle: powerful analysis doesn't always require complex tools. The Unix philosophy of "small tools that do one thing well" creates a flexible analytical framework when these tools are combined effectively.</p>
<h2 id="heading-conclusion-from-logs-to-insight">Conclusion: From Logs to Insight</h2>
<p>By approaching log analysis from first principles, we've seen how a seemingly simple task—counting occurrences in a text file—can reveal sophisticated patterns and insights about web server operation. Our Nginx Log Analyser demonstrates how fundamental computational and statistical techniques can transform raw data into actionable intelligence.</p>
<p>The power of this approach lies not just in what it can tell us about our servers today, but in how it establishes a foundation for more sophisticated analysis. By understanding the basic principles of extraction, aggregation, and ranking, we can build increasingly powerful analytical tools that help us understand and optimize our web infrastructure.</p>
<p>In a world increasingly driven by data, the ability to extract meaningful patterns from raw logs is not just a technical skill—it's a competitive advantage. By mastering these fundamental techniques, we unlock the valuable information hidden in plain sight in our server logs.</p>
<h2 id="heading-about-the-author">About the Author</h2>
<p>I'm a DevOps engineer and systems architect passionate about applying first principles thinking to infrastructure analysis and optimization. This project is part of the roadmap.sh learning path for server administration and monitoring.</p>
<hr />
<p><em>For more information about log analysis best practices, visit <a target="_blank" href="https://roadmap.sh/projects/nginx-log-analyser">roadmap.sh/projects/nginx-log-analyser</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Log Management & Archiving: A First Principles Deep Dive]]></title><description><![CDATA[Introduction
Log files are the silent sentinels of our systems—recording events, errors, and activities that are crucial for troubleshooting, security auditing, and compliance. Yet, they present a unique challenge: they're essential to keep, but they...]]></description><link>https://nikhilmishra.xyz/log-archive-tool</link><guid isPermaLink="true">https://nikhilmishra.xyz/log-archive-tool</guid><category><![CDATA[log management]]></category><category><![CDATA[Devops]]></category><category><![CDATA[SRE]]></category><category><![CDATA[#cybersecurity]]></category><category><![CDATA[System Design]]></category><category><![CDATA[archiving legacy data]]></category><category><![CDATA[infrastructure]]></category><category><![CDATA[automation]]></category><category><![CDATA[first-principle]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Wed, 05 Mar 2025 11:22:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741173599690/9c22378c-ec6d-48d9-8974-d84d5cf1f874.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Log files are the silent sentinels of our systems—recording events, errors, and activities that are crucial for troubleshooting, security auditing, and compliance. Yet, they present a unique challenge: they're essential to keep, but they consume resources and can quickly become unwieldy. In this blog post, we'll explore the fundamental principles behind efficient log management and how our Log Archive Tool addresses these challenges from first principles.</p>
<h2 id="heading-understanding-log-management-from-first-principles">Understanding Log Management from First Principles</h2>
<h3 id="heading-the-fundamental-problem-space">The Fundamental Problem Space</h3>
<p>At its core, log management is about balancing several competing concerns:</p>
<ol>
<li><strong>Information Preservation</strong>: Maintaining historical records of system activities and events</li>
<li><strong>Resource Optimization</strong>: Preventing log files from consuming excessive disk space</li>
<li><strong>Retrieval Efficiency</strong>: Ensuring logs remain accessible when needed for analysis</li>
<li><strong>Security &amp; Compliance</strong>: Safeguarding sensitive information while meeting retention requirements</li>
</ol>
<p>Rather than treating log management as a mundane operational task, we can approach it as an information lifecycle management problem with specific constraints and objectives.</p>
<h2 id="heading-the-log-lifecycle">The Log Lifecycle</h2>
<p>From a first principles perspective, logs undergo a predictable lifecycle:</p>
<pre><code class="lang-mermaid">graph LR
    A[Log Creation] --&gt; B[Active Use]
    B --&gt; C[Retention Period]
    C --&gt; D[Archive]
    D --&gt; E[Eventual Disposal]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#bfb,stroke:#333,stroke-width:2px
    style D fill:#fbb,stroke:#333,stroke-width:2px
    style E fill:#ddd,stroke:#333,stroke-width:2px
</code></pre>
<p>Understanding this lifecycle reveals why most log management solutions fail: they focus on only one part of this cycle (usually creation and active use) without considering the entire information flow.</p>
<h2 id="heading-technical-architecture-of-the-log-archive-tool">Technical Architecture of the Log Archive Tool</h2>
<p>Our Log Archive Tool approaches the problem holistically, addressing multiple stages of the log lifecycle in a single, cohesive solution:</p>
<pre><code class="lang-mermaid">flowchart TB
    A[Log Archive Tool] --&gt; B[Archiving Engine]
    A --&gt; C[Notification System]
    A --&gt; D[Secure Transfer Mechanism]

    B --&gt; B1[File Compression]
    B --&gt; B2[Timestamp Generation]
    B --&gt; B3[Archive Creation]

    C --&gt; C1[Email Notification]
    C --&gt; C2[Status Reporting]

    D --&gt; D1[SSH Authentication]
    D --&gt; D2[SCP Transfer]
    D --&gt; D3[Remote Storage]

    style A fill:#f96,stroke:#333,stroke-width:2px
    style B fill:#9cf,stroke:#333,stroke-width:2px
    style C fill:#9f9,stroke:#333,stroke-width:2px
    style D fill:#f99,stroke:#333,stroke-width:2px
</code></pre>
<p>This architecture separates concerns while ensuring that each component works in concert with the others. Let's explore each component in detail.</p>
<h2 id="heading-compression-information-theory-in-practice">Compression: Information Theory in Practice</h2>
<p>From an information theory perspective, log files are ideal candidates for compression because they often contain repeated patterns and redundant information. Our tool leverages the gzip compression algorithm through tar:</p>
<pre><code class="lang-bash">sudo tar -czf <span class="hljs-string">"<span class="hljs-variable">$ARCHIVE_DIR</span>/<span class="hljs-variable">$ARCHIVE_FILENAME</span>"</span> -C <span class="hljs-string">"<span class="hljs-variable">$LOG_DIR</span>"</span> .
</code></pre>
<p>This single line represents a sophisticated compression process:</p>
<ul>
<li>The <code>-c</code> flag creates a new archive</li>
<li>The <code>-z</code> flag applies gzip compression</li>
<li>The <code>-f</code> flag specifies the output file</li>
<li>The <code>-C</code> flag changes into the log directory first, so paths stored in the archive are relative rather than absolute</li>
</ul>
<p>By compressing log files, we typically achieve compression ratios of 5:1 to 10:1, dramatically reducing storage requirements while preserving all information. This is not just a practical benefit but a fundamental application of information theory.</p>
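<p>You can see why the ratios are so favourable with a quick experiment: gzip (the same algorithm <code>tar -z</code> applies) on a synthetic log made of one repeated line. The line below is made up; real logs vary more per line, which is why they land nearer the 5:1&ndash;10:1 range than this extreme case:</p>

```python
import gzip

line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326\n'
raw = (line * 10_000).encode()      # highly repetitive synthetic log
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)
# Repetition is exactly what DEFLATE exploits, so this ratio is far
# above the real-world figures quoted above
```
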
<h2 id="heading-timestamp-generation-the-importance-of-temporal-context">Timestamp Generation: The Importance of Temporal Context</h2>
<p>Time is a critical dimension in log analysis. Our tool generates unique timestamps using the command:</p>
<pre><code class="lang-bash">TIMESTAMP=$(date +<span class="hljs-string">"%Y%m%d_%H%M%S"</span>)
</code></pre>
<p>This format ensures:</p>
<ol>
<li><strong>Chronological sorting</strong>: Archives naturally sort in chronological order</li>
<li><strong>Unambiguous identification</strong>: Each archive has a unique identifier</li>
<li><strong>Human readability</strong>: The format is easily interpretable</li>
</ol>
<p>From first principles, this timestamp serves as both an identifier and metadata, embedding temporal context directly into the artifact name.</p>
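<p>The same format reproduced with Python's <code>datetime</code> makes the chronological-sorting property concrete: lexicographic order of the strings equals time order, because the fields run from most to least significant:</p>

```python
from datetime import datetime

# Same %Y%m%d_%H%M%S format the script passes to date(1)
ts = datetime(2025, 3, 5, 11, 22, 58).strftime("%Y%m%d_%H%M%S")
# ts == "20250305_112258"

earlier = datetime(2025, 3, 5, 9, 0, 0).strftime("%Y%m%d_%H%M%S")
assert earlier < ts  # plain string comparison matches time order
```
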
<h2 id="heading-the-execution-flow-a-sequence-based-approach">The Execution Flow: A Sequence-Based Approach</h2>
<p>The tool's execution flow follows a logical sequence that minimizes failure points and ensures data integrity:</p>
<pre><code class="lang-mermaid">sequenceDiagram
    participant User
    participant Script
    participant LocalSystem
    participant Email
    participant RemoteSystem

    User-&gt;&gt;Script: Execute with parameters
    Script-&gt;&gt;Script: Validate input parameters
    Script-&gt;&gt;LocalSystem: Check directory existence
    LocalSystem--&gt;&gt;Script: Directory status

    alt Directory does not exist
        Script-&gt;&gt;User: Error message
    else Directory exists
        Script-&gt;&gt;LocalSystem: Create archive directory
        Script-&gt;&gt;LocalSystem: Generate timestamp
        Script-&gt;&gt;LocalSystem: Compress logs
        LocalSystem--&gt;&gt;Script: Compression complete

        Script-&gt;&gt;Email: Send notification
        Email--&gt;&gt;Script: Email sent

        Script-&gt;&gt;Script: Parse remote destination
        Script-&gt;&gt;RemoteSystem: Transfer archive via SCP
        RemoteSystem--&gt;&gt;Script: Transfer status

        Script-&gt;&gt;User: Display completion status
    end
</code></pre>
<p>This sequence diagram reveals an important architectural principle: the tool handles error conditions early and proceeds only when preconditions are met, creating a robust execution path.</p>
<h2 id="heading-remote-backup-the-distributed-systems-approach">Remote Backup: The Distributed Systems Approach</h2>
<p>Perhaps the most sophisticated aspect of our tool is its approach to distributed storage. By leveraging SCP (Secure Copy Protocol), the tool ensures:</p>
<ol>
<li><strong>Data Integrity</strong>: Files are transferred without corruption</li>
<li><strong>Security</strong>: All data is encrypted during transit</li>
<li><strong>Authentication</strong>: SSH key-based authentication prevents unauthorized access</li>
</ol>
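<p>SCP protects data in transit; an end-to-end integrity check can be layered on top by recording a checksum before transfer and verifying it after. A sketch with a hypothetical stand-in archive (using <code>tee</code> instead of shell redirection):</p>
<pre><code class="lang-bash"># Create a stand-in archive, record its SHA-256, then verify it
echo "sample archive contents" | tee /tmp/sample_archive.tar.gz
sha256sum /tmp/sample_archive.tar.gz | tee /tmp/sample_archive.sha256
sha256sum -c /tmp/sample_archive.sha256   # reports OK when the file is intact
</code></pre>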
<p>The implementation parses the remote destination string to extract the necessary components:</p>
<pre><code class="lang-bash">REMOTE_USER=$(<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$REMOTE_BACKUP</span>"</span> | cut -d<span class="hljs-string">'@'</span> -f1)
REMOTE_HOST=$(<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$REMOTE_BACKUP</span>"</span> | cut -d<span class="hljs-string">'@'</span> -f2 | cut -d<span class="hljs-string">':'</span> -f1)
REMOTE_PATH=$(<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$REMOTE_BACKUP</span>"</span> | cut -d<span class="hljs-string">':'</span> -f2)
</code></pre>
<p>This parsing demonstrates a key principle in distributed systems: the separation of identity (user), location (host), and storage (path) as distinct components that together form a complete resource identifier.</p>
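<p>The same split can be done without subshells using shell parameter expansion; the destination string below is a hypothetical example:</p>
<pre><code class="lang-bash">REMOTE_BACKUP="backup@example.com:/var/backups/logs"

REMOTE_USER=${REMOTE_BACKUP%%@*}    # everything before the first '@'
hostpath=${REMOTE_BACKUP#*@}        # drop the "user@" prefix
REMOTE_HOST=${hostpath%%:*}         # everything before the ':'
REMOTE_PATH=${hostpath#*:}          # everything after the ':'

echo "$REMOTE_USER $REMOTE_HOST $REMOTE_PATH"
</code></pre>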
<h2 id="heading-notification-system-closing-the-feedback-loop">Notification System: Closing the Feedback Loop</h2>
<p>From a cybernetic perspective, any effective system requires feedback loops. Our notification system serves this purpose:</p>
<pre><code class="lang-bash">{
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Subject: Log Archive Notification"</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"To: <span class="hljs-variable">$EMAIL</span>"</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Content-Type: text/plain"</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">""</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Logs have been archived successfully on <span class="hljs-subst">$(date)</span>."</span>
} | sendmail -t
</code></pre>
<p>This seemingly simple notification achieves several important functions:</p>
<ol>
<li>It confirms successful operation</li>
<li>It provides an audit trail of archiving activities</li>
<li>It alerts administrators to potential issues if expected notifications don't arrive</li>
</ol>
<h2 id="heading-system-integration-architecture">System Integration Architecture</h2>
<p>When viewed as part of a broader system, our Log Archive Tool occupies a specific position in the infrastructure architecture:</p>
<pre><code class="lang-mermaid">graph TD
    A[Application Servers] --&gt; B[Log Files]
    B --&gt; C[Log Archive Tool]
    C --&gt; D[Local Archive]
    C --&gt; E[Email Notification]
    C --&gt; F[Remote Backup Storage]

    G[Monitoring System] -.-&gt; E
    H[Disaster Recovery] -.-&gt; F
    I[Compliance Audit] -.-&gt; F

    style C fill:#f96,stroke:#333,stroke-width:4px
</code></pre>
<p>This architectural view reveals how our tool serves as a critical junction between active systems and various downstream consumers of log data, from monitoring systems to compliance auditors.</p>
<h2 id="heading-security-considerations-from-first-principles">Security Considerations from First Principles</h2>
<p>Security is not an add-on but a fundamental aspect of any log management solution. From first principles, we can identify several security requirements:</p>
<ol>
<li><strong>Confidentiality</strong>: Logs often contain sensitive information</li>
<li><strong>Integrity</strong>: Logs must not be tampered with</li>
<li><strong>Availability</strong>: Logs must be accessible when needed</li>
<li><strong>Non-repudiation</strong>: The authenticity of logs must be verifiable</li>
</ol>
<p>Our tool addresses these through:</p>
<ul>
<li>Executing with elevated permissions (<code>sudo</code>) to access protected logs</li>
<li>Using SSH for secure, encrypted transfers</li>
<li>Preserving file ownership and permissions during archiving</li>
<li>Creating immutable archives with timestamps</li>
</ul>
<h2 id="heading-future-extensions-evolutionary-architecture">Future Extensions: Evolutionary Architecture</h2>
<p>From our first principles analysis, several natural extensions emerge:</p>
<pre><code class="lang-mermaid">mindmap
  root((Log Archive Tool))
    Retention Policies
      Time-based expiration
      Space-based cleanup
    Enhanced Compression
      Deduplication
      Differential archiving
    Security Enhancements
      Cryptographic signing
      Encryption at rest
    Analytics Integration
      Automated log parsing
      Anomaly detection
    Scalability
      Multi-server coordination
      Distributed processing
    Cloud Integration
      S3/Azure/GCP storage
      Serverless triggers
</code></pre>
<p>These extensions follow naturally from the core principles we've established, showing how a first principles approach enables organic, coherent system evolution.</p>
<h2 id="heading-conclusion-the-art-of-log-management">Conclusion: The Art of Log Management</h2>
<p>By approaching log management from first principles, we've transformed what might seem like a mundane operational task into a sophisticated information lifecycle management solution. Our Log Archive Tool embodies these principles through:</p>
<ol>
<li><strong>Efficiency</strong>: Minimizing resource usage through compression</li>
<li><strong>Reliability</strong>: Ensuring logs are preserved through multiple mechanisms</li>
<li><strong>Security</strong>: Protecting sensitive information throughout the process</li>
<li><strong>Usability</strong>: Simplifying operations with clear interfaces and feedback</li>
</ol>
<p>The true art of systems engineering lies not in complexity but in finding elegant solutions to fundamental problems. By understanding the first principles of log management, we've created a tool that's both powerful and remarkably simple.</p>
<h2 id="heading-about-the-author">About the Author</h2>
<p>I'm a systems engineer passionate about infrastructure automation and applying first principles thinking to DevOps challenges. This project is part of the roadmap.sh learning path for server administration.</p>
<hr />
<p><em>For more information about log management best practices, visit <a target="_blank" href="https://roadmap.sh/projects/log-archive-tool">roadmap.sh/projects/log-archive-tool</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Server Monitoring From First Principles: Building a Custom Server-Stats Tool]]></title><description><![CDATA[Introduction
In today's complex IT infrastructure, understanding server performance is not just a convenience—it's a necessity. This blog post explores the fundamental principles behind server monitoring and dives deep into how we built a comprehensi...]]></description><link>https://nikhilmishra.xyz/server-stats</link><guid isPermaLink="true">https://nikhilmishra.xyz/server-stats</guid><category><![CDATA[remote server monitoring]]></category><category><![CDATA[Devops]]></category><category><![CDATA[IT Infrastructure]]></category><category><![CDATA[Linux]]></category><category><![CDATA[System administration]]></category><category><![CDATA[Performance Optimization]]></category><category><![CDATA[metrics]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Tue, 04 Mar 2025 11:04:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741086067901/77a75300-1a86-4bbd-93fd-39b7ea1934a9.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In today's complex IT infrastructure, understanding server performance is not just a convenience—it's a necessity. This blog post explores the fundamental principles behind server monitoring and dives deep into how we built a comprehensive server statistics tool from scratch.</p>
<p>As engineers, we often rely on sophisticated monitoring tools without understanding their inner workings. By breaking down our approach to first principles, we'll gain insights into not just how to monitor servers, but why specific metrics matter and how they interrelate.</p>
<h2 id="heading-understanding-the-fundamentals-of-server-monitoring">Understanding the Fundamentals of Server Monitoring</h2>
<h3 id="heading-why-monitor-servers">Why Monitor Servers?</h3>
<p>At its core, server monitoring solves several critical problems:</p>
<ol>
<li><strong>Proactive Issue Detection</strong>: Identifying problems before they impact users</li>
<li><strong>Performance Optimization</strong>: Finding bottlenecks that limit system capability</li>
<li><strong>Capacity Planning</strong>: Understanding resource utilization trends</li>
<li><strong>Security Oversight</strong>: Detecting unusual patterns that may indicate breaches</li>
</ol>
<p>The foundation of effective monitoring lies in knowing which metrics truly matter. Let's break this down by examining the fundamental resources every server manages.</p>
<h2 id="heading-the-four-pillars-of-server-resources">The Four Pillars of Server Resources</h2>
<p>From first principles, every server manages four essential resources:</p>
<pre><code class="lang-mermaid">graph TD
    A[Server Resources] --&gt; B[CPU]
    A --&gt; C[Memory]
    A --&gt; D[Disk]
    A --&gt; E[Network]

    B --&gt; B1[Processing Power]
    B --&gt; B2[Task Scheduling]

    C --&gt; C1[Data Storage]
    C --&gt; C2[Application State]

    D --&gt; D1[Persistent Storage]
    D --&gt; D2[I/O Operations]

    E --&gt; E1[Data Transfer]
    E --&gt; E2[Communication]
</code></pre>
<p>Understanding how these resources interact and depend on each other is crucial. For example, insufficient memory can lead to excessive disk swapping, creating an I/O bottleneck that appears as a disk problem but originates from memory constraints.</p>
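<p>That chain of causation can be spotted from the swap counters. A quick check (Linux assumed; nonzero and growing swap usage alongside high iowait points back to memory pressure):</p>
<pre><code class="lang-bash"># Report swap consumption in MB from the "Swap:" row of free
swap_used=$(free -m | awk '/^Swap:/ {print $3}')
echo "Swap used: ${swap_used}MB"
</code></pre>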
<h2 id="heading-architecture-of-our-monitoring-solution">Architecture of Our Monitoring Solution</h2>
<p>Our server-stats tool follows a modular design pattern where each component focuses on monitoring a specific aspect of the system:</p>
<pre><code class="lang-mermaid">flowchart TD
    A[server-stats.sh] --&gt; B[System Information]
    A --&gt; C[Resource Monitoring]
    A --&gt; D[Security Metrics]

    B --&gt; B1[OS Version]
    B --&gt; B2[System Uptime]
    B --&gt; B3[Load Average]
    B --&gt; B4[Current Date/Time]

    C --&gt; C1[CPU Usage]
    C --&gt; C2[Memory Utilization]
    C --&gt; C3[Disk Space Analysis]
    C --&gt; C4[Top Processes]

    D --&gt; D1[Active User Sessions]
    D --&gt; D2[Failed Login Attempts]
    D --&gt; D3[Auth Log Analysis]
</code></pre>
<p>This architecture allows for:</p>
<ul>
<li>Independent development of each module</li>
<li>Easy extensibility to add new metrics</li>
<li>Clear separation of concerns</li>
</ul>
<h2 id="heading-deep-dive-implementation-from-first-principles">Deep Dive: Implementation From First Principles</h2>
<h3 id="heading-cpu-monitoring">CPU Monitoring</h3>
<p>From first principles, CPU usage is fundamentally about time allocation. When we measure CPU percentage, we're asking: "Of the total available CPU time, how much was spent on actual work versus waiting?"</p>
<p>The Linux kernel tracks CPU time in multiple categories:</p>
<ul>
<li><strong>user</strong>: Time spent running user space processes</li>
<li><strong>nice</strong>: Time spent running niced processes (with adjusted priority)</li>
<li><strong>system</strong>: Time spent in kernel operations</li>
<li><strong>idle</strong>: Time when CPU had nothing to process</li>
<li><strong>iowait</strong>: Time spent waiting for I/O operations</li>
<li><strong>irq/softirq</strong>: Time handling interrupts</li>
</ul>
<p>Our implementation extracts this data directly from <code>/proc/stat</code>:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">function</span> cpu_usage {
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Total CPU Usage ==="</span>
    cpu_info=$(grep <span class="hljs-string">'cpu '</span> /proc/<span class="hljs-built_in">stat</span>)
    user=$(<span class="hljs-built_in">echo</span> <span class="hljs-variable">$cpu_info</span> | awk <span class="hljs-string">'{print $2}'</span>)
    nice=$(<span class="hljs-built_in">echo</span> <span class="hljs-variable">$cpu_info</span> | awk <span class="hljs-string">'{print $3}'</span>)
    system=$(<span class="hljs-built_in">echo</span> <span class="hljs-variable">$cpu_info</span> | awk <span class="hljs-string">'{print $4}'</span>)
    idle=$(<span class="hljs-built_in">echo</span> <span class="hljs-variable">$cpu_info</span> | awk <span class="hljs-string">'{print $5}'</span>)

    total=$((user + nice + system + idle))
    used=$((user + nice + system))

    cpu_percentage=$((<span class="hljs-number">100</span> * used / total))

    <span class="hljs-built_in">echo</span> <span class="hljs-string">"CPU Usage: <span class="hljs-variable">$cpu_percentage</span>%"</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">""</span>
}
</code></pre>
<p>While simplified, this function captures the essence of CPU monitoring by calculating the ratio of active time (user + nice + system) to total time.</p>
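<p>A more accurate variant samples <code>/proc/stat</code> twice and works on the delta, so it reports recent usage rather than the average since boot, and it counts <code>iowait</code> as idle time. A sketch, not part of the original script:</p>
<pre><code class="lang-bash"># First sample of the aggregate "cpu" line: user nice system idle iowait
set -- $(head -n1 /proc/stat)
u1=$2; n1=$3; s1=$4; i1=$5; w1=$6

sleep 1

# Second sample one second later
set -- $(head -n1 /proc/stat)
u2=$2; n2=$3; s2=$4; i2=$5; w2=$6

idle=$(( (i2 + w2) - (i1 + w1) ))
total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
echo "CPU Usage (last 1s): $(( 100 * (total - idle) / total ))%"
</code></pre>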
<h3 id="heading-memory-monitoring-from-first-principles">Memory Monitoring From First Principles</h3>
<p>Memory is fundamentally about allocation of finite storage space. The key insight is understanding the difference between available memory, used memory, and how the system manages memory pressure.</p>
<p>In Linux, the <code>free</code> command provides this information in an accessible format:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">function</span> memory_usage {
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Total Memory Usage ==="</span>
    mem_info=$(free -m)

    total_mem=$(<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$mem_info</span>"</span> | awk <span class="hljs-string">'NR==2{print $2}'</span>)
    used_mem=$(<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$mem_info</span>"</span> | awk <span class="hljs-string">'NR==2{print $3}'</span>)

    free_mem=$((total_mem - used_mem))

    mem_percentage=$((<span class="hljs-number">100</span> * used_mem / total_mem))

    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Total Memory: <span class="hljs-variable">${total_mem}</span>MB"</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Used Memory: <span class="hljs-variable">${used_mem}</span>MB"</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Free Memory: <span class="hljs-variable">${free_mem}</span>MB"</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Memory Usage Percentage: <span class="hljs-variable">${mem_percentage}</span>%"</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">""</span>
}
</code></pre>
<p>It's worth noting that modern Linux kernels have sophisticated memory management that includes caching frequently used data. A more comprehensive analysis would distinguish between memory used by applications and memory used for cache, which can be released if needed.</p>
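<p>On kernels that expose <code>MemAvailable</code> (3.14 and later), the <code>available</code> column of <code>free</code> already accounts for reclaimable cache, making it a better health signal than raw free memory:</p>
<pre><code class="lang-bash"># "available" (column 7 of free -m) estimates memory usable by new
# workloads without swapping, including cache the kernel can reclaim
available_mem=$(free -m | awk '/^Mem:/ {print $7}')
echo "Available Memory: ${available_mem}MB"
</code></pre>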
<h2 id="heading-execution-flow">Execution Flow</h2>
<p>The overall execution flow of our server monitoring tool follows a sequential pattern:</p>
<pre><code class="lang-mermaid">sequenceDiagram
    participant User
    participant Script
    participant System

    User-&gt;&gt;Script: Execute ./server-stats.sh
    Script-&gt;&gt;System: Request date information
    System--&gt;&gt;Script: Return current date
    Script-&gt;&gt;User: Display date

    Script-&gt;&gt;System: Request OS version
    System--&gt;&gt;Script: Return OS details
    Script-&gt;&gt;User: Display OS version

    Script-&gt;&gt;System: Request uptime data
    System--&gt;&gt;Script: Return uptime
    Script-&gt;&gt;User: Display uptime

    Script-&gt;&gt;System: Check load average
    System--&gt;&gt;Script: Return load statistics
    Script-&gt;&gt;User: Display load average

    Script-&gt;&gt;System: Query user sessions
    System--&gt;&gt;Script: Return active sessions
    Script-&gt;&gt;User: Display logged in users

    Script-&gt;&gt;System: Check security logs
    System--&gt;&gt;Script: Return failed login count
    Script-&gt;&gt;User: Display failed attempts

    Script-&gt;&gt;System: Request CPU statistics
    System--&gt;&gt;Script: Return CPU data
    Script-&gt;&gt;User: Display CPU usage

    Script-&gt;&gt;System: Query memory allocation
    System--&gt;&gt;Script: Return memory statistics
    Script-&gt;&gt;User: Display memory usage

    Script-&gt;&gt;System: Request disk information
    System--&gt;&gt;Script: Return disk statistics
    Script-&gt;&gt;User: Display disk usage

    Script-&gt;&gt;System: Query resource-intensive processes
    System--&gt;&gt;Script: Return top processes
    Script-&gt;&gt;User: Display top processes
</code></pre>
<h2 id="heading-security-considerations">Security Considerations</h2>
<p>From first principles, system security involves detecting anomalies and unauthorized access attempts. Our tool incorporates basic security monitoring:</p>
<ol>
<li><strong>Active Sessions</strong>: Shows who is currently logged in, allowing administrators to identify unexpected users</li>
<li><strong>Failed Login Attempts</strong>: A sudden increase in failed logins often indicates a brute force attack</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">function</span> failed_login_attempts {
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Failed Login Attempts ==="</span>
    sudo grep -c <span class="hljs-string">'Failed password'</span> /var/log/auth.log
    <span class="hljs-built_in">echo</span> <span class="hljs-string">""</span>
}
</code></pre>
<p>A more comprehensive solution would include:</p>
<ul>
<li>Tracking login attempts by IP address</li>
<li>Monitoring for privilege escalation</li>
<li>Detecting unusual file system access patterns</li>
<li>Checking for modifications to critical system files</li>
</ul>
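<p>The first of those extensions fits in a short pipeline. The helper below extracts IPv4 addresses from failed-password lines and ranks them by frequency; the demo log line is synthetic, and the real path varies by distribution (<code>/var/log/auth.log</code> on Debian/Ubuntu, <code>/var/log/secure</code> on Amazon Linux and RHEL):</p>
<pre><code class="lang-bash"># Rank source IPs by number of failed password attempts
top_failed_ips() {
    grep 'Failed password' "$1" |
        grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' |
        sort | uniq -c | sort -rn | head
}

# Synthetic demo input; in production point this at the auth log
printf 'Failed password for root from 203.0.113.5 port 22 ssh2\n' |
    tee /tmp/demo_auth.log
top_failed_ips /tmp/demo_auth.log
</code></pre>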
<h2 id="heading-extending-the-system-future-directions">Extending The System: Future Directions</h2>
<p>From our first principles approach, several enhancements naturally emerge:</p>
<pre><code class="lang-mermaid">mindmap
  root((Server Monitoring))
    Real-time Monitoring
      Continuous data collection
      Time-series visualization
    Alerting System
      Threshold-based alerts
      Anomaly detection
    Historical Analysis
      Performance trending
      Capacity forecasting
    Network Monitoring
      Bandwidth utilization
      Connection tracking
    Service Monitoring
      API health checks
      Database performance
    Distributed Systems
      Cross-server correlation
      Service mesh analysis
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building a server monitoring system from first principles reveals the fundamental relationships between computing resources and provides deeper insights into system behavior. Our server-stats tool, while simple, demonstrates the core concepts behind effective monitoring.</p>
<p>By understanding the why behind each metric, not just the how, we develop more intuitive insights into server performance and can make better-informed decisions about optimization, scaling, and troubleshooting.</p>
<p>The journey from raw system data to actionable intelligence begins with these fundamentals. As we build more sophisticated monitoring solutions, these principles remain the foundation upon which all effective observability is built.</p>
<h2 id="heading-about-the-author">About The Author</h2>
<p>I'm a system engineer and DevOps enthusiast passionate about understanding complex systems from first principles. This project is part of the roadmap.sh learning path for server administration and monitoring.</p>
<hr />
<p><em>For more information about server monitoring and best practices, visit <a target="_blank" href="https://roadmap.sh/projects/server-stats">roadmap.sh/projects/server-stats</a></em></p>
<p><em>For the code, visit the GitHub repo: <a target="_blank" href="https://github.com/kaalpanikh/server-stats">github.com/kaalpanikh/server-stats</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Securing the Cloud: Mastering SSH Access on AWS]]></title><description><![CDATA[Introduction
In the world of server administration and cloud computing, secure remote access is of paramount importance. This blog post guides you through the entire process of setting up secure SSH (Secure Shell) access to a remote Linux server on A...]]></description><link>https://nikhilmishra.xyz/securing-the-cloud-mastering-ssh-access-on-aws</link><guid isPermaLink="true">https://nikhilmishra.xyz/securing-the-cloud-mastering-ssh-access-on-aws</guid><category><![CDATA[Remote Server Setup]]></category><category><![CDATA[AWS]]></category><category><![CDATA[ec2]]></category><category><![CDATA[ssh]]></category><category><![CDATA[Linux]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[#cybersecurity]]></category><category><![CDATA[cloud server security]]></category><category><![CDATA[fail2ban]]></category><category><![CDATA[SSH Key Management]]></category><category><![CDATA[public-key cryptgraphy]]></category><category><![CDATA[authentication]]></category><category><![CDATA[firewall]]></category><category><![CDATA[encryption]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Mon, 03 Mar 2025 10:53:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740998935030/45f41513-2997-45a4-a201-a4cf186e12f8.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In the world of server administration and cloud computing, secure remote access is of paramount importance. This blog post guides you through the entire process of setting up secure SSH (Secure Shell) access to a remote Linux server on AWS, using multiple SSH key pairs for authentication, and implementing additional security measures like fail2ban to protect against brute force attacks.</p>
<p>By the end of this guide, you'll have a comprehensive understanding of SSH key management, server configuration, and security best practices that you can apply to your own projects.</p>
<h2 id="heading-understanding-ssh-from-first-principles">Understanding SSH from First Principles</h2>
<h3 id="heading-what-is-ssh">What is SSH?</h3>
<p>SSH (Secure Shell) is a cryptographic network protocol that enables secure communication between two computers over an unsecured network. Unlike predecessors such as Telnet, SSH encrypts all traffic, protecting against eavesdropping and man-in-the-middle attacks.</p>
<h3 id="heading-the-cryptographic-foundation-of-ssh">The Cryptographic Foundation of SSH</h3>
<p>SSH security is built on public-key cryptography. This system uses a pair of keys:</p>
<ol>
<li><strong>Private Key</strong>: Kept secret on your local machine</li>
<li><strong>Public Key</strong>: Shared with remote servers</li>
</ol>
<p>These keys work together through asymmetric encryption:</p>
<ul>
<li>Messages encrypted with the public key can only be decrypted with the corresponding private key</li>
<li>The private key is used to generate digital signatures that can be verified with the public key</li>
</ul>
<p>This creates a secure system where:</p>
<ul>
<li>The server knows it's talking to the authorized client (authentication)</li>
<li>All communication is encrypted (confidentiality)</li>
<li>Messages cannot be altered in transit (integrity)</li>
</ul>
<pre><code class="lang-mermaid">sequenceDiagram
    Client-&gt;&gt;Server: 1. Connection Request
    Server-&gt;&gt;Client: 2. Server Identity
    Client-&gt;&gt;Server: 3. Key Exchange
    Client-&gt;&gt;Server: 4. Authentication with Private Key
    Server-&gt;&gt;Client: 5. Authentication Verification
    Client-&gt;&gt;Server: 6. Encrypted Session Begins
</code></pre>
<h2 id="heading-project-overview-remote-server-setup-with-multiple-ssh-keys">Project Overview: Remote Server Setup with Multiple SSH Keys</h2>
<p>Let's implement these concepts by setting up a remote Linux server on AWS with secure SSH access using two separate SSH key pairs. This approach demonstrates how you can manage different authentication credentials for the same server.</p>
<h3 id="heading-our-architecture">Our Architecture</h3>
<pre><code class="lang-mermaid">graph TD
    subgraph "AWS Cloud"
        EC2["Amazon Linux EC2 Instance"]
        SG["Security Group"]
        Auth["~/.ssh/authorized_keys"]
        F2B["fail2ban"]
    end

    subgraph "Local Machine"
        Key1["SSH Key Pair 1"]
        Key2["SSH Key Pair 2"] 
        Config["SSH Config File"]
        Client["SSH Client"]
    end

    SG --&gt;|"Allows Port 22"| EC2
    Key1 --&gt;|"Public Key"| Auth
    Key2 --&gt;|"Public Key"| Auth
    Auth --&gt;|"Authenticates"| EC2
    Client --&gt;|"SSH Connection"| EC2
    Config --&gt;|"Configures"| Client
    F2B --&gt;|"Protects"| EC2

    style EC2 fill:#f9f,stroke:#333,stroke-width:2px
    style SG fill:#bbf,stroke:#333,stroke-width:1px
    style Key1 fill:#bfb,stroke:#333,stroke-width:1px
    style Key2 fill:#bfb,stroke:#333,stroke-width:1px
    style F2B fill:#f66,stroke:#333,stroke-width:1px
</code></pre>
<h2 id="heading-step-1-provisioning-the-aws-ec2-instance">Step 1: Provisioning the AWS EC2 Instance</h2>
<p>The first step is to create your virtual server in the AWS cloud. EC2 (Elastic Compute Cloud) provides scalable computing capacity in the AWS cloud.</p>
<pre><code class="lang-mermaid">sequenceDiagram
    participant User
    participant AWS as AWS Console
    participant EC2 as EC2 Instance

    User-&gt;&gt;AWS: Create EC2 Instance
    AWS-&gt;&gt;EC2: Launch Amazon Linux
    AWS-&gt;&gt;EC2: Configure Security Group
    AWS-&gt;&gt;User: Provide Initial Key Pair
    User-&gt;&gt;EC2: Initial SSH Connection
    Note over User,EC2: Using AWS-provided key pair
</code></pre>
<h3 id="heading-step-by-step-ec2-setup">Step-by-Step EC2 Setup:</h3>
<ol>
<li><strong>Log in to your AWS Console</strong> and navigate to the EC2 dashboard</li>
<li><p><strong>Launch a new instance</strong> with the following specifications:</p>
<ul>
<li><strong>AMI</strong>: Amazon Linux (a Linux distribution optimized for AWS)</li>
<li><strong>Instance Type</strong>: t2.micro (free tier eligible)</li>
<li><strong>Security Group</strong>: Create a new one with port 22 (SSH) open</li>
<li><strong>Key Pair</strong>: Create or select an existing key pair for initial access</li>
</ul>
</li>
<li><p><strong>Connect to your instance</strong> using the AWS-provided key:</p>
<pre><code class="lang-bash">ssh -i ~/.ssh/aws_key.pem ec2-user@your-instance-ip
</code></pre>
</li>
</ol>
<blockquote>
<p>💡 <strong>Why Amazon Linux?</strong> Amazon Linux is optimized for AWS, includes AWS tools by default, and receives regular security updates directly from Amazon. This makes it an excellent choice for AWS-hosted servers.</p>
</blockquote>
<h2 id="heading-step-2-understanding-and-generating-ssh-key-pairs">Step 2: Understanding and Generating SSH Key Pairs</h2>
<p>SSH key pairs are the cryptographic credentials that allow secure authentication without passwords. Let's generate two separate key pairs for our server:</p>
<pre><code class="lang-mermaid">graph LR
    A["ssh-keygen command"] --&gt; B["~/.ssh/my_first_key (Private)"]
    A --&gt; C["~/.ssh/my_first_key.pub (Public)"]
    A --&gt; D["~/.ssh/my_second_key (Private)"]
    A --&gt; E["~/.ssh/my_second_key.pub (Public)"]

    style B fill:#f96,stroke:#333
    style D fill:#f96,stroke:#333
    style C fill:#9f6,stroke:#333
    style E fill:#9f6,stroke:#333
</code></pre>
<h3 id="heading-generating-ssh-keys-from-first-principles">Generating SSH Keys from First Principles:</h3>
<p>The <code>ssh-keygen</code> utility creates a mathematical key pair using public key cryptography algorithms. When generating keys, consider:</p>
<ol>
<li><strong>Key Type</strong>: RSA is widely supported, but newer algorithms like Ed25519 offer better security with smaller key sizes</li>
<li><strong>Key Size</strong>: For RSA keys, 4096 bits provides strong security</li>
<li><strong>Passphrase</strong>: An optional extra layer of security that encrypts your private key</li>
</ol>
<h3 id="heading-creating-our-keys">Creating Our Keys:</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># Generate first key pair</span>
ssh-keygen -t rsa -b 4096 -f ~/.ssh/my_first_key -C <span class="hljs-string">"first-key"</span>

<span class="hljs-comment"># Generate second key pair</span>
ssh-keygen -t rsa -b 4096 -f ~/.ssh/my_second_key -C <span class="hljs-string">"second-key"</span>
</code></pre>
<p>Each command creates:</p>
<ul>
<li>A private key file (<code>my_first_key</code> or <code>my_second_key</code>)</li>
<li>A public key file with <code>.pub</code> extension</li>
</ul>
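<p>Since Ed25519 was mentioned above, here is the equivalent invocation; the path is a throwaway demo location, and <code>-N ''</code> sets an empty passphrase for demonstration only, so use a real passphrase in practice:</p>
<pre><code class="lang-bash"># Ed25519: smaller keys with strong security (OpenSSH 6.5 and later)
rm -f /tmp/demo_ed25519_key /tmp/demo_ed25519_key.pub
ssh-keygen -t ed25519 -f /tmp/demo_ed25519_key -C "ed25519-key" -N '' -q
</code></pre>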
<blockquote>
<p>⚠️ <strong>Security Alert</strong>: Your private key files should NEVER be shared with anyone or committed to repositories. They should have permissions set to 600 (readable only by your user).</p>
</blockquote>
<h2 id="heading-step-3-server-side-ssh-configuration">Step 3: Server-Side SSH Configuration</h2>
<p>Now we'll configure our remote server to accept both SSH key pairs for authentication.</p>
<pre><code class="lang-mermaid">graph LR
    A["Local: Public Keys"] --&gt;|"Copy to Server"| B["Server: ~/.ssh/authorized_keys"]
    B --&gt;|"Permissions: 600"| C["SSH Server"]

    subgraph "Server Configuration"
        B
        D["~/.ssh directory&lt;br/&gt;Permissions: 700"]
    end

    style A fill:#bfb,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#f9f,stroke:#333
    style D fill:#bbf,stroke:#333
</code></pre>
<h3 id="heading-understanding-authorizedkeys-from-first-principles">Understanding authorized_keys from First Principles:</h3>
<p>The <code>authorized_keys</code> file contains a list of public keys that are allowed to authenticate. When an SSH client tries to connect:</p>
<ol>
<li>The server reads <code>authorized_keys</code></li>
<li>The client proves it has the corresponding private key</li>
<li>If proven, access is granted without a password</li>
</ol>
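<p>OpenSSH also ships <code>ssh-copy-id</code>, which automates appending a public key and fixing permissions (<code>ssh-copy-id -i ~/.ssh/my_first_key.pub ec2-user@your-instance-ip</code>). Done by hand against a local demo directory, the same steps look like this (the real target is <code>~/.ssh/authorized_keys</code> on the server, and the key line below is a placeholder):</p>
<pre><code class="lang-bash"># Stand-in for the server-side ~/.ssh directory
mkdir -p /tmp/demo_ssh
chmod 700 /tmp/demo_ssh

# Append a placeholder public key; one key per line
echo "ssh-rsa AAAA...placeholder first-key" | tee -a /tmp/demo_ssh/authorized_keys
chmod 600 /tmp/demo_ssh/authorized_keys
</code></pre>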
<h3 id="heading-adding-our-public-keys">Adding Our Public Keys:</h3>
<ol>
<li><p><strong>Display your public keys</strong> on your local machine:</p>
<pre><code class="lang-bash">cat ~/.ssh/my_first_key.pub
cat ~/.ssh/my_second_key.pub
</code></pre>
</li>
<li><p><strong>Add to authorized_keys</strong> on the server:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># On your remote server</span>
nano ~/.ssh/authorized_keys

<span class="hljs-comment"># Paste both public keys, each on its own line</span>
<span class="hljs-comment"># Save and exit (Ctrl+X, Y, Enter in nano)</span>
</code></pre>
</li>
<li><p><strong>Set proper permissions</strong>:</p>
<pre><code class="lang-bash">chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
</code></pre>
</li>
</ol>
<blockquote>
<p>💡 <strong>Why these permissions?</strong> The SSH daemon is highly security-conscious and will refuse to use keys if the permissions are too open. These permission settings ensure only the owner can read or modify the keys.</p>
</blockquote>
<h2 id="heading-step-4-configuring-your-local-ssh-client">Step 4: Configuring Your Local SSH Client</h2>
<p>To simplify connections, we'll create an SSH config file that specifies connection details and key locations.</p>
<pre><code class="lang-mermaid">graph LR
    A["~/.ssh/config file"] --&gt;|"Contains"| B["Host alias configuration"]
    B --&gt;|"Specifies"| C["Connection details"]
    C --&gt;|"Includes"| D["HostName (IP)"]
    C --&gt;|"Includes"| E["User"]
    C --&gt;|"Includes"| F["IdentityFile paths"]

    G["ssh roadmapsh-test-server command"] --&gt;|"Uses"| A
    G --&gt;|"Connects to"| H["Remote Server"]

    style A fill:#bbf,stroke:#333
    style G fill:#bfb,stroke:#333
    style H fill:#f9f,stroke:#333
</code></pre>
<h3 id="heading-ssh-config-from-first-principles">SSH Config from First Principles:</h3>
<p>The SSH config file allows you to:</p>
<ul>
<li>Create shortcuts for complex connection commands</li>
<li>Specify different keys for different servers</li>
<li>Set server-specific options</li>
</ul>
<h3 id="heading-creating-the-config-file">Creating the Config File:</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># On your local machine</span>
nano ~/.ssh/config
</code></pre>
<p>Add these lines:</p>
<pre><code>Host roadmapsh-test-server
    HostName your-instance-ip
    User ec2-user
    IdentityFile ~/.ssh/my_first_key
    IdentityFile ~/.ssh/my_second_key
</code></pre><p>Save and exit. Now you can connect with:</p>
<pre><code class="lang-bash">ssh roadmapsh-test-server
</code></pre>
<p>SSH will automatically try each key in order until one works.</p>
<blockquote>
<p>💡 <strong>Pro Tip</strong>: You can include additional options like <code>ServerAliveInterval 60</code> to keep connections alive or <code>Compression yes</code> to speed up connections over slow networks.</p>
</blockquote>
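<p>Putting those options together, a slightly hardened version of the same config entry might look like this (a sketch; tune the values to your network):</p>
<pre><code>Host roadmapsh-test-server
    HostName your-instance-ip
    User ec2-user
    IdentityFile ~/.ssh/my_first_key
    IdentityFile ~/.ssh/my_second_key
    ServerAliveInterval 60
    Compression yes
</code></pre>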
<h2 id="heading-step-5-enhancing-security-with-fail2ban">Step 5: Enhancing Security with fail2ban</h2>
<p>fail2ban is a security tool that monitors log files and automatically blocks IP addresses that show malicious signs, such as multiple failed login attempts.</p>
<pre><code class="lang-mermaid">flowchart TD
    A["Server Logs"] --&gt;|"Monitored by"| B["fail2ban"]
    B --&gt;|"Detects"| C["Suspicious Activity"]
    C --&gt;|"Triggers"| D["Ban Action"]
    D --&gt;|"Updates"| E["Firewall Rules"]
    E --&gt;|"Blocks"| F["Attacking IP"]

    style A fill:#bbf,stroke:#333
    style B fill:#f9f,stroke:#333
    style C fill:#f66,stroke:#333
    style F fill:#f66,stroke:#333
</code></pre>
<h3 id="heading-how-fail2ban-works-from-first-principles">How fail2ban Works from First Principles:</h3>
<p>fail2ban:</p>
<ol>
<li>Constantly monitors log files like <code>/var/log/secure</code></li>
<li>Uses regular expressions to detect patterns like failed login attempts</li>
<li>Maintains counters for each IP address</li>
<li>When thresholds are exceeded, it updates firewall rules to block the IP</li>
<li>After a configurable ban time, it removes the block</li>
</ol>
<h3 id="heading-installing-and-configuring-fail2ban">Installing and Configuring fail2ban:</h3>
<ol>
<li><p><strong>Install fail2ban</strong>:</p>
<pre><code class="lang-bash">sudo yum update -y
sudo yum install fail2ban -y
</code></pre>
</li>
<li><p><strong>Create a custom configuration</strong>:</p>
<pre><code class="lang-bash">sudo cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local
sudo nano /etc/fail2ban/jail.local
</code></pre>
</li>
<li><p><strong>Configure the SSH jail</strong> by ensuring these settings:</p>
<pre><code class="lang-ini"><span class="hljs-section">[sshd]</span>
<span class="hljs-attr">enabled</span> = <span class="hljs-literal">true</span>
<span class="hljs-attr">port</span> = ssh
<span class="hljs-attr">filter</span> = sshd
<span class="hljs-attr">logpath</span> = /var/log/secure
<span class="hljs-attr">maxretry</span> = <span class="hljs-number">3</span>
<span class="hljs-attr">bantime</span> = <span class="hljs-number">3600</span>  <span class="hljs-comment"># 1 hour in seconds</span>
</code></pre>
</li>
<li><p><strong>Start and enable fail2ban</strong>:</p>
<pre><code class="lang-bash">sudo systemctl start fail2ban
sudo systemctl <span class="hljs-built_in">enable</span> fail2ban
</code></pre>
</li>
<li><p><strong>Verify it's working</strong>:</p>
<pre><code class="lang-bash">sudo fail2ban-client status sshd
</code></pre>
</li>
</ol>
<blockquote>
<p>💡 <strong>Understanding fail2ban jails</strong>: Each "jail" is a separate configuration for a specific service (like SSH). You can have different settings for different services, allowing fine-grained control over security policies.</p>
</blockquote>
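<p>Beyond the sshd jail, fail2ban ships a <code>recidive</code> jail that watches fail2ban's own log and hands out much longer bans to repeat offenders. A sketch of enabling it in <code>jail.local</code> (the values shown are illustrative, close to the defaults):</p>
<pre><code class="lang-ini">[recidive]
enabled  = true
logpath  = /var/log/fail2ban.log
bantime  = 604800
findtime = 86400
maxretry = 5
</code></pre>
<p>Here <code>bantime</code> is one week and <code>findtime</code> one day in seconds, so an IP that keeps getting banned by other jails is locked out for a week.</p>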
<h2 id="heading-testing-our-setup">Testing Our Setup</h2>
<p>Let's verify that our setup works correctly:</p>
<pre><code class="lang-mermaid">sequenceDiagram
    participant User
    participant LocalSSH as Local SSH Client
    participant RemoteSSH as Remote SSH Server
    participant F2B as fail2ban

    User-&gt;&gt;LocalSSH: ssh -i ~/.ssh/my_first_key ec2-user@IP
    LocalSSH-&gt;&gt;RemoteSSH: Authenticate with first key
    RemoteSSH-&gt;&gt;User: Successful login

    User-&gt;&gt;LocalSSH: ssh -i ~/.ssh/my_second_key ec2-user@IP
    LocalSSH-&gt;&gt;RemoteSSH: Authenticate with second key
    RemoteSSH-&gt;&gt;User: Successful login

    User-&gt;&gt;LocalSSH: ssh roadmapsh-test-server
    LocalSSH-&gt;&gt;RemoteSSH: Try first key, then second key
    RemoteSSH-&gt;&gt;User: Successful login

    User-&gt;&gt;LocalSSH: Attempt with wrong key (3 times)
    LocalSSH-&gt;&gt;RemoteSSH: Failed authentication
    RemoteSSH-&gt;&gt;F2B: Log failed attempts
    F2B-&gt;&gt;RemoteSSH: Ban IP after 3 failures
</code></pre>
<h3 id="heading-test-cases">Test Cases:</h3>
<ol>
<li><p><strong>Connect with the first key</strong>:</p>
<pre><code class="lang-bash">ssh -i ~/.ssh/my_first_key ec2-user@your-instance-ip
</code></pre>
</li>
<li><p><strong>Connect with the second key</strong>:</p>
<pre><code class="lang-bash">ssh -i ~/.ssh/my_second_key ec2-user@your-instance-ip
</code></pre>
</li>
<li><p><strong>Connect with the alias</strong>:</p>
<pre><code class="lang-bash">ssh roadmapsh-test-server
</code></pre>
</li>
<li><p><strong>Test fail2ban</strong> (optional, and with caution):
On a different machine, attempt to connect with incorrect credentials multiple times. Then check if your IP is banned:</p>
<pre><code class="lang-bash">sudo fail2ban-client status sshd
</code></pre>
</li>
</ol>
<h2 id="heading-security-best-practices">Security Best Practices</h2>
<p>Throughout this tutorial, we've implemented several security best practices. Here's a summary:</p>
<h3 id="heading-key-management-best-practices">Key Management Best Practices</h3>
<ol>
<li><strong>Use strong keys</strong>: 4096-bit RSA or Ed25519 keys</li>
<li><strong>Protect private keys</strong>: Set permissions to 600 and never share them</li>
<li><strong>Use passphrases</strong>: Add an extra layer of protection to private keys</li>
<li><strong>Rotate keys periodically</strong>: Generate new keys and remove old ones</li>
</ol>
<h3 id="heading-server-configuration-best-practices">Server Configuration Best Practices</h3>
<ol>
<li><p><strong>Disable password authentication</strong> in <code>/etc/ssh/sshd_config</code>:</p>
<pre><code>PasswordAuthentication no
</code></pre></li>
<li><p><strong>Limit user access</strong> by specifying allowed users:</p>
<pre><code>AllowUsers ec2-user
</code></pre></li>
<li><p><strong>Change the default SSH port</strong> to reduce automated attacks:</p>
<pre><code>Port <span class="hljs-number">2222</span>  # Example alternative port
</code></pre></li>
<li><p><strong>Use security groups/firewalls</strong> to restrict IP ranges that can access SSH</p>
</li>
<li><p><strong>Keep your system updated</strong> with regular security patches:</p>
<pre><code class="lang-bash">sudo yum update -y
</code></pre>
</li>
</ol>
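<p>Taken together, the <code>/etc/ssh/sshd_config</code> changes above might look like the following (a sketch; restart sshd afterwards with <code>sudo systemctl restart sshd</code>, and keep an existing session open in case you lock yourself out):</p>
<pre><code>PasswordAuthentication no
AllowUsers ec2-user
Port 2222
</code></pre>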
<h2 id="heading-conclusion">Conclusion</h2>
<p>You've now set up a secure remote server access system using multiple SSH keys and enhanced protection with fail2ban. This approach provides:</p>
<ol>
<li><strong>Strong authentication</strong> without passwords</li>
<li><strong>Multiple access credentials</strong> that can be managed separately</li>
<li><strong>Protection against brute force attacks</strong></li>
<li><strong>Simplified connection</strong> through SSH config</li>
</ol>
<p>This knowledge forms the foundation of secure server administration and can be applied to any Linux-based server, not just AWS EC2 instances.</p>
<h2 id="heading-additional-resources">Additional Resources</h2>
<ul>
<li><a target="_blank" href="https://www.openssh.com/manual.html">OpenSSH Documentation</a></li>
<li><a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/">AWS EC2 User Guide</a></li>
<li><a target="_blank" href="https://www.fail2ban.org/wiki/index.php/Main_Page">fail2ban Documentation</a></li>
<li><a target="_blank" href="https://www.ssh.com/academy/">SSH Academy</a></li>
<li><a target="_blank" href="https://linux-audit.com/linux-server-hardening-most-important-steps-to-secure-your-system/">Linux Security Best Practices</a></li>
</ul>
<p>GitHub repo: <a target="_blank" href="https://github.com/kaalpanikh/ssh-remote-server-setup">kaalpanikh/ssh-remote-server-setup</a></p>
<hr />
<p><em>Have you set up remote SSH access before? What other security measures do you implement on your servers? Share your experience in the comments below!</em></p>
]]></content:encoded></item><item><title><![CDATA[n8n on Azure Kubernetes Service: Benefits and Advanced Enhancements]]></title><description><![CDATA[This is Part 8 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. View the complete series here.
Conclusion and Next Steps
Welcome to the final part of our n8n on AKS series! Throughout the previ...]]></description><link>https://nikhilmishra.xyz/n8n-on-azure-kubernetes-service-benefits-and-advanced-enhancements</link><guid isPermaLink="true">https://nikhilmishra.xyz/n8n-on-azure-kubernetes-service-benefits-and-advanced-enhancements</guid><category><![CDATA[n8n]]></category><category><![CDATA[Benefits]]></category><category><![CDATA[Workflow Automation]]></category><category><![CDATA[ROI (Return on Investment)]]></category><category><![CDATA[aks]]></category><category><![CDATA[advantages]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Sun, 02 Mar 2025 09:24:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740907254719/ac0f11f7-ca27-4273-9129-d26df599d617.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is Part 8 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. <a target="_blank" href="https://nikhilmishra.live/series/n8n-azure-k8s">View the complete series here</a>.</em></p>
<h1 id="heading-conclusion-and-next-steps">Conclusion and Next Steps</h1>
<p>Welcome to the final part of our n8n on AKS series! Throughout the previous seven articles, we've built a complete production-grade n8n deployment. Let's summarize what we've accomplished and explore future possibilities.</p>
<h2 id="heading-project-summary">Project Summary</h2>
<p>In this blog post, we've walked through the complete process of deploying a production-ready n8n workflow automation platform on Azure Kubernetes Service (AKS). Let's recap what we've accomplished:</p>
<ol>
<li><p><strong>Established First Principles</strong>: We started by understanding the fundamental requirements of a production workflow system: data persistence, execution reliability, security, scalability, and maintainability.</p>
</li>
<li><p><strong>Designed a Robust Architecture</strong>: Using these principles, we designed a comprehensive architecture with distinct layers:</p>
<ul>
<li>Data Layer (PostgreSQL and Redis)</li>
<li>Application Layer (n8n main and workers)</li>
<li>External Access Layer (Ingress and SSL/TLS)</li>
</ul>
</li>
<li><p><strong>Implemented Core Components</strong>:</p>
<ul>
<li>AKS cluster with proper resource allocation</li>
<li>PostgreSQL database with persistence and proper user access</li>
<li>Redis queue for reliable workflow distribution</li>
<li>n8n main instance for UI and API access</li>
<li>Worker nodes for distributed workflow execution</li>
</ul>
</li>
<li><p><strong>Secured the Deployment</strong>:</p>
<ul>
<li>Kubernetes secrets for sensitive credentials</li>
<li>SSL/TLS encryption with automatic certificate management</li>
<li>Proper service isolation and network security</li>
</ul>
</li>
<li><p><strong>Added Production Features</strong>:</p>
<ul>
<li>Horizontal scaling for worker nodes</li>
<li>Monitoring and alerting setup</li>
<li>Backup and disaster recovery procedures</li>
<li>Maintenance and update strategies</li>
</ul>
</li>
<li><p><strong>Provided Troubleshooting Guidance</strong>:</p>
<ul>
<li>Common issues and resolution approaches</li>
<li>Diagnostic procedures for each component</li>
<li>Tools and scripts for efficient problem-solving</li>
</ul>
</li>
</ol>
<h2 id="heading-architecture-benefits">Architecture Benefits</h2>
<p>Our implementation provides several key benefits:</p>
<h3 id="heading-scalability">Scalability</h3>
<ul>
<li><strong>Horizontal Scaling</strong>: Worker nodes automatically scale based on demand</li>
<li><strong>Resource Efficiency</strong>: Components scaled according to their specific needs</li>
<li><strong>Growth Potential</strong>: Architecture can handle increasing workflow complexity and volume</li>
</ul>
<h3 id="heading-reliability">Reliability</h3>
<ul>
<li><strong>High Availability</strong>: Multiple nodes prevent single points of failure</li>
<li><strong>Resilient Execution</strong>: Queue-based processing ensures workflows run reliably</li>
<li><strong>Data Durability</strong>: Persistent storage with backup strategies</li>
</ul>
<h3 id="heading-security">Security</h3>
<ul>
<li><strong>Encrypted Communication</strong>: SSL/TLS for all external traffic</li>
<li><strong>Secure Credentials</strong>: Kubernetes secrets for sensitive data</li>
<li><strong>Isolation</strong>: Proper network controls and service separation</li>
</ul>
<h3 id="heading-maintainability">Maintainability</h3>
<ul>
<li><strong>Kubernetes Native</strong>: Leveraging Kubernetes features for updates and rollbacks</li>
<li><strong>Monitoring Integration</strong>: Comprehensive visibility into system health</li>
<li><strong>Documented Procedures</strong>: Clear processes for common maintenance tasks</li>
</ul>
<h2 id="heading-business-value">Business Value</h2>
<p>This n8n deployment delivers significant business value:</p>
<ol>
<li><strong>Automation Capabilities</strong>: Enables complex workflow automation across various business systems</li>
<li><strong>Reduced Manual Work</strong>: Eliminates repetitive tasks through reliable automation</li>
<li><strong>Integration Hub</strong>: Connects disparate systems without custom development</li>
<li><strong>Data Control</strong>: Self-hosted solution keeps sensitive data within your control</li>
<li><strong>Cost Efficiency</strong>: Right-sized infrastructure with optimization strategies</li>
<li><strong>Scalable Foundation</strong>: Grows with your automation needs</li>
</ol>
<h2 id="heading-key-metrics-and-performance">Key Metrics and Performance</h2>
<p>Our n8n deployment achieves impressive performance metrics:</p>
<ul>
<li><strong>Worker Scalability</strong>: 1-5 worker nodes based on demand</li>
<li><strong>Concurrent Workflows</strong>: Support for 50+ concurrent workflow executions</li>
<li><strong>Database Performance</strong>: Optimized PostgreSQL capable of handling 1000+ workflow definitions</li>
<li><strong>API Responsiveness</strong>: Sub-100ms response times for UI and API operations</li>
<li><strong>High Availability</strong>: 99.9% uptime through redundant components</li>
</ul>
<h2 id="heading-cost-analysis">Cost Analysis</h2>
<p>The deployed solution maintains a balance between performance and cost:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Component</td><td>Monthly Cost</td></tr>
</thead>
<tbody>
<tr>
<td>AKS Nodes (2 × D2s v3)</td><td>$140.16</td></tr>
<tr>
<td>Storage (Premium SSD, 64GB)</td><td>$10.44</td></tr>
<tr>
<td>Networking (Load Balancer)</td><td>$23.00</td></tr>
<tr>
<td>Monitoring</td><td>$7.50</td></tr>
<tr>
<td>Backups</td><td>$5.20</td></tr>
<tr>
<td><strong>Total</strong></td><td><strong>$186.30</strong></td></tr>
</tbody>
</table>
</div><p>These costs could be further optimized for development or testing environments.</p>
<h2 id="heading-next-steps-and-further-improvements">Next Steps and Further Improvements</h2>
<p>While our implementation is production-ready, several enhancements could be considered:</p>
<h3 id="heading-1-advanced-security-features">1. Advanced Security Features</h3>
<ul>
<li><strong>Azure AD Integration</strong>: Add Azure Active Directory integration for n8n authentication</li>
<li><strong>Private Endpoints</strong>: Configure private endpoints for Azure resources</li>
<li><strong>Network Policies</strong>: Implement Kubernetes network policies for granular traffic control</li>
<li><strong>Secret Rotation</strong>: Set up automated rotation of database and encryption credentials</li>
</ul>
<h3 id="heading-2-enhanced-scalability">2. Enhanced Scalability</h3>
<ul>
<li><strong>Global Distribution</strong>: Deploy across multiple regions for geographic redundancy</li>
<li><strong>Read Replicas</strong>: Add PostgreSQL read replicas for query-heavy workflows</li>
<li><strong>Specialized Node Pools</strong>: Create dedicated node pools for specific workflow types</li>
</ul>
<h3 id="heading-3-operational-improvements">3. Operational Improvements</h3>
<ul>
<li><strong>Automated Testing</strong>: Implement CI/CD pipelines for n8n workflows</li>
<li><strong>Custom Metrics</strong>: Develop workflow-specific metrics and dashboards</li>
<li><strong>Cost Optimization</strong>: Further refine resource allocation based on usage patterns</li>
<li><strong>Chaos Testing</strong>: Conduct chaos engineering exercises to improve resilience</li>
</ul>
<h3 id="heading-4-integration-enhancements">4. Integration Enhancements</h3>
<ul>
<li><strong>Managed Identity</strong>: Use Azure Managed Identity for secure Azure service connections</li>
<li><strong>API Management</strong>: Add Azure API Management for better API governance</li>
<li><strong>Logic Apps Bridge</strong>: Create bridges between n8n and Azure Logic Apps for hybrid workflows</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Congratulations! You've completed this comprehensive journey to deploying n8n on Azure Kubernetes Service. You now have a production-ready workflow automation platform that is scalable, reliable, secure, and maintainable.</p>
<p>I hope this series has provided valuable insights not just into n8n and AKS specifically, but also into the broader principles of designing and implementing production systems on Kubernetes. The approach we've taken—starting from first principles and building up a complete solution—can be applied to many other applications and scenarios.</p>
<p>Thank you for following along! If you have questions or want to share your own experiences with n8n or Kubernetes deployments, please leave a comment below.</p>
<p><em>Did you find this series helpful? Consider sharing it with colleagues who might benefit from this knowledge.</em></p>
<hr />
<p>What workflow automation use cases are you implementing or planning to implement with n8n? What other tools would you like to see deployed on Kubernetes using this approach? Share your thoughts in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[Troubleshooting n8n on Kubernetes: Problems and Solutions Guide]]></title><description><![CDATA[This is Part 7 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. View the complete series here.
Troubleshooting and Problem Resolution
Welcome to Part 7 of our n8n on AKS series! In Part 6, we i...]]></description><link>https://nikhilmishra.xyz/troubleshooting-n8n-on-kubernetes-problems-and-solutions-guide</link><guid isPermaLink="true">https://nikhilmishra.xyz/troubleshooting-n8n-on-kubernetes-problems-and-solutions-guide</guid><category><![CDATA[troubleshooting]]></category><category><![CDATA[n8n]]></category><category><![CDATA[kubernetes debugging]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[k8s]]></category><category><![CDATA[workflow]]></category><category><![CDATA[Workflow Automation]]></category><category><![CDATA[Issues]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Sun, 02 Mar 2025 09:17:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740906960339/7ccd6e0d-90fc-469f-8224-98a9c56674d3.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is Part 7 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. <a target="_blank" href="https://nikhilmishra.live/series/n8n-azure-k8s">View the complete series here</a>.</em></p>
<h1 id="heading-troubleshooting-and-problem-resolution">Troubleshooting and Problem Resolution</h1>
<p>Welcome to Part 7 of our n8n on AKS series! In <a class="post-section-overview" href="#part6-link">Part 6</a>, we implemented monitoring and optimization strategies. Even with the best preparation, issues can arise, so today we'll explore comprehensive troubleshooting techniques.</p>
<p>Even the most carefully designed systems encounter issues. This guide walks through the most common problems you might hit with your n8n deployment on AKS and how to resolve them.</p>
<h2 id="heading-common-issues-and-resolutions">Common Issues and Resolutions</h2>
<h3 id="heading-database-connection-issues">Database Connection Issues</h3>
<p><strong>Symptoms:</strong></p>
<ul>
<li>n8n pods showing errors like <code>Error: connect ETIMEDOUT</code> or <code>Error: connect ECONNREFUSED</code></li>
<li>Database-related error messages in n8n logs</li>
<li>n8n UI showing database connection errors</li>
</ul>
<p><strong>Diagnostic Approach:</strong></p>
<ol>
<li><p>Check PostgreSQL pod status:</p>
<pre><code class="lang-bash">kubectl get pods -n n8n -l app=postgres
</code></pre>
</li>
<li><p>Verify PostgreSQL service:</p>
<pre><code class="lang-bash">kubectl get svc postgres-service -n n8n
</code></pre>
</li>
<li><p>Check database logs:</p>
<pre><code class="lang-bash">kubectl logs $(kubectl get pod -l app=postgres -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n
</code></pre>
</li>
<li><p>Test database connection from n8n pod:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  node -e <span class="hljs-string">"const { Pool } = require('pg'); const pool = new Pool({host: 'postgres-service', user: process.env.DB_POSTGRESDB_USER, password: process.env.DB_POSTGRESDB_PASSWORD, database: 'n8n'}); pool.query('SELECT NOW()', (err, res) =&gt; { console.log(err || res.rows[0]); pool.end(); })"</span>
</code></pre>
</li>
</ol>
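<p>The diagnostics above (and those in the following sections) all repeat the same <code>jsonpath</code> lookup to resolve a pod name. A small shell helper keeps the commands readable; this is a sketch that assumes <code>kubectl</code> is on your PATH and your pods carry the <code>app</code> labels used in this series:</p>
<pre><code class="lang-bash"># Resolve the name of the first pod in the n8n namespace matching an app label
npod() {
  kubectl get pod -l app="$1" -n n8n -o jsonpath='{.items[0].metadata.name}'
}

# Usage examples:
#   kubectl logs "$(npod postgres)" -n n8n
#   kubectl exec -it "$(npod n8n)" -n n8n -- sh
</code></pre>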
<p><strong>Common Resolutions:</strong></p>
<ol>
<li><p><strong>Authentication Issues</strong>:</p>
<ul>
<li>Verify the PostgreSQL credentials in the Kubernetes secrets</li>
<li>Ensure the n8n database user exists and has proper permissions</li>
</ul>
</li>
<li><p><strong>Network Issues</strong>:</p>
<ul>
<li>Check if pods are in the same namespace</li>
<li>Verify that the service name resolution works</li>
<li>Ensure no network policies are blocking the connection</li>
</ul>
</li>
<li><p><strong>Database Health Issues</strong>:</p>
<ul>
<li>Check for PostgreSQL resource constraints</li>
<li>Verify the database isn't in recovery mode</li>
<li>Check for disk space issues</li>
</ul>
</li>
</ol>
<h3 id="heading-queueredis-connection-issues">Queue/Redis Connection Issues</h3>
<p><strong>Symptoms:</strong></p>
<ul>
<li>Workflows are triggered but stay in "waiting" status</li>
<li>Error messages like <code>Error connecting to Redis</code> in n8n logs</li>
<li>Workers not processing queued workflows</li>
</ul>
<p><strong>Diagnostic Approach:</strong></p>
<ol>
<li><p>Check Redis pod status:</p>
<pre><code class="lang-bash">kubectl get pods -n n8n -l app=redis
</code></pre>
</li>
<li><p>Verify Redis service:</p>
<pre><code class="lang-bash">kubectl get svc redis-service -n n8n
</code></pre>
</li>
<li><p>Check Redis logs:</p>
<pre><code class="lang-bash">kubectl logs $(kubectl get pod -l app=redis -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n
</code></pre>
</li>
<li><p>Test Redis connection from n8n pod:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  node -e <span class="hljs-string">"const Redis = require('ioredis'); const redis = new Redis('redis-service'); redis.ping().then(res =&gt; { console.log(res); redis.disconnect(); })"</span>
</code></pre>
</li>
</ol>
<p><strong>Common Resolutions:</strong></p>
<ol>
<li><p><strong>Connection Configuration</strong>:</p>
<ul>
<li>Verify Redis host and port settings in n8n environment variables</li>
<li>Check if the Redis service name is correctly specified</li>
</ul>
</li>
<li><p><strong>Queue Stuck Issues</strong>:</p>
<ul>
<li>Clear stuck queues with Redis CLI commands</li>
<li>Restart the Redis pod if necessary</li>
</ul>
</li>
<li><p><strong>Worker Configuration</strong>:</p>
<ul>
<li>Ensure workers are configured for queue mode</li>
<li>Verify workers have the same encryption key as the main n8n instance</li>
</ul>
</li>
</ol>
<h3 id="heading-ssltls-certificate-issues">SSL/TLS Certificate Issues</h3>
<p><strong>Symptoms:</strong></p>
<ul>
<li>Browser shows "Your connection is not private" warning</li>
<li>Certificate errors in browser console</li>
<li>Ingress controller logs showing certificate issues</li>
</ul>
<p><strong>Diagnostic Approach:</strong></p>
<ol>
<li><p>Check certificate status:</p>
<pre><code class="lang-bash">kubectl get certificate -n n8n
</code></pre>
</li>
<li><p>Examine certificate details:</p>
<pre><code class="lang-bash">kubectl describe certificate n8n-tls-secret -n n8n
</code></pre>
</li>
<li><p>Check cert-manager logs:</p>
<pre><code class="lang-bash">kubectl logs -n cert-manager -l app=cert-manager
</code></pre>
</li>
<li><p>Verify the ClusterIssuer status:</p>
<pre><code class="lang-bash">kubectl describe clusterissuer letsencrypt-prod
</code></pre>
</li>
</ol>
<p><strong>Common Resolutions:</strong></p>
<ol>
<li><p><strong>Domain Validation Issues</strong>:</p>
<ul>
<li>Ensure DNS is correctly configured to point to your ingress controller IP</li>
<li>Verify that the HTTP-01 challenge can reach your ingress controller</li>
<li>Check if Let's Encrypt rate limits have been hit (for example, at most 5 duplicate certificates per week)</li>
</ul>
</li>
<li><p><strong>Configuration Issues</strong>:</p>
<ul>
<li>Verify email address in ClusterIssuer is valid</li>
<li>Ensure ingress class is correctly specified</li>
<li>Check TLS section in Ingress resource matches your domain</li>
</ul>
</li>
<li><p><strong>Certificate Renewal Issues</strong>:</p>
<ul>
<li>Manually trigger certificate renewal if needed</li>
<li>Check if cert-manager CRDs are up to date</li>
<li>Verify cert-manager has necessary permissions</li>
</ul>
</li>
</ol>
<h3 id="heading-ingress-and-external-access-issues">Ingress and External Access Issues</h3>
<p><strong>Symptoms:</strong></p>
<ul>
<li>Unable to access n8n UI from the internet</li>
<li>404, 502, or other HTTP errors when accessing your domain</li>
<li>Timeouts when attempting to connect</li>
</ul>
<p><strong>Diagnostic Approach:</strong></p>
<ol>
<li><p>Check Ingress resource status:</p>
<pre><code class="lang-bash">kubectl get ingress -n n8n
</code></pre>
</li>
<li><p>Verify Ingress controller pods:</p>
<pre><code class="lang-bash">kubectl get pods -n default -l app.kubernetes.io/name=ingress-nginx
</code></pre>
</li>
<li><p>Check Ingress controller logs:</p>
<pre><code class="lang-bash">kubectl logs -n default -l app.kubernetes.io/name=ingress-nginx
</code></pre>
</li>
<li><p>Test connectivity to Ingress IP:</p>
<pre><code class="lang-bash">curl -v http://&lt;ingress-ip&gt;
</code></pre>
</li>
</ol>
<p><strong>Common Resolutions:</strong></p>
<ol>
<li><p><strong>DNS Issues</strong>:</p>
<ul>
<li>Verify DNS A record points to the correct Ingress Controller IP</li>
<li>Check if DNS propagation is complete (may take up to 48 hours)</li>
<li>Use <code>nslookup</code> or <code>dig</code> to verify DNS resolution</li>
</ul>
</li>
<li><p><strong>Ingress Configuration</strong>:</p>
<ul>
<li>Ensure the Ingress resource specifies the correct service and port</li>
<li>Verify host rules match your domain exactly</li>
<li>Check path settings and ensure they match n8n requirements</li>
</ul>
</li>
<li><p><strong>Network Issues</strong>:</p>
<ul>
<li>Verify Azure Network Security Groups allow traffic on ports 80 and 443</li>
<li>Check if any firewalls are blocking access to your AKS cluster</li>
<li>Ensure the Ingress Controller service is of type LoadBalancer with an external IP</li>
</ul>
</li>
</ol>
<h2 id="heading-troubleshooting-decision-tree">Troubleshooting Decision Tree</h2>
<p>The following diagram presents a structured approach to troubleshooting n8n deployment issues:</p>
<pre><code class="lang-mermaid">flowchart TD
    start[Issue Detected] --&gt; issue{What type of issue?}

    issue --&gt;|UI/Access| ui[UI or Access Issue]
    issue --&gt;|Workflow Execution| exec[Workflow Execution Issue]
    issue --&gt;|Infrastructure| infra[Infrastructure Issue]

    ui --&gt; uiDiag{UI Diagnostic}
    uiDiag --&gt;|Cannot reach site| dns[Check DNS &amp; Ingress]
    uiDiag --&gt;|SSL Error| cert[Check Certificate]
    uiDiag --&gt;|Login Issue| auth[Check Authentication]
    uiDiag --&gt;|UI Loads but Errors| n8nui[Check n8n Logs]

    dns --&gt; dnsFix[DNS &amp; Ingress Fixes]
    cert --&gt; certFix[Certificate Fixes]
    auth --&gt; authFix[Auth Fixes]
    n8nui --&gt; uiFix[UI Issue Fixes]

    exec --&gt; execDiag{Execution Diagnostic}
    execDiag --&gt;|Workflow Stuck in Queue| queue[Check Redis &amp; Workers]
    execDiag --&gt;|Database Errors| db[Check PostgreSQL]
    execDiag --&gt;|Execution Failures| node[Check Node Errors]

    queue --&gt; queueFix[Queue &amp; Worker Fixes]
    db --&gt; dbFix[Database Fixes]
    node --&gt; nodeFix[Node-specific Fixes]

    infra --&gt; infraDiag{Infrastructure Diagnostic}
    infraDiag --&gt;|Pod Issues| pod[Check Pod Status]
    infraDiag --&gt;|Resource Issues| res[Check Resource Usage]
    infraDiag --&gt;|Network Issues| net[Check Network Policy]

    pod --&gt; podFix[Pod Fixes]
    res --&gt; resFix[Resource Fixes]
    net --&gt; netFix[Network Fixes]

    style start fill:#f96,stroke:#333
    classDef fix fill:#9f6,stroke:#333
    class dnsFix,certFix,authFix,uiFix,queueFix,dbFix,nodeFix,podFix,resFix,netFix fix
</code></pre>
<h2 id="heading-advanced-diagnostic-workflows">Advanced Diagnostic Workflows</h2>
<h3 id="heading-database-performance-issues">Database Performance Issues</h3>
<p>If workflow execution is slow or database operations are taking too long:</p>
<ol>
<li><p>Check database load:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=postgres -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  psql -U postgres -c <span class="hljs-string">"SELECT * FROM pg_stat_activity WHERE state = 'active';"</span>
</code></pre>
</li>
<li><p>Identify long-running queries:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=postgres -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  psql -U postgres -c <span class="hljs-string">"SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"</span>
</code></pre>
</li>
<li><p>Check table sizes (for index usage, query <code>pg_stat_user_indexes</code> in the same way):</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=postgres -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  psql -U postgres -d n8n -c <span class="hljs-string">"SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;"</span>
</code></pre>
</li>
</ol>
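<p>If step 2 surfaces a runaway query, it can be cancelled by its <code>pid</code> (a sketch — <code>12345</code> stands in for a pid from the previous output; <code>pg_cancel_backend</code> interrupts just the query, while <code>pg_terminate_backend</code> drops the whole connection):</p>
<pre><code class="lang-bash">kubectl exec -it $(kubectl get pod -l app=postgres -n n8n -o jsonpath='{.items[0].metadata.name}') -n n8n -- \
  psql -U postgres -c "SELECT pg_cancel_backend(12345);"
</code></pre>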
<h3 id="heading-memory-leak-investigation">Memory Leak Investigation</h3>
<p>If n8n pods are steadily increasing in memory usage:</p>
<ol>
<li><p>Get memory usage metrics:</p>
<pre><code class="lang-bash">kubectl top pods -n n8n
</code></pre>
</li>
<li><p>Check container memory limit and usage:</p>
<pre><code class="lang-bash">kubectl describe pod $(kubectl get pod -l app=n8n -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n
</code></pre>
</li>
<li><p>Generate a heap dump (for advanced debugging). Note that <code>node -e</code> run via <code>kubectl exec</code> snapshots the helper process it spawns, not the running n8n process; to profile n8n itself, enable its inspector (for example by sending the process <code>SIGUSR1</code> on Linux) and attach Chrome DevTools:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  node --expose-gc -e <span class="hljs-string">"const fs=require('fs'); setTimeout(() =&gt; { global.gc(); const heapSnapshot = require('v8').getHeapSnapshot(); const file = fs.createWriteStream('/tmp/heap.json'); heapSnapshot.pipe(file); }, 1000);"</span>

kubectl cp n8n/$(kubectl get pod -l app=n8n -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>):/tmp/heap.json ./heap.json
</code></pre>
</li>
<li><p>Analyze the heap dump with Chrome DevTools or a memory analyzer.</p>
</li>
</ol>
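<p>When tracking growth over time (steps 1 and 2), it helps to flag only the pods above a memory threshold. A small helper for that (hypothetical, assuming input in the <code>kubectl top pods --no-headers</code> column format <code>NAME CPU MEMORY</code> with Mi units; the sample lines are fabricated):</p>
<pre><code class="lang-bash"># Print pods whose memory column exceeds a threshold in Mi (assumes Mi units)
high_mem() {
  local limit_mi="$1"
  awk -v lim="$limit_mi" '{gsub(/Mi/,"",$3); if ($3+0 > lim) print $1, $3 "Mi"}'
}

printf 'n8n-abc 250m 412Mi\nworker-xyz 900m 1300Mi\n' | high_mem 1024
# -> worker-xyz 1300Mi
</code></pre>
<p>In the cluster you would pipe real data through it, e.g. <code>kubectl top pods -n n8n --no-headers | high_mem 1024</code>; recording this periodically makes steady growth easy to spot.</p>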
<h3 id="heading-network-connectivity-issues">Network Connectivity Issues</h3>
<p>If services can't communicate with each other:</p>
<ol>
<li><p>Test network connectivity between pods:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  nc -zv postgres-service 5432

kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  nc -zv redis-service 6379
</code></pre>
</li>
<li><p>Check if network policies are restricting traffic:</p>
<pre><code class="lang-bash">kubectl get networkpolicies -n n8n
</code></pre>
</li>
<li><p>Verify DNS resolution:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  nslookup postgres-service

kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  nslookup redis-service
</code></pre>
</li>
</ol>
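<p>The per-service checks in step 1 can be folded into a loop over every dependency (a sketch; the service names follow this series' manifests, and <code>nc</code> being available in the n8n image is an assumption):</p>
<pre><code class="lang-bash"># Test TCP connectivity from the n8n pod to each dependency
POD=$(kubectl get pod -l app=n8n -n n8n -o jsonpath='{.items[0].metadata.name}')
for target in postgres-service:5432 redis-service:6379; do
  host="${target%%:*}"; port="${target##*:}"
  kubectl exec -it "$POD" -n n8n -- nc -zv "$host" "$port"
done
</code></pre>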
<h2 id="heading-creating-a-diagnostic-information-bundle">Creating a Diagnostic Information Bundle</h2>
<p>For complex issues, it's often helpful to collect comprehensive diagnostic information:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
<span class="hljs-comment"># collect-diagnostics.sh - Collect diagnostic information for n8n deployment</span>

<span class="hljs-comment"># Create output directory</span>
mkdir -p n8n-diagnostics
<span class="hljs-built_in">cd</span> n8n-diagnostics

<span class="hljs-comment"># Collect pod information</span>
kubectl get pods -n n8n -o yaml &gt; pods.yaml
kubectl describe pods -n n8n &gt; pods-describe.txt

<span class="hljs-comment"># Collect logs</span>
<span class="hljs-keyword">for</span> pod <span class="hljs-keyword">in</span> $(kubectl get pods -n n8n -o jsonpath=<span class="hljs-string">'{.items[*].metadata.name}'</span>); <span class="hljs-keyword">do</span>
  kubectl logs <span class="hljs-variable">$pod</span> -n n8n &gt; <span class="hljs-variable">$pod</span>-logs.txt
<span class="hljs-keyword">done</span>

<span class="hljs-comment"># Collect service and endpoint information</span>
kubectl get svc,endpoints -n n8n -o yaml &gt; services.yaml

<span class="hljs-comment"># Collect ingress and certificate information</span>
kubectl get ingress,certificate -n n8n -o yaml &gt; ingress-cert.yaml
kubectl describe ingress,certificate -n n8n &gt; ingress-cert-describe.txt

<span class="hljs-comment"># Collect events</span>
kubectl get events -n n8n &gt; events.txt

<span class="hljs-comment"># Collect resource usage</span>
kubectl top pods -n n8n &gt; pod-resources.txt
kubectl top nodes &gt; node-resources.txt

<span class="hljs-comment"># Create a tar archive</span>
tar -czf n8n-diagnostics.tar.gz *

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Diagnostic information collected in n8n-diagnostics.tar.gz"</span>
</code></pre>
<h2 id="heading-troubleshooting-cheatsheet">Troubleshooting Cheatsheet</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Issue</td><td>Check Command</td><td>Resolution Strategy</td></tr>
</thead>
<tbody>
<tr>
<td>Pod won't start</td><td><code>kubectl describe pod &lt;pod-name&gt; -n n8n</code></td><td>Check events section for errors</td></tr>
<tr>
<td>Pod crashing</td><td><code>kubectl logs &lt;pod-name&gt; -n n8n</code></td><td>Look for error messages near the end</td></tr>
<tr>
<td>Service unavailable</td><td><code>kubectl get endpoints &lt;service-name&gt; -n n8n</code></td><td>Verify endpoints exist</td></tr>
<tr>
<td>Certificate issues</td><td><code>kubectl describe certificate &lt;cert-name&gt; -n n8n</code></td><td>Check events and conditions</td></tr>
<tr>
<td>Database connection</td><td><code>kubectl exec -it &lt;pod-name&gt; -n n8n -- env | grep DB_</code></td><td>Verify environment variables</td></tr>
<tr>
<td>Redis connection</td><td><code>kubectl exec -it &lt;pod-name&gt; -n n8n -- env | grep REDIS</code></td><td>Verify environment variables</td></tr>
<tr>
<td>Ingress not working</td><td><code>kubectl get ingress &lt;ingress-name&gt; -n n8n</code></td><td>Check ADDRESS field has an IP</td></tr>
<tr>
<td>Resource constraints</td><td><code>kubectl top pods -n n8n</code></td><td>Check for pods near resource limits</td></tr>
<tr>
<td>Webhook not triggering</td><td><code>kubectl logs &lt;pod-name&gt; -n n8n | grep webhook</code></td><td>Verify webhook URL and connectivity</td></tr>
</tbody>
</table>
</div><h2 id="heading-summary">Summary</h2>
<p>Troubleshooting a production n8n deployment on AKS requires a systematic approach built on an understanding of:</p>
<ol>
<li><strong>Common failure modes</strong> and their symptoms</li>
<li><strong>Diagnostic approaches</strong> for each component</li>
<li><strong>Resolution strategies</strong> for different issues</li>
</ol>
<p>With these in place, you can quickly identify and resolve problems, minimizing downtime and ensuring a reliable workflow automation platform.</p>
<p>Remember that many issues can be prevented through proper monitoring and proactive maintenance, as discussed in the previous section. When problems do occur, having these troubleshooting procedures documented will significantly reduce the mean time to resolution.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Armed with these troubleshooting strategies and diagnostic workflows, you can quickly identify and resolve issues in your n8n deployment. Remember that systematic investigation and a good understanding of the architecture are key to efficient problem resolution.</p>
<p>In our final article, we'll summarize what we've accomplished, review the benefits of our architecture, and explore advanced enhancements for the future. [Continue to Part 8: Conclusion and Next Steps]</p>
<hr />
<p>What troubleshooting techniques have you found most effective for Kubernetes applications? Have you encountered any particularly challenging issues with workflow automation systems? Share your stories in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[Monitoring and Optimizing n8n on Kubernetes: The Complete Guide]]></title><description><![CDATA[This is Part 6 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. View the complete series here.
Monitoring, Maintenance, and Optimization
A production-grade deployment requires robust monitoring...]]></description><link>https://nikhilmishra.xyz/monitoring-and-optimizing-n8n-on-kubernetes-the-complete-guide</link><guid isPermaLink="true">https://nikhilmishra.xyz/monitoring-and-optimizing-n8n-on-kubernetes-the-complete-guide</guid><category><![CDATA[n8n]]></category><category><![CDATA[Performance Monitoring]]></category><category><![CDATA[performance]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[k8s]]></category><category><![CDATA[optimization]]></category><category><![CDATA[Workflow Automation]]></category><category><![CDATA[#kubernetes-maintainance]]></category><category><![CDATA[OptimizationStrategies]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Sun, 02 Mar 2025 09:11:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740906571781/04cdf6f2-a04c-4236-863d-0d3a87457aca.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is Part 6 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. <a target="_blank" href="https://nikhilmishra.live/series/n8n-azure-k8s">View the complete series here</a>.</em></p>
<h1 id="heading-monitoring-maintenance-and-optimization">Monitoring, Maintenance, and Optimization</h1>
<p>A production-grade deployment requires robust monitoring, routine maintenance procedures, and performance optimization. In this section, we'll cover:</p>
<ol>
<li>Monitoring strategies for n8n on AKS</li>
<li>Maintenance procedures and best practices</li>
<li>Performance optimization techniques</li>
<li>Cost optimization approaches</li>
</ol>
<h2 id="heading-monitoring-your-n8n-deployment">Monitoring Your n8n Deployment</h2>
<h3 id="heading-key-metrics-to-monitor">Key Metrics to Monitor</h3>
<p>For an n8n deployment, several metrics are critical to track:</p>
<ol>
<li><p><strong>Application Health</strong>:</p>
<ul>
<li>Pod readiness and liveness</li>
<li>API response times</li>
<li>Error rates in logs</li>
<li>Webhook reliability</li>
</ul>
</li>
<li><p><strong>Infrastructure Metrics</strong>:</p>
<ul>
<li>CPU and memory usage across all components</li>
<li>Storage usage and growth rate</li>
<li>Network traffic patterns</li>
<li>Queue length and processing times</li>
</ul>
</li>
<li><p><strong>Database Performance</strong>:</p>
<ul>
<li>Query execution times</li>
<li>Connection pool utilization</li>
<li>Database size growth</li>
<li>Transaction rates</li>
</ul>
</li>
</ol>
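<p>Several of these metrics come straight from Azure Monitor, but the log error rate is easy to approximate with a small pipeline helper (a sketch — in practice you would feed it <code>kubectl logs -l app=n8n -n n8n --since=1h</code>; the input below is a fabricated sample):</p>
<pre><code class="lang-bash"># Percentage of log lines containing ERROR
error_rate() {
  awk 'BEGIN{e=0;t=0} {t++} /ERROR/{e++} END{if (t > 0) printf "%.1f%%\n", 100*e/t; else print "0.0%"}'
}

printf 'INFO ok\nERROR boom\nINFO ok\nINFO ok\n' | error_rate
# -> 25.0%
</code></pre>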
<h3 id="heading-implementing-azure-monitor">Implementing Azure Monitor</h3>
<p>Azure Monitor provides comprehensive monitoring for AKS clusters. We implemented it with:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Enable Azure Monitor for container insights</span>
az aks enable-addons -a monitoring -n n8n-cluster -g n8n-aks-rg
</code></pre>
<p>This enables:</p>
<ul>
<li>Container metrics collection</li>
<li>Log aggregation</li>
<li>Performance dashboards</li>
<li>Alert configuration</li>
</ul>
<h3 id="heading-creating-custom-dashboards">Creating Custom Dashboards</h3>
<p>We created custom dashboards in Azure portal for n8n-specific metrics:</p>
<ol>
<li><p><strong>n8n Operations Dashboard</strong>:</p>
<ul>
<li>Workflow execution rates</li>
<li>Error percentages</li>
<li>API request volumes</li>
<li>Active user sessions</li>
</ul>
</li>
<li><p><strong>Infrastructure Health Dashboard</strong>:</p>
<ul>
<li>Pod status across namespaces</li>
<li>Node resource utilization</li>
<li>Storage consumption</li>
<li>Networking metrics</li>
</ul>
</li>
</ol>
<h3 id="heading-setting-up-alerts">Setting Up Alerts</h3>
<p>Critical alerts were configured for:</p>
<ol>
<li><p><strong>High Severity</strong>:</p>
<ul>
<li>Any pod in Failed or CrashLoopBackOff state</li>
<li>Database or Redis unavailability</li>
<li>Worker queue backlog exceeding thresholds</li>
<li>Certificate expiration warnings</li>
</ul>
</li>
<li><p><strong>Medium Severity</strong>:</p>
<ul>
<li>CPU or memory usage above 80% for over 15 minutes</li>
<li>Persistent storage approaching capacity</li>
<li>High error rates in application logs</li>
<li>Unusual traffic patterns (potential security issues)</li>
</ul>
</li>
</ol>
<h3 id="heading-log-management">Log Management</h3>
<p>For comprehensive log management, we configured:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ConfigMap</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">fluentd-config</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">kube-system</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">fluent.conf:</span> <span class="hljs-string">|
    # Log collection and forwarding configuration
    # Details omitted for brevity</span>
</code></pre>
<p>This configuration:</p>
<ul>
<li>Collects container logs across the cluster</li>
<li>Enriches logs with metadata (namespace, pod name, etc.)</li>
<li>Forwards logs to Azure Log Analytics</li>
<li>Enables structured querying and analytics</li>
</ul>
<h2 id="heading-maintenance-procedures">Maintenance Procedures</h2>
<h3 id="heading-backup-and-disaster-recovery">Backup and Disaster Recovery</h3>
<p>We implemented a comprehensive backup strategy:</p>
<ol>
<li><strong>Database Backups</strong>:<ul>
<li>Daily full backups retained for 30 days</li>
<li>Point-in-time recovery capability</li>
<li>Geo-redundant storage for backups</li>
<li>Automated validation of backup integrity</li>
</ul>
</li>
</ol>
<p>Implementation using a CronJob:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">batch/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CronJob</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-backup</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">schedule:</span> <span class="hljs-string">"0 2 * * *"</span>  <span class="hljs-comment"># Run daily at 2 AM</span>
  <span class="hljs-attr">jobTemplate:</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">template:</span>
        <span class="hljs-attr">spec:</span>
          <span class="hljs-attr">containers:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-backup</span>
            <span class="hljs-attr">image:</span> <span class="hljs-string">postgres:13</span>
            <span class="hljs-attr">command:</span> [<span class="hljs-string">"/bin/bash"</span>, <span class="hljs-string">"-c"</span>]
            <span class="hljs-attr">args:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              pg_dump -h postgres-service -U n8n -d n8n | gzip &gt; /backups/n8n-$(date +%Y%m%d).sql.gz
              # Upload to Azure Blob Storage (note: the stock postgres:13 image does not ship the az CLI; use an image that bundles it)
              az storage blob upload --account-name n8nbackups --container-name backups --name n8n-$(date +%Y%m%d).sql.gz --file /backups/n8n-$(date +%Y%m%d).sql.gz
</span>            <span class="hljs-attr">volumeMounts:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">backup-volume</span>
              <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/backups</span>
            <span class="hljs-attr">env:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">PGPASSWORD</span>
              <span class="hljs-attr">valueFrom:</span>
                <span class="hljs-attr">secretKeyRef:</span>
                  <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-secret</span>
                  <span class="hljs-attr">key:</span> <span class="hljs-string">DB_POSTGRESDB_PASSWORD</span>
          <span class="hljs-attr">volumes:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">backup-volume</span>
            <span class="hljs-attr">emptyDir:</span> {}
          <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">OnFailure</span>
</code></pre>
<ol start="2">
<li><strong>Disaster Recovery Plan</strong>:<ul>
<li>Documented recovery procedures</li>
<li>Regular DR testing (quarterly)</li>
<li>Recovery time objective (RTO): 2 hours</li>
<li>Recovery point objective (RPO): 24 hours</li>
</ul>
</li>
</ol>
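<p>The "automated validation of backup integrity" step can start with a simple <code>gzip -t</code> pass before old backups are rotated out (a minimal local sketch; the stand-in dump and file path are illustrative):</p>
<pre><code class="lang-bash"># Create a stand-in dump, then verify the archive is readable
printf 'SELECT 1;\n' | gzip -c > /tmp/n8n-backup-test.sql.gz
if gzip -t /tmp/n8n-backup-test.sql.gz; then echo "archive OK"; fi
# -> archive OK
</code></pre>
<p>The matching restore path is roughly <code>gunzip -c backup.sql.gz</code> piped into <code>psql</code> on the postgres pod — worth rehearsing during the quarterly DR tests.</p>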
<h3 id="heading-update-strategy">Update Strategy</h3>
<p>For keeping the deployment up-to-date, we established:</p>
<ol>
<li><p><strong>n8n Version Updates</strong>:</p>
<ul>
<li>Monthly update schedule</li>
<li>Canary deployment approach (update one pod, validate, then update others)</li>
<li>Rollback procedures documented and tested</li>
</ul>
</li>
<li><p><strong>Kubernetes and Infrastructure Updates</strong>:</p>
<ul>
<li>Quarterly AKS version assessment</li>
<li>Security patches applied promptly</li>
<li>Node recycling strategy (one node at a time)</li>
</ul>
</li>
</ol>
<p>Update implementation with zero-downtime:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Update n8n with rolling deployment</span>
kubectl <span class="hljs-built_in">set</span> image deployment/n8n n8n=n8nio/n8n:new-version -n n8n

<span class="hljs-comment"># Wait for rollout to complete</span>
kubectl rollout status deployment/n8n -n n8n

<span class="hljs-comment"># If issues detected, rollback</span>
kubectl rollout undo deployment/n8n -n n8n
</code></pre>
<h3 id="heading-maintenance-powershell-script">Maintenance PowerShell Script</h3>
<p>We created a maintenance PowerShell script for routine operations:</p>
<pre><code class="lang-powershell"><span class="hljs-comment"># manage-n8n.ps1 - Common management operations</span>

<span class="hljs-keyword">param</span>(
    [<span class="hljs-type">Parameter</span>(<span class="hljs-type">Mandatory</span>=<span class="hljs-variable">$true</span>)]
    [<span class="hljs-type">ValidateSet</span>(<span class="hljs-string">"status"</span>, <span class="hljs-string">"logs"</span>, <span class="hljs-string">"restart"</span>, <span class="hljs-string">"scale"</span>, <span class="hljs-string">"backup"</span>)]
    [<span class="hljs-built_in">string</span>]<span class="hljs-variable">$Operation</span>,

    [<span class="hljs-type">Parameter</span>(<span class="hljs-type">Mandatory</span>=<span class="hljs-variable">$false</span>)]
    [<span class="hljs-built_in">string</span>]<span class="hljs-variable">$Component</span> = <span class="hljs-string">"n8n"</span>,

    [<span class="hljs-type">Parameter</span>(<span class="hljs-type">Mandatory</span>=<span class="hljs-variable">$false</span>)]
    [<span class="hljs-built_in">int</span>]<span class="hljs-variable">$Replicas</span> = <span class="hljs-number">0</span>
)

<span class="hljs-comment"># Script implementation omitted for brevity</span>
<span class="hljs-comment"># See full script in the repository</span>
</code></pre>
<p>This script simplifies common maintenance tasks and ensures consistent procedures.</p>
<h2 id="heading-performance-optimization">Performance Optimization</h2>
<h3 id="heading-resource-tuning">Resource Tuning</h3>
<p>Based on performance monitoring, we optimized resource allocations:</p>
<ol>
<li><p><strong>n8n Workers</strong>:</p>
<ul>
<li>Increased memory allocation to 1.5Gi for complex workflows</li>
<li>Fine-tuned CPU requests based on actual usage patterns</li>
<li>Adjusted HPA thresholds to scale earlier</li>
</ul>
</li>
<li><p><strong>PostgreSQL</strong>:</p>
<ul>
<li>Optimized shared_buffers and work_mem settings</li>
<li>Implemented connection pooling with PgBouncer</li>
<li>Added indexes for frequently queried fields</li>
</ul>
</li>
</ol>
<p>Implementation for PostgreSQL tuning:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ConfigMap</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-config</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">postgresql.conf:</span> <span class="hljs-string">|
    shared_buffers = 256MB
    work_mem = 16MB
    maintenance_work_mem = 64MB
    effective_cache_size = 768MB
    max_connections = 100
    # Additional optimized settings omitted for brevity</span>
</code></pre>
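<p>The "scale earlier" HPA adjustment for the workers could look like the following (illustrative values; the Deployment name <code>n8n-worker</code> is an assumption):</p>
<pre><code class="lang-yaml">apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: n8n-worker
  namespace: n8n
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: n8n-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # scale before saturation rather than at 80-90%
</code></pre>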
<h3 id="heading-n8n-configuration-optimization">n8n Configuration Optimization</h3>
<p>We fine-tuned n8n configuration based on production usage patterns:</p>
<ol>
<li><p><strong>Workflow Execution Settings</strong>:</p>
<ul>
<li>Adjusted <code>EXECUTIONS_PROCESS</code> for optimal resource usage</li>
<li>Configured execution timeout parameters for long-running workflows</li>
<li>Optimized retry mechanisms for external service connections</li>
</ul>
</li>
<li><p><strong>Queue Management</strong>:</p>
<ul>
<li>Implemented queue priority settings for critical workflows</li>
<li>Configured dedicated queues for different workflow types</li>
<li>Optimized job concurrency settings per worker</li>
</ul>
</li>
</ol>
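<p>In the n8n Deployment these settings are ordinary environment variables. An illustrative fragment (the values are starting points to tune, not recommendations):</p>
<pre><code class="lang-yaml">- name: EXECUTIONS_PROCESS
  value: "main"        # or "own" to run each execution in its own process
- name: EXECUTIONS_TIMEOUT
  value: "3600"        # default workflow timeout in seconds
- name: EXECUTIONS_TIMEOUT_MAX
  value: "7200"        # ceiling a workflow can raise its own timeout to
</code></pre>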
<h2 id="heading-cost-optimization">Cost Optimization</h2>
<h3 id="heading-resource-right-sizing">Resource Right-Sizing</h3>
<p>We implemented several cost optimization strategies:</p>
<ol>
<li><p><strong>Node Pools and VM Sizing</strong>:</p>
<ul>
<li>Used Azure Spot Instances for worker nodes (50-80% cost savings)</li>
<li>Implemented node auto-scaling to reduce idle capacity</li>
<li>Right-sized VM types based on actual usage patterns</li>
</ul>
</li>
<li><p><strong>Storage Optimization</strong>:</p>
<ul>
<li>Implemented log retention policies</li>
<li>Used premium storage only for performance-critical components</li>
<li>Set up automatic storage cleanup for temporary data</li>
</ul>
</li>
</ol>
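<p>Adding a Spot-backed pool for the workers is a single CLI call (a sketch; the pool name and VM size are examples, and <code>--spot-max-price -1</code> means "pay up to the current on-demand price"):</p>
<pre><code class="lang-bash">az aks nodepool add \
  --resource-group n8n-aks-rg \
  --cluster-name n8n-cluster \
  --name spotworkers \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3 \
  --node-vm-size Standard_D2s_v3
</code></pre>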
<h3 id="heading-cost-analysis">Cost Analysis</h3>
<p>We conducted a comprehensive cost analysis:</p>
<pre><code>Monthly Cost Breakdown:
- AKS Nodes (<span class="hljs-number">2</span> x D2s v3): $<span class="hljs-number">140.16</span>
- Storage (Premium SSD, <span class="hljs-number">64</span> GB): $<span class="hljs-number">10.44</span>
- Networking (Load Balancer, Outbound): $<span class="hljs-number">23.00</span>
- Monitoring: $<span class="hljs-number">7.50</span>
- Backups: $<span class="hljs-number">5.20</span>
----------------------------------
Total Estimated Monthly Cost: $<span class="hljs-number">186.30</span>
</code></pre><p>Cost optimization reduced the original estimate by approximately 30%.</p>
<h2 id="heading-operational-architecture">Operational Architecture</h2>
<p>The complete operational architecture with monitoring components can be visualized as:</p>
<pre><code class="lang-mermaid">flowchart TB
    subgraph "Azure AKS Cluster"
        subgraph "n8n Workloads"
            n8n["n8n Main"]
            workers["n8n Workers"]
            pg["PostgreSQL"]
            redis["Redis"]
        end

        subgraph "Monitoring"
            azm["Azure Monitor"]
            la["Log Analytics"]
            ai["Application Insights"]
        end

        subgraph "Operations"
            backup["Backup CronJob"]
            hpa["HPA Controller"]
        end
    end

    subgraph "Azure Services"
        storage["Azure Storage\n(Backups)"]
        alerts["Azure Alerts"]
        dashboard["Azure Dashboard"]
    end

    n8n --&gt; azm
    workers --&gt; azm
    pg --&gt; azm
    redis --&gt; azm

    azm --&gt; la
    la --&gt; ai

    backup --&gt; pg
    backup --&gt; storage
    hpa --&gt; workers

    azm --&gt; alerts
    la --&gt; dashboard

    style azm fill:#f9f,stroke:#333
    style la fill:#f9f,stroke:#333
    style ai fill:#f9f,stroke:#333
    style backup fill:#ff9,stroke:#333
    style hpa fill:#ff9,stroke:#333
</code></pre>
<h2 id="heading-health-checks-and-validation">Health Checks and Validation</h2>
<h3 id="heading-comprehensive-health-check-script">Comprehensive Health Check Script</h3>
<p>We created a comprehensive health check script to verify all components:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
<span class="hljs-comment"># health-check.sh - Verify all components of n8n deployment</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Checking pod status..."</span>
kubectl get pods -n n8n

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Checking service endpoints..."</span>
kubectl get endpoints -n n8n

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Checking certificate status..."</span>
kubectl get certificate -n n8n

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Checking database connection..."</span>
kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  node -e <span class="hljs-string">"const { Pool } = require('pg'); const pool = new Pool({connectionString: process.env.DB_POSTGRESDB_URL}); pool.query('SELECT NOW()', (err, res) =&gt; { console.log(err || res.rows[0]); pool.end(); })"</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Checking Redis connection..."</span>
kubectl <span class="hljs-built_in">exec</span> -it $(kubectl get pod -l app=n8n -n n8n -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>) -n n8n -- \
  node -e <span class="hljs-string">"const Redis = require('ioredis'); const redis = new Redis(process.env.QUEUE_BULL_REDIS_HOST); redis.ping().then(res =&gt; { console.log(res); redis.disconnect(); })"</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Checking external access..."</span>
curl -I https://n8n.behooked.co
</code></pre>
<p>This script provides a quick way to validate all aspects of the deployment.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>With our monitoring, maintenance, and optimization strategies in place, our n8n deployment is truly production-ready. We can proactively identify issues, maintain system health, and optimize resources for both performance and cost efficiency.</p>
<p>In the next article, we'll explore comprehensive troubleshooting approaches for common issues you might encounter with your n8n deployment. [Continue to Part 7: Troubleshooting Guide]</p>
<hr />
<p>What monitoring tools have you found most effective for Kubernetes workloads? Are there specific metrics you focus on for workflow automation systems? Share your experiences in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[Securing n8n on Kubernetes: Ingress, SSL/TLS, and Best Practices]]></title><description><![CDATA[This is Part 5 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. View the complete series here.
Configuring External Access and SSL/TLS
With our n8n application successfully deployed, we need to...]]></description><link>https://nikhilmishra.xyz/securing-n8n-on-kubernetes-ingress-ssltls-and-best-practices</link><guid isPermaLink="true">https://nikhilmishra.xyz/securing-n8n-on-kubernetes-ingress-ssltls-and-best-practices</guid><category><![CDATA[secure workflow automatio]]></category><category><![CDATA[n8n]]></category><category><![CDATA[Security]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[SSL]]></category><category><![CDATA[SSL Configuration]]></category><category><![CDATA[WorkflowConfiguration]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Sun, 02 Mar 2025 09:03:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740906120675/76f6fc2d-7801-4a4e-a5d9-5662562eaab5.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is Part 5 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. <a target="_blank" href="https://nikhilmishra.live/series/n8n-azure-k8s">View the complete series here</a>.</em></p>
<h1 id="heading-configuring-external-access-and-ssltls">Configuring External Access and SSL/TLS</h1>
<p>With our n8n application successfully deployed, we need to make it securely accessible from the internet. This involves:</p>
<ol>
<li>Setting up an Ingress resource to route traffic to n8n</li>
<li>Implementing SSL/TLS encryption for secure communication</li>
<li>Configuring DNS for external access</li>
</ol>
<p>Let's implement these components to complete our production deployment.</p>
<h2 id="heading-implementing-cert-manager-for-ssltls">Implementing Cert-Manager for SSL/TLS</h2>
<h3 id="heading-why-ssltls-is-critical">Why SSL/TLS is Critical</h3>
<p>For a production workflow automation system, SSL/TLS encryption is essential because:</p>
<ul>
<li>It protects sensitive data transmitted between clients and n8n</li>
<li>It prevents man-in-the-middle attacks</li>
<li>It builds trust with users and external services</li>
<li>It's required for many modern browser features</li>
<li>It's a prerequisite for compliance with security standards</li>
</ul>
<h3 id="heading-installing-cert-manager">Installing Cert-Manager</h3>
<p>We used cert-manager to automate the issuance and renewal of SSL/TLS certificates from Let's Encrypt:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Add the Jetstack Helm repository</span>
helm repo add jetstack https://charts.jetstack.io
helm repo update

<span class="hljs-comment"># Install cert-manager with CRDs</span>
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --<span class="hljs-built_in">set</span> installCRDs=<span class="hljs-literal">true</span>
</code></pre>
<p>After installation, we verified cert-manager was running correctly:</p>
<pre><code class="lang-bash">kubectl get pods -n cert-manager
</code></pre>
<p>Expected output:</p>
<pre><code>NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-xxxxxxxxx-xxxxx               <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">2</span>m
cert-manager-cainjector-xxxxxxxxx-xxxxx    <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">2</span>m
cert-manager-webhook-xxxxxxxxx-xxxxx       <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">2</span>m
</code></pre><h3 id="heading-configuring-a-clusterissuer-for-lets-encrypt">Configuring a ClusterIssuer for Let's Encrypt</h3>
<p>With cert-manager installed, we created a ClusterIssuer resource to integrate with Let's Encrypt:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cert-manager.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterIssuer</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">letsencrypt-prod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">acme:</span>
    <span class="hljs-comment"># The ACME server URL for Let's Encrypt production</span>
    <span class="hljs-attr">server:</span> <span class="hljs-string">https://acme-v02.api.letsencrypt.org/directory</span>
    <span class="hljs-comment"># Email address for Important account notifications</span>
    <span class="hljs-attr">email:</span> <span class="hljs-string">your-email@example.com</span>
    <span class="hljs-comment"># Name of a secret used to store the ACME account private key</span>
    <span class="hljs-attr">privateKeySecretRef:</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">letsencrypt-prod-account-key</span>
    <span class="hljs-comment"># Enable the HTTP-01 challenge provider</span>
    <span class="hljs-attr">solvers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">http01:</span>
        <span class="hljs-attr">ingress:</span>
          <span class="hljs-attr">class:</span> <span class="hljs-string">nginx</span>
</code></pre>
<p>This configuration:</p>
<ul>
<li>Uses Let's Encrypt's production ACME server</li>
<li>Specifies your email for notifications about certificate expiry</li>
<li>Uses the HTTP-01 challenge method for domain validation</li>
<li>Associates with our nginx ingress controller</li>
</ul>
<p>We applied this configuration:</p>
<pre><code class="lang-bash">kubectl apply -f cluster-issuer.yaml
</code></pre>
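<p>Before requesting any certificates, it's worth confirming the ClusterIssuer registered successfully with the ACME server (the resource name matches the manifest above; exact output varies by cert-manager version):</p>
<pre><code class="lang-bash"># The READY column should show "True" once the ACME account is registered
kubectl get clusterissuer letsencrypt-prod

# For troubleshooting, inspect the status conditions in detail
kubectl describe clusterissuer letsencrypt-prod
</code></pre>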
<h2 id="heading-ingress-configuration-with-ssltls">Ingress Configuration with SSL/TLS</h2>
<h3 id="heading-setting-up-dns">Setting Up DNS</h3>
<p>Before configuring the Ingress, we created a DNS A record pointing to the external IP of our NGINX Ingress Controller:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Record Type</td><td>Name</td><td>Value</td><td>TTL</td></tr>
</thead>
<tbody>
<tr>
<td>A</td><td>n8n.behooked.co</td><td>74.179.239.172</td><td>3600</td></tr>
</tbody>
</table>
</div><blockquote>
<p>Note: Use your actual domain and the external IP from your NGINX Ingress Controller.</p>
</blockquote>
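<p>It's also worth confirming the record has propagated before applying the Ingress, since the HTTP-01 challenge will fail if Let's Encrypt cannot resolve the domain. A quick check (substitute your own domain):</p>
<pre><code class="lang-bash"># Should print the external IP of the NGINX Ingress Controller
dig +short n8n.behooked.co

# Alternative if dig is not installed
nslookup n8n.behooked.co
</code></pre>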
<h3 id="heading-creating-the-ingress-resource">Creating the Ingress Resource</h3>
<p>Now we can create an Ingress resource to route external traffic to our n8n service and configure SSL/TLS:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Ingress</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-ingress</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">kubernetes.io/ingress.class:</span> <span class="hljs-string">"nginx"</span>
    <span class="hljs-attr">cert-manager.io/cluster-issuer:</span> <span class="hljs-string">"letsencrypt-prod"</span>
    <span class="hljs-attr">nginx.ingress.kubernetes.io/ssl-redirect:</span> <span class="hljs-string">"true"</span>
    <span class="hljs-attr">nginx.ingress.kubernetes.io/proxy-body-size:</span> <span class="hljs-string">"50m"</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">tls:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">hosts:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">n8n.behooked.co</span>
    <span class="hljs-attr">secretName:</span> <span class="hljs-string">n8n-tls-secret</span>
  <span class="hljs-attr">rules:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">host:</span> <span class="hljs-string">n8n.behooked.co</span>
    <span class="hljs-attr">http:</span>
      <span class="hljs-attr">paths:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">/</span>
        <span class="hljs-attr">pathType:</span> <span class="hljs-string">Prefix</span>
        <span class="hljs-attr">backend:</span>
          <span class="hljs-attr">service:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">n8n</span>
            <span class="hljs-attr">port:</span>
              <span class="hljs-attr">number:</span> <span class="hljs-number">5678</span>
</code></pre>
<p>Key aspects of this configuration:</p>
<ul>
<li>References our ClusterIssuer to automatically obtain a certificate</li>
<li>Enables SSL redirection to force HTTPS</li>
<li>Increases the allowed body size for uploading files to n8n</li>
<li>Routes all traffic for our domain to the n8n service</li>
<li>Specifies the TLS secret where the certificate will be stored</li>
</ul>
<p>We applied this configuration:</p>
<pre><code class="lang-bash">kubectl apply -f n8n-ingress.yaml
</code></pre>
<h3 id="heading-verifying-certificate-issuance">Verifying Certificate Issuance</h3>
<p>After applying the Ingress resource, cert-manager automatically requests a certificate from Let's Encrypt. We checked the status with:</p>
<pre><code class="lang-bash">kubectl get certificate -n n8n
</code></pre>
<p>Expected output when successful:</p>
<pre><code>NAME             READY   SECRET           AGE
n8n-tls-secret   True    n8n-tls-secret   <span class="hljs-number">3</span>m
</code></pre><p>If the READY status is "False", we can check for issues:</p>
<pre><code class="lang-bash">kubectl describe certificate n8n-tls-secret -n n8n
</code></pre>
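<p>cert-manager works through a chain of intermediate resources (CertificateRequest, Order, Challenge), and the failure reason usually surfaces on one of them. A typical troubleshooting sequence looks like this (output varies by cert-manager version):</p>
<pre><code class="lang-bash"># Walk down the chain until one resource shows an error
kubectl get certificaterequest -n n8n
kubectl get order -n n8n
kubectl get challenge -n n8n

# The Challenge usually carries the actionable message
# (e.g. DNS not propagated, port 80 unreachable)
kubectl describe challenge -n n8n
</code></pre>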
<h2 id="heading-external-access-flow">External Access Flow</h2>
<p>The following diagram illustrates how external requests flow through our system:</p>
<pre><code class="lang-mermaid">sequenceDiagram
    participant User as User
    participant DNS as DNS (n8n.behooked.co)
    participant Ingress as NGINX Ingress
    participant CM as Cert-Manager
    participant LE as Let's Encrypt
    participant n8n as n8n Service

    User-&gt;&gt;DNS: Request n8n.behooked.co
    DNS--&gt;&gt;User: Resolve to Ingress IP

    User-&gt;&gt;Ingress: HTTPS Request

    alt First Request / Certificate Issuance
        Ingress-&gt;&gt;CM: Check for certificate
        CM-&gt;&gt;LE: Request certificate
        LE-&gt;&gt;CM: Challenge for domain validation
        CM-&gt;&gt;Ingress: Configure validation endpoint
        LE-&gt;&gt;Ingress: Validate domain ownership
        LE--&gt;&gt;CM: Issue certificate
        CM--&gt;&gt;Ingress: Store certificate
    end

    Ingress-&gt;&gt;Ingress: TLS Termination
    Ingress-&gt;&gt;n8n: Forward request
    n8n--&gt;&gt;Ingress: Response
    Ingress--&gt;&gt;User: Encrypted response
</code></pre>
<p>This process provides:</p>
<ul>
<li>Automatic certificate issuance and renewal</li>
<li>End-to-end encryption for all external traffic</li>
<li>Simplified certificate management</li>
</ul>
<h2 id="heading-external-access-architecture">External Access Architecture</h2>
<p>The complete external access architecture can be visualized as:</p>
<pre><code class="lang-mermaid">flowchart LR
    users[("Internet Users")]
    domain["Domain:\nn8n.behooked.co"]

    subgraph "Azure AKS"
        ingress["NGINX Ingress\nController"]
        cert["Cert-Manager"]

        subgraph "n8n Namespace"
            n8nSvc["n8n Service"]
            n8nPod["n8n Pod"]
        end

        ingress --&gt;|"HTTPS\nPort 443"| n8nSvc
        n8nSvc --&gt; n8nPod
        cert -.-&gt;|"Manages\nCertificates"| ingress
    end

    users --&gt;|"HTTPS\nPort 443"| domain
    domain --&gt;|"A Record\n74.179.239.172"| ingress

    style ingress fill:#f96,stroke:#333
    style cert fill:#ff9,stroke:#333
    style n8nSvc fill:#9cf,stroke:#333
    style n8nPod fill:#9fc,stroke:#333
</code></pre>
<h2 id="heading-security-considerations">Security Considerations</h2>
<p>Our external access implementation includes several security enhancements:</p>
<ol>
<li><strong>Force HTTPS</strong>: All HTTP requests are automatically redirected to HTTPS</li>
<li><strong>Modern TLS</strong>: Let's Encrypt provides modern TLS certificates with strong encryption</li>
<li><strong>Automatic Renewal</strong>: Certificates are renewed automatically before they expire</li>
<li><strong>Rate Limiting</strong>: Can be configured on the Ingress to prevent abuse</li>
</ol>
<h2 id="heading-validation">Validation</h2>
<p>To validate our external access configuration, we performed several checks:</p>
<h3 id="heading-1-ingress-status">1. Ingress Status</h3>
<pre><code class="lang-bash">kubectl get ingress -n n8n
</code></pre>
<p>Expected output:</p>
<pre><code>NAME          CLASS    HOSTS              ADDRESS          PORTS     AGE
n8n-ingress   &lt;none&gt;   n8n.behooked.co    <span class="hljs-number">74.179</span><span class="hljs-number">.239</span><span class="hljs-number">.172</span>   <span class="hljs-number">80</span>, <span class="hljs-number">443</span>   <span class="hljs-number">5</span>m
</code></pre><h3 id="heading-2-certificate-status">2. Certificate Status</h3>
<pre><code class="lang-bash">kubectl get certificate -n n8n
</code></pre>
<p>Expected output:</p>
<pre><code>NAME             READY   SECRET           AGE
n8n-tls-secret   True    n8n-tls-secret   <span class="hljs-number">5</span>m
</code></pre><h3 id="heading-3-browser-access">3. Browser Access</h3>
<p>We accessed <code>https://n8n.behooked.co</code> in a browser to verify:</p>
<ul>
<li>The site loads correctly</li>
<li>The connection is secure (padlock icon)</li>
<li>The certificate is valid and issued by Let's Encrypt</li>
</ul>
<h3 id="heading-4-certificate-details">4. Certificate Details</h3>
<p>We also examined the certificate details in the browser to confirm:</p>
<ul>
<li>The correct domain name</li>
<li>Valid issue and expiry dates</li>
<li>Let's Encrypt as the Certificate Authority</li>
</ul>
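<p>The same checks can be scripted from the command line. For example, this prints the issuer and validity window of the certificate actually being served (assuming <code>openssl</code> is available locally):</p>
<pre><code class="lang-bash">echo | openssl s_client -connect n8n.behooked.co:443 -servername n8n.behooked.co 2&gt;/dev/null \
  | openssl x509 -noout -issuer -dates
</code></pre>
<p>The issuer line should reference Let's Encrypt, and <code>notAfter</code> should be roughly 90 days out.</p>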
<h2 id="heading-conclusion">Conclusion</h2>
<p>Our n8n deployment is now securely accessible from the internet with HTTPS encryption, thanks to our Ingress configuration and Let's Encrypt integration. Users can safely access the n8n interface and external systems can securely connect to webhooks.</p>
<p>In the next article, we'll implement monitoring, maintenance procedures, and optimization techniques to ensure our deployment remains healthy and efficient. [Continue to Part 6: Monitoring and Optimization]</p>
<hr />
<p>What challenges have you faced when implementing SSL/TLS for your Kubernetes applications? Have you used Let's Encrypt or other certificate providers? Share your insights in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[Scaling n8n with Queue Mode on Kubernetes: Worker Deployment Guide]]></title><description><![CDATA[This is Part 4 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. View the complete series here.
Implementing the Application Layer
With our data layer in place, we can now implement the applicat...]]></description><link>https://nikhilmishra.xyz/scaling-n8n-with-queue-mode-on-kubernetes-worker-deployment-guide</link><guid isPermaLink="true">https://nikhilmishra.xyz/scaling-n8n-with-queue-mode-on-kubernetes-worker-deployment-guide</guid><category><![CDATA[workers scaling]]></category><category><![CDATA[n8n]]></category><category><![CDATA[queue]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[workers]]></category><category><![CDATA[AKS,Azure kubernetes services]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Sun, 02 Mar 2025 08:54:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740905556974/6ffeefb5-3a48-4d52-a8d2-35bdc78835fb.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is Part 4 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. <a target="_blank" href="https://nikhilmishra.live/series/n8n-azure-k8s">View the complete series here</a>.</em></p>
<h1 id="heading-implementing-the-application-layer">Implementing the Application Layer</h1>
<p>With our data layer in place, we can now implement the application layer of our n8n deployment. This consists of two main components:</p>
<ol>
<li><strong>n8n Main</strong>: The primary n8n instance that serves the UI and API</li>
<li><strong>n8n Workers</strong>: Dedicated execution nodes that process workflows</li>
</ol>
<p>Let's configure these components for a production-ready deployment.</p>
<h2 id="heading-n8n-configuration-and-environment-variables">n8n Configuration and Environment Variables</h2>
<p>Before deploying n8n, we need to understand the key configuration options available.</p>
<h3 id="heading-key-environment-variables">Key Environment Variables</h3>
<p>n8n can be configured through numerous environment variables. The most important ones for our deployment are:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Variable</td><td>Description</td><td>Value for Our Setup</td></tr>
</thead>
<tbody>
<tr>
<td><code>DB_TYPE</code></td><td>Database type</td><td><code>postgresdb</code></td></tr>
<tr>
<td><code>DB_POSTGRESDB_HOST</code></td><td>PostgreSQL hostname</td><td><code>postgres-service</code></td></tr>
<tr>
<td><code>DB_POSTGRESDB_PORT</code></td><td>PostgreSQL port</td><td><code>5432</code></td></tr>
<tr>
<td><code>DB_POSTGRESDB_DATABASE</code></td><td>Database name</td><td><code>n8n</code></td></tr>
<tr>
<td><code>DB_POSTGRESDB_USER</code></td><td>Database user</td><td><code>n8n</code></td></tr>
<tr>
<td><code>DB_POSTGRESDB_PASSWORD</code></td><td>Database password</td><td><code>[secured]</code></td></tr>
<tr>
<td><code>EXECUTIONS_MODE</code></td><td>Execution mode</td><td><code>queue</code></td></tr>
<tr>
<td><code>QUEUE_BULL_REDIS_HOST</code></td><td>Redis hostname</td><td><code>redis-service</code></td></tr>
<tr>
<td><code>QUEUE_BULL_REDIS_PORT</code></td><td>Redis port</td><td><code>6379</code></td></tr>
<tr>
<td><code>N8N_ENCRYPTION_KEY</code></td><td>Encryption key for credentials</td><td><code>[secured]</code></td></tr>
<tr>
<td><code>WEBHOOK_TUNNEL_URL</code></td><td>External webhook URL</td><td><code>https://n8n.yourdomain.com</code></td></tr>
</tbody>
</table>
</div><h3 id="heading-n8n-secrets">n8n Secrets</h3>
<p>We created Kubernetes secrets to store sensitive values:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Secret</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-secret</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">type:</span> <span class="hljs-string">Opaque</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">N8N_ENCRYPTION_KEY:</span> <span class="hljs-string">YVZ2UnlSeXdWN1VjWjAzcWdzQWJQUWY0U1ZCV1Y0bWg=</span>  <span class="hljs-comment"># base64 encoded random string</span>
  <span class="hljs-attr">WEBHOOK_TUNNEL_URL:</span> <span class="hljs-string">aHR0cHM6Ly9uOG4uYmVob29rZWQuY28=</span>  <span class="hljs-comment"># base64 encoded URL</span>
  <span class="hljs-attr">DB_POSTGRESDB_USER:</span> <span class="hljs-string">bjhu</span>  <span class="hljs-comment"># base64 encoded "n8n"</span>
  <span class="hljs-attr">DB_POSTGRESDB_PASSWORD:</span> <span class="hljs-string">c2VjdXJlLXBhc3N3b3JkLWhlcmU=</span>  <span class="hljs-comment"># base64 encoded password</span>
</code></pre>
<blockquote>
<p>Note: Always generate a strong random string for the encryption key, as it's used to encrypt credentials stored in the database.</p>
</blockquote>
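<p>The base64 values in the manifest above can be generated on any machine with <code>openssl</code> and <code>base64</code>. Note the use of <code>printf</code> rather than plain <code>echo</code>: a trailing newline would silently corrupt the encoded value:</p>
<pre><code class="lang-bash"># Generate a strong random encryption key (already base64, usable as-is)
openssl rand -base64 32

# Base64-encode plain values for the Secret manifest
printf '%s' 'n8n' | base64                      # bjhu
printf '%s' 'https://n8n.behooked.co' | base64
</code></pre>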
<h2 id="heading-main-n8n-deployment">Main n8n Deployment</h2>
<p>The main n8n deployment serves the web UI and API, handling user requests and enqueueing workflows for execution.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">n8n</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">n8n</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">n8n</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">n8n</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">n8nio/n8n:latest</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">5678</span>
        <span class="hljs-attr">env:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_TYPE</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"postgresdb"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_POSTGRESDB_HOST</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"postgres-service"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_POSTGRESDB_PORT</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"5432"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_POSTGRESDB_DATABASE</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"n8n"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_POSTGRESDB_USER</span>
          <span class="hljs-attr">valueFrom:</span>
            <span class="hljs-attr">secretKeyRef:</span>
              <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-secret</span>
              <span class="hljs-attr">key:</span> <span class="hljs-string">DB_POSTGRESDB_USER</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_POSTGRESDB_PASSWORD</span>
          <span class="hljs-attr">valueFrom:</span>
            <span class="hljs-attr">secretKeyRef:</span>
              <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-secret</span>
              <span class="hljs-attr">key:</span> <span class="hljs-string">DB_POSTGRESDB_PASSWORD</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">N8N_ENCRYPTION_KEY</span>
          <span class="hljs-attr">valueFrom:</span>
            <span class="hljs-attr">secretKeyRef:</span>
              <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-secret</span>
              <span class="hljs-attr">key:</span> <span class="hljs-string">N8N_ENCRYPTION_KEY</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">WEBHOOK_TUNNEL_URL</span>
          <span class="hljs-attr">valueFrom:</span>
            <span class="hljs-attr">secretKeyRef:</span>
              <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-secret</span>
              <span class="hljs-attr">key:</span> <span class="hljs-string">WEBHOOK_TUNNEL_URL</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">EXECUTIONS_MODE</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"queue"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">QUEUE_BULL_REDIS_HOST</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"redis-service"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">QUEUE_BULL_REDIS_PORT</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"6379"</span>
        <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">requests:</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">"512Mi"</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"300m"</span>
          <span class="hljs-attr">limits:</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"600m"</span>
        <span class="hljs-attr">livenessProbe:</span>
          <span class="hljs-attr">httpGet:</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">/healthz</span>
            <span class="hljs-attr">port:</span> <span class="hljs-number">5678</span>
          <span class="hljs-attr">initialDelaySeconds:</span> <span class="hljs-number">30</span>
          <span class="hljs-attr">periodSeconds:</span> <span class="hljs-number">10</span>
        <span class="hljs-attr">readinessProbe:</span>
          <span class="hljs-attr">httpGet:</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">/healthz</span>
            <span class="hljs-attr">port:</span> <span class="hljs-number">5678</span>
          <span class="hljs-attr">initialDelaySeconds:</span> <span class="hljs-number">20</span>
          <span class="hljs-attr">periodSeconds:</span> <span class="hljs-number">5</span>
</code></pre>
<p>Key aspects of this configuration:</p>
<ul>
<li>Uses the official n8n Docker image</li>
<li>Configures database and Redis connection details</li>
<li>Sets the execution mode to <code>queue</code></li>
<li>Configures resource limits</li>
<li>Adds health checks for container reliability</li>
</ul>
<h3 id="heading-n8n-service">n8n Service</h3>
<p>We exposed the n8n deployment through a Kubernetes service:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">n8n</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">n8n</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">5678</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">5678</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">ClusterIP</span>
</code></pre>
<p>This service allows the Ingress controller to route traffic to n8n.</p>
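<p>Because the service is <code>ClusterIP</code>, it isn't reachable from outside the cluster, but we can smoke-test it with a port-forward before wiring up the Ingress:</p>
<pre><code class="lang-bash"># Forward local port 5678 to the n8n service
kubectl port-forward -n n8n svc/n8n 5678:5678

# In a second terminal: a 200 response means n8n is up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:5678/healthz
</code></pre>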
<h2 id="heading-n8n-worker-deployment">n8n Worker Deployment</h2>
<p>One of the key advantages of our architecture is the separation of the n8n UI/API from the workflow execution. This is achieved through dedicated worker nodes.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-worker</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">2</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">n8n-worker</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">n8n-worker</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-worker</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">n8nio/n8n:latest</span>
        <span class="hljs-attr">command:</span> [<span class="hljs-string">"n8n"</span>, <span class="hljs-string">"worker"</span>]
        <span class="hljs-attr">env:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_TYPE</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"postgresdb"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_POSTGRESDB_HOST</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"postgres-service"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_POSTGRESDB_PORT</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"5432"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_POSTGRESDB_DATABASE</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"n8n"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_POSTGRESDB_USER</span>
          <span class="hljs-attr">valueFrom:</span>
            <span class="hljs-attr">secretKeyRef:</span>
              <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-secret</span>
              <span class="hljs-attr">key:</span> <span class="hljs-string">DB_POSTGRESDB_USER</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_POSTGRESDB_PASSWORD</span>
          <span class="hljs-attr">valueFrom:</span>
            <span class="hljs-attr">secretKeyRef:</span>
              <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-secret</span>
              <span class="hljs-attr">key:</span> <span class="hljs-string">DB_POSTGRESDB_PASSWORD</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">N8N_ENCRYPTION_KEY</span>
          <span class="hljs-attr">valueFrom:</span>
            <span class="hljs-attr">secretKeyRef:</span>
              <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-secret</span>
              <span class="hljs-attr">key:</span> <span class="hljs-string">N8N_ENCRYPTION_KEY</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">EXECUTIONS_MODE</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"queue"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">QUEUE_BULL_REDIS_HOST</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"redis-service"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">QUEUE_BULL_REDIS_PORT</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"6379"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">QUEUE_BULL_REDIS_PREFIX</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"bull"</span>
        <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">requests:</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">"512Mi"</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"300m"</span>
          <span class="hljs-attr">limits:</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"800m"</span>
</code></pre>
<p>The key differences from the main deployment are:</p>
<ul>
<li>Command set to <code>n8n worker</code> to run in worker mode</li>
<li>Multiple replicas for parallel execution</li>
<li>Slightly different resource allocation optimized for workflow execution</li>
<li>No ports exposed (workers don't need to be accessible externally)</li>
</ul>
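<p>After applying the worker manifest, the pod logs are the quickest way to confirm the workers have connected to Redis and are waiting for jobs (log wording varies between n8n versions; the filename below assumes the manifest above was saved as <code>n8n-worker.yaml</code>):</p>
<pre><code class="lang-bash">kubectl apply -f n8n-worker.yaml

# Tail the logs of all worker pods via their label
kubectl logs -n n8n -l app=n8n-worker --tail=20
</code></pre>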
<h2 id="heading-horizontal-pod-autoscaler-for-workers">Horizontal Pod Autoscaler for Workers</h2>
<p>To handle varying workflow loads efficiently, we implemented autoscaling for the worker nodes:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">autoscaling/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">HorizontalPodAutoscaler</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-worker-hpa</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">scaleTargetRef:</span>
    <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">n8n-worker</span>
  <span class="hljs-attr">minReplicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">maxReplicas:</span> <span class="hljs-number">5</span>
  <span class="hljs-attr">metrics:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">type:</span> <span class="hljs-string">Resource</span>
    <span class="hljs-attr">resource:</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">cpu</span>
      <span class="hljs-attr">target:</span>
        <span class="hljs-attr">type:</span> <span class="hljs-string">Utilization</span>
        <span class="hljs-attr">averageUtilization:</span> <span class="hljs-number">70</span>
</code></pre>
<p>This HPA scales the worker deployment based on CPU utilization:</p>
<ul>
<li>Scales up when CPU utilization exceeds 70%</li>
<li>Minimum of 1 worker replica (to save resources during idle periods)</li>
<li>Maximum of 5 worker replicas (to handle peak loads)</li>
</ul>
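<p>Once applied, the HPA can be watched in real time. Note that CPU-based scaling requires the metrics-server, which AKS ships by default:</p>
<pre><code class="lang-bash"># Assumes the manifest above was saved as n8n-worker-hpa.yaml
kubectl apply -f n8n-worker-hpa.yaml

# TARGETS shows current vs. target CPU utilization; REPLICAS shows scaling decisions
kubectl get hpa n8n-worker-hpa -n n8n --watch

# Cross-check actual pod resource usage
kubectl top pods -n n8n
</code></pre>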
<h2 id="heading-queue-mode-architecture">Queue Mode Architecture</h2>
<p>The following diagram illustrates how the queue mode works in our architecture:</p>
<pre><code class="lang-mermaid">sequenceDiagram
    participant Client
    participant n8n as n8n Main
    participant Redis
    participant Worker as n8n Worker
    participant DB as PostgreSQL

    Client-&gt;&gt;n8n: Trigger workflow
    n8n-&gt;&gt;DB: Get workflow definition
    n8n-&gt;&gt;Redis: Enqueue workflow execution
    n8n--&gt;&gt;Client: Acknowledge trigger

    loop For each worker
        Worker-&gt;&gt;Redis: Poll for new jobs
        Redis--&gt;&gt;Worker: Return job if available
        Worker-&gt;&gt;DB: Get workflow details
        Worker-&gt;&gt;Worker: Execute workflow
        Worker-&gt;&gt;DB: Store execution results
    end

    Client-&gt;&gt;n8n: Check execution status
    n8n-&gt;&gt;DB: Retrieve execution results
    n8n--&gt;&gt;Client: Return results
</code></pre>
<p>This approach provides several advantages:</p>
<ul>
<li>The main n8n instance remains responsive even during heavy workflow execution</li>
<li>Multiple workflows can execute in parallel across worker nodes</li>
<li>Workers can be scaled independently based on execution load</li>
<li>Workflow execution continues even if the main n8n UI is restarted</li>
</ul>
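<p>Queue mode itself is switched on through environment variables on both the main and worker containers. A minimal sketch (the variable names come from n8n's queue-mode configuration; <code>redis-service</code> matches the Redis service created in Part 3 of this series):</p>
<pre><code class="lang-yaml">- name: EXECUTIONS_MODE
  value: "queue"
- name: QUEUE_BULL_REDIS_HOST
  value: "redis-service"
- name: QUEUE_BULL_REDIS_PORT
  value: "6379"
</code></pre>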
<h2 id="heading-application-layer-architecture">Application Layer Architecture</h2>
<p>Our complete application layer architecture can be visualized as:</p>
<pre><code class="lang-mermaid">flowchart TD
    subgraph "Application Layer"
        ui["n8n Main\n(UI/API)"]

        subgraph "Worker Pool"
            w1["Worker 1"]
            w2["Worker 2"]
            w3["Worker 3 (scaled)"]
            w4["Worker 4 (scaled)"]
            w5["Worker 5 (scaled)"]
        end

        ui --&gt; w1
        ui --&gt; w2
        ui --&gt; w3
        ui --&gt; w4
        ui --&gt; w5
    end

    subgraph "Data Layer"
        redis[("Redis Queue")]
        db[("PostgreSQL")]
    end

    ui --&gt; redis
    ui --&gt; db

    w1 --&gt; redis
    w1 --&gt; db
    w2 --&gt; redis
    w2 --&gt; db
    w3 --&gt; redis
    w3 --&gt; db
    w4 --&gt; redis
    w4 --&gt; db
    w5 --&gt; redis
    w5 --&gt; db

    client["External Client"] --&gt; ui

    style ui fill:#f96,stroke:#333
    style w1,w2,w3,w4,w5 fill:#69f,stroke:#333
    style redis fill:#bbf,stroke:#333
    style db fill:#6b9,stroke:#333
</code></pre>
<p>The diagram shows:</p>
<ul>
<li>Clear separation between UI and worker instances</li>
<li>Horizontal scaling capability for workers</li>
<li>Shared data infrastructure</li>
<li>Client interaction only with the main n8n instance</li>
</ul>
<h2 id="heading-validation">Validation</h2>
<p>After deploying the application layer, we verified all components were running:</p>
<pre><code class="lang-bash">kubectl get pods -n n8n
</code></pre>
<p>Expected output:</p>
<pre><code>NAME                          READY   STATUS    RESTARTS   AGE
n8n-xxxxxxxxx-xxxxx           <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">3</span>m
n8n-worker-xxxxxxxxx-xxxxx    <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">3</span>m
n8n-worker-xxxxxxxxx-xxxxx    <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">3</span>m
postgres-xxxxxxxxx-xxxxx      <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">10</span>m
redis-xxxxxxxxx-xxxxx         <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">8</span>m
</code></pre><p>We also verified the services:</p>
<pre><code class="lang-bash">kubectl get services -n n8n
</code></pre>
<p>With expected output:</p>
<pre><code>NAME              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
n8n               ClusterIP   <span class="hljs-number">10.0</span>.xxx.xxx     &lt;none&gt;        <span class="hljs-number">5678</span>/TCP   <span class="hljs-number">3</span>m
postgres-service  ClusterIP   <span class="hljs-number">10.0</span>.xxx.xxx     &lt;none&gt;        <span class="hljs-number">5432</span>/TCP   <span class="hljs-number">10</span>m
redis-service     ClusterIP   <span class="hljs-number">10.0</span>.xxx.xxx     &lt;none&gt;        <span class="hljs-number">6379</span>/TCP   <span class="hljs-number">8</span>m
</code></pre><p>With our application layer successfully deployed, we can now move on to the external access layer with Ingress configuration and SSL/TLS setup.</p>
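<p>Beyond pod status, the main instance's health endpoint can be probed through a temporary port-forward (an optional sanity check; n8n exposes <code>/healthz</code> for this purpose):</p>
<pre><code class="lang-bash"># In one terminal: forward the n8n service to localhost
kubectl port-forward svc/n8n -n n8n 5678:5678

# In another terminal: probe the health endpoint
curl http://localhost:5678/healthz
</code></pre>
<p>A healthy instance responds with a small JSON status payload.</p>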
<h2 id="heading-conclusion">Conclusion</h2>
<p>We've successfully deployed the n8n application layer, including the main instance for the UI/API and worker nodes for distributed execution. Our configuration enables horizontal scaling to handle varying workload demands efficiently.</p>
<p>In the next article, we'll make our n8n instance securely accessible from the internet by configuring the Ingress controller and implementing SSL/TLS with Let's Encrypt. [Continue to Part 5: External Access and Security]</p>
<hr />
<p>How are you handling workflow execution in your automation systems? Have you implemented a queue-based approach like we did here? Share your experiences in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[Deploying PostgreSQL and Redis for n8n on Kubernetes: Complete Guide]]></title><description><![CDATA[This is Part 3 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. View the complete series here.
Implementing the Data Layer
A robust data layer is critical for any production workflow system. In...]]></description><link>https://nikhilmishra.xyz/deploying-postgresql-and-redis-for-n8n-on-kubernetes-complete-guide</link><guid isPermaLink="true">https://nikhilmishra.xyz/deploying-postgresql-and-redis-for-n8n-on-kubernetes-complete-guide</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Kubernetes Deployment]]></category><category><![CDATA[n8n]]></category><category><![CDATA[Redis]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Sun, 02 Mar 2025 08:43:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740904868274/86b690fa-b8ab-4e31-b1a4-3bbaa6c13cc9.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is Part 3 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. <a target="_blank" href="https://nikhilmishra.live/series/n8n-azure-k8s">View the complete series here</a>.</em></p>
<h1 id="heading-implementing-the-data-layer">Implementing the Data Layer</h1>
<p>A robust data layer is critical for any production workflow system. In our n8n deployment, the data layer consists of two primary components:</p>
<ol>
<li><strong>PostgreSQL</strong>: For persistent storage of workflows, credentials, and execution history</li>
<li><strong>Redis</strong>: For queue management and workflow distribution</li>
</ol>
<p>Let's implement each of these components with production-grade configurations.</p>
<h2 id="heading-postgresql-deployment">PostgreSQL Deployment</h2>
<h3 id="heading-why-postgresql-for-n8n">Why PostgreSQL for n8n?</h3>
<p>n8n stores various types of data that require a reliable, ACID-compliant database:</p>
<ul>
<li>Workflow definitions</li>
<li>Execution history</li>
<li>Credentials (encrypted)</li>
<li>User accounts and settings</li>
<li>Tags and other metadata</li>
</ul>
<p>PostgreSQL is an excellent choice for n8n because it offers:</p>
<ul>
<li>Strong data integrity guarantees</li>
<li>Rich feature set for complex queries</li>
<li>Excellent performance for n8n's access patterns</li>
<li>Mature ecosystem with extensive tooling</li>
<li>Open-source with enterprise reliability</li>
</ul>
<h3 id="heading-security-best-practices-for-postgresql">Security Best Practices for PostgreSQL</h3>
<p>For our production deployment, we implemented several security best practices:</p>
<ol>
<li><strong>Dedicated Non-Root User</strong>: Created a specific database user for n8n access</li>
<li><strong>Password Security</strong>: Stored database credentials in Kubernetes Secrets</li>
<li><strong>Network Isolation</strong>: Restricted access to within the Kubernetes cluster only</li>
<li><strong>Resource Limits</strong>: Set appropriate CPU and memory limits</li>
</ol>
<h3 id="heading-postgresql-kubernetes-secrets">PostgreSQL Kubernetes Secrets</h3>
<p>First, we created a Kubernetes Secret to store database credentials:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Secret</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-secret</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">type:</span> <span class="hljs-string">Opaque</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">POSTGRES_USER:</span> <span class="hljs-string">cG9zdGdyZXM=</span>  <span class="hljs-comment"># "postgres" base64 encoded</span>
  <span class="hljs-attr">POSTGRES_PASSWORD:</span> <span class="hljs-string">cFstNUpxdHM9UyVGYzMrTEY=</span>  <span class="hljs-comment"># base64 encoded password</span>
  <span class="hljs-attr">POSTGRES_DB:</span> <span class="hljs-string">bjhu</span>  <span class="hljs-comment"># "n8n" base64 encoded</span>
</code></pre>
<blockquote>
<p>Note: For security, always generate strong random passwords for production deployments.</p>
</blockquote>
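<p>The base64 values above can be generated (and verified) locally. Note the <code>-n</code> flag: without it, <code>echo</code> appends a newline that silently changes the encoded value:</p>
<pre><code class="lang-bash"># Encode values for the Secret (-n suppresses the trailing newline)
echo -n 'postgres' | base64   # cG9zdGdyZXM=
echo -n 'n8n' | base64        # bjhu

# Decode to double-check an entry
echo -n 'bjhu' | base64 -d    # n8n
</code></pre>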
<p>We applied this secret:</p>
<pre><code class="lang-bash">kubectl apply -f postgres-secret.yaml
</code></pre>
<h3 id="heading-database-initialization-configmap">Database Initialization ConfigMap</h3>
<p>To create a dedicated n8n user in PostgreSQL, we created a ConfigMap with an initialization script:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ConfigMap</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-init-script</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">init-db.sh:</span> <span class="hljs-string">|
    #!/bin/bash
    set -e
</span>
    <span class="hljs-string">psql</span> <span class="hljs-string">-v</span> <span class="hljs-string">ON_ERROR_STOP=1</span> <span class="hljs-string">--username</span> <span class="hljs-string">"$POSTGRES_USER"</span> <span class="hljs-string">--dbname</span> <span class="hljs-string">"$POSTGRES_DB"</span> <span class="hljs-string">&lt;&lt;-EOSQL</span>
      <span class="hljs-string">CREATE</span> <span class="hljs-string">USER</span> <span class="hljs-string">n8n</span> <span class="hljs-string">WITH</span> <span class="hljs-string">PASSWORD</span> <span class="hljs-string">'secure-password-here'</span><span class="hljs-string">;</span>
      <span class="hljs-string">GRANT</span> <span class="hljs-string">ALL</span> <span class="hljs-string">PRIVILEGES</span> <span class="hljs-string">ON</span> <span class="hljs-string">DATABASE</span> <span class="hljs-string">n8n</span> <span class="hljs-string">TO</span> <span class="hljs-string">n8n;</span>
      <span class="hljs-string">ALTER</span> <span class="hljs-string">USER</span> <span class="hljs-string">n8n</span> <span class="hljs-string">WITH</span> <span class="hljs-string">SUPERUSER;</span>
    <span class="hljs-string">EOSQL</span>
</code></pre>
<p>This script is executed when PostgreSQL starts, creating an n8n user with appropriate privileges.</p>
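<p>To confirm the script actually ran, the database roles can be listed from inside the running pod (an optional spot-check, assuming the deployment is named <code>postgres</code> as below):</p>
<pre><code class="lang-bash">kubectl exec -it deploy/postgres -n n8n -- psql -U postgres -d n8n -c '\du'
</code></pre>
<p>The output should include the <code>n8n</code> role. Keep in mind the official PostgreSQL image only runs init scripts when the data directory is empty, so the script will not re-run on restarts against existing data.</p>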
<h3 id="heading-postgresql-deployment-configuration">PostgreSQL Deployment Configuration</h3>
<p>Now we can create the PostgreSQL deployment:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">postgres:13</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">5432</span>
        <span class="hljs-attr">env:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">POSTGRES_USER</span>
          <span class="hljs-attr">valueFrom:</span>
            <span class="hljs-attr">secretKeyRef:</span>
              <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-secret</span>
              <span class="hljs-attr">key:</span> <span class="hljs-string">POSTGRES_USER</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">POSTGRES_PASSWORD</span>
          <span class="hljs-attr">valueFrom:</span>
            <span class="hljs-attr">secretKeyRef:</span>
              <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-secret</span>
              <span class="hljs-attr">key:</span> <span class="hljs-string">POSTGRES_PASSWORD</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">POSTGRES_DB</span>
          <span class="hljs-attr">valueFrom:</span>
            <span class="hljs-attr">secretKeyRef:</span>
              <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-secret</span>
              <span class="hljs-attr">key:</span> <span class="hljs-string">POSTGRES_DB</span>
        <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-data</span>
          <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/var/lib/postgresql/data</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">init-script</span>
          <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/docker-entrypoint-initdb.d</span>
        <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">requests:</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">"512Mi"</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"500m"</span>
          <span class="hljs-attr">limits:</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"1000m"</span>
      <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-data</span>
        <span class="hljs-attr">persistentVolumeClaim:</span>
          <span class="hljs-attr">claimName:</span> <span class="hljs-string">postgres-data-claim</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">init-script</span>
        <span class="hljs-attr">configMap:</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-init-script</span>
</code></pre>
<p>Key aspects of this configuration:</p>
<ul>
<li>Uses the PostgreSQL 13 image</li>
<li>Mounts the persistent volume claim for data storage</li>
<li>Mounts the initialization script ConfigMap</li>
<li>Sets resource limits to ensure stability</li>
<li>Uses environment variables from Kubernetes Secrets</li>
</ul>
<h3 id="heading-postgresql-service">PostgreSQL Service</h3>
<p>To make PostgreSQL accessible to other pods in the cluster, we created a Kubernetes Service:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-service</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">5432</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">5432</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">ClusterIP</span>
</code></pre>
<p>This service provides a stable endpoint (<code>postgres-service</code>) for other components to connect to PostgreSQL.</p>
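<p>On the n8n side, this endpoint becomes the database host. A sketch of the corresponding environment variables (names follow n8n's PostgreSQL configuration; the password should be injected from a Secret rather than written inline):</p>
<pre><code class="lang-yaml">- name: DB_TYPE
  value: "postgresdb"
- name: DB_POSTGRESDB_HOST
  value: "postgres-service"
- name: DB_POSTGRESDB_PORT
  value: "5432"
- name: DB_POSTGRESDB_DATABASE
  value: "n8n"
- name: DB_POSTGRESDB_USER
  value: "n8n"
</code></pre>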
<h2 id="heading-redis-deployment">Redis Deployment</h2>
<h3 id="heading-redis-for-queue-management">Redis for Queue Management</h3>
<p>Redis serves as the queue manager for n8n's distributed workflow execution. It:</p>
<ul>
<li>Maintains lists of pending workflows</li>
<li>Tracks workflow execution state</li>
<li>Enables worker coordination</li>
<li>Provides fast in-memory operations</li>
</ul>
<h3 id="heading-redis-deployment-configuration">Redis Deployment Configuration</h3>
<p>We deployed Redis with the following configuration:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">redis</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">redis</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">redis</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">redis</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">redis:6-alpine</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">6379</span>
        <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">requests:</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">"256Mi"</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"200m"</span>
          <span class="hljs-attr">limits:</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">"512Mi"</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"500m"</span>
        <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">redis-data</span>
          <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/data</span>
      <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">redis-data</span>
        <span class="hljs-attr">emptyDir:</span> {}
</code></pre>
<p>For our use case, we chose a simple Redis deployment without persistence, as the queue data can be regenerated if lost. For more critical deployments, you could add a PersistentVolumeClaim similar to PostgreSQL.</p>
<h3 id="heading-redis-service">Redis Service</h3>
<p>We created a Redis service to make it accessible to n8n components:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">redis-service</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">redis</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">6379</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">6379</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">ClusterIP</span>
</code></pre>
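<p>Once both manifests are applied, connectivity can be sanity-checked from inside the Redis pod (optional, assuming the deployment is named <code>redis</code>):</p>
<pre><code class="lang-bash">kubectl exec deploy/redis -n n8n -- redis-cli ping
</code></pre>
<p>A healthy instance answers <code>PONG</code>.</p>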
<h2 id="heading-data-layer-architecture-diagram">Data Layer Architecture Diagram</h2>
<p>Our complete data layer architecture can be visualized as:</p>
<pre><code class="lang-mermaid">flowchart LR
    subgraph "Data Layer"
        subgraph "PostgreSQL"
            pg[("PostgreSQL Pod")]
            pv[("Persistent Volume\n64Gi")]
            pg --&gt; pv
        end

        subgraph "Redis"
            redis[("Redis Pod")]
            mem[("In-Memory Storage")]
            redis --&gt; mem
        end
    end

    subgraph "Consumers"
        n8n["n8n Main"]
        workers["n8n Workers"]
    end

    n8n -.-&gt; pg
    n8n -.-&gt; redis
    workers -.-&gt; pg
    workers -.-&gt; redis

    style pv fill:#f9f,stroke:#333
    style mem fill:#bbf,stroke:#333
    style pg fill:#bfb,stroke:#333
    style redis fill:#bfb,stroke:#333
</code></pre>
<p>This architecture provides:</p>
<ul>
<li>Clear separation between stateful services</li>
<li>Persistent storage for critical data</li>
<li>In-memory performance for queue operations</li>
<li>Accessible services for n8n components</li>
</ul>
<h2 id="heading-validation">Validation</h2>
<p>After deploying PostgreSQL and Redis, we verified their status:</p>
<pre><code class="lang-bash">kubectl get pods -n n8n
</code></pre>
<p>Successful output looks like:</p>
<pre><code>NAME                       READY   STATUS    RESTARTS   AGE
postgres-xxxxxxxxx-xxxxx   <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">2</span>m
redis-xxxxxxxxx-xxxxx      <span class="hljs-number">1</span>/<span class="hljs-number">1</span>     Running   <span class="hljs-number">0</span>          <span class="hljs-number">1</span>m
</code></pre><p>We also verified the services:</p>
<pre><code class="lang-bash">kubectl get services -n n8n
</code></pre>
<p>With output:</p>
<pre><code>NAME              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
postgres-service  ClusterIP   <span class="hljs-number">10.0</span>.xxx.xxx     &lt;none&gt;        <span class="hljs-number">5432</span>/TCP   <span class="hljs-number">2</span>m
redis-service     ClusterIP   <span class="hljs-number">10.0</span>.xxx.xxx     &lt;none&gt;        <span class="hljs-number">6379</span>/TCP   <span class="hljs-number">1</span>m
</code></pre><p>Now that our data layer is properly set up, we can proceed to deploying the n8n application layer in the next section.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>With our data layer successfully deployed, we have reliable PostgreSQL storage for our workflow definitions and execution history, plus Redis for efficient queue management. These components form the persistence backbone of our n8n deployment.</p>
<p>In the next article, we'll deploy the n8n application itself, including the main service and worker nodes for distributed processing. [Continue to Part 4: Application Layer]</p>
<hr />
<p>Have you implemented PostgreSQL or Redis in Kubernetes before? What database optimization techniques have worked best for your workflow automation systems? Share your insights in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[How to Set Up Azure Kubernetes Service for n8n Workflow Automation]]></title><description><![CDATA[This is Part 2 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. View the complete series here.
Setting Up the Foundation
Creating the AKS Cluster
The first step in our implementation is setting...]]></description><link>https://nikhilmishra.xyz/how-to-set-up-azure-kubernetes-service-for-n8n-workflow-automation</link><guid isPermaLink="true">https://nikhilmishra.xyz/how-to-set-up-azure-kubernetes-service-for-n8n-workflow-automation</guid><category><![CDATA[AKS cluster configuration]]></category><category><![CDATA[n8n]]></category><category><![CDATA[AKS,Azure kubernetes services]]></category><category><![CDATA[Workflow Automation]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Sun, 02 Mar 2025 08:33:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740904311100/0d8173c0-787d-4a4b-bca9-4174a595dd24.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is Part 2 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. <a class="post-section-overview" href="https://nikhilmishra.live/series/n8n-azure-k8s">View the complete series here</a>.</em></p>
<h1 id="heading-setting-up-the-foundation">Setting Up the Foundation</h1>
<h2 id="heading-creating-the-aks-cluster">Creating the AKS Cluster</h2>
<p>The first step in our implementation is setting up the Azure Kubernetes Service (AKS) cluster. This forms the foundation of our entire deployment.</p>
<h3 id="heading-resource-planning">Resource Planning</h3>
<p>Before creating the cluster, we determined our resource requirements:</p>
<ul>
<li><strong>Node Count</strong>: 2 nodes for basic high availability</li>
<li><strong>VM Size</strong>: D2s v3 (2 vCPUs, 8GB RAM) for good performance</li>
<li><strong>Region</strong>: East US (chosen for proximity to our users)</li>
<li><strong>Kubernetes Version</strong>: 1.25.5 (stable version with good feature support)</li>
</ul>
<h3 id="heading-creating-the-cluster-with-azure-cli">Creating the Cluster with Azure CLI</h3>
<p>We used Azure CLI for cluster creation to make the process reproducible:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create resource group</span>
az group create --name n8n-aks-rg --location eastus

<span class="hljs-comment"># Create AKS cluster</span>
az aks create \
  --resource-group n8n-aks-rg \
  --name n8n-cluster \
  --node-count 2 \
  --node-vm-size Standard_D2s_v3 \
  --kubernetes-version 1.25.5 \
  --enable-managed-identity \
  --generate-ssh-keys
</code></pre>
<p>This command creates a basic AKS cluster with managed identity for simplified authentication. The SSH keys are automatically generated for node access if needed.</p>
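<p>Provisioning typically takes several minutes. Its progress can be polled with the following command (a convenience check, not required for the deployment):</p>
<pre><code class="lang-bash">az aks show \
  --resource-group n8n-aks-rg \
  --name n8n-cluster \
  --query provisioningState -o tsv
</code></pre>
<p>It prints <code>Succeeded</code> once the cluster is ready.</p>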
<h3 id="heading-connecting-to-the-cluster">Connecting to the Cluster</h3>
<p>After cluster creation, we configured <code>kubectl</code> to connect to our new cluster:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Get credentials</span>
az aks get-credentials --resource-group n8n-aks-rg --name n8n-cluster

<span class="hljs-comment"># Verify connection</span>
kubectl get nodes
</code></pre>
<p>The output confirmed our two nodes were running:</p>
<pre><code>NAME                                STATUS   ROLES   AGE   VERSION
aks-nodepool1<span class="hljs-number">-12345678</span>-vmss000000   Ready    agent   <span class="hljs-number">3</span>m    v1<span class="hljs-number">.25</span><span class="hljs-number">.5</span>
aks-nodepool1<span class="hljs-number">-12345678</span>-vmss000001   Ready    agent   <span class="hljs-number">3</span>m    v1<span class="hljs-number">.25</span><span class="hljs-number">.5</span>
</code></pre><h2 id="heading-namespace-organization">Namespace Organization</h2>
<p>We created a dedicated namespace for our n8n deployment to isolate it from other applications that might run on the same cluster:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create namespace</span>
kubectl create namespace n8n

<span class="hljs-comment"># Set as default namespace for this context</span>
kubectl config set-context --current --namespace=n8n
</code></pre>
<p>Using a dedicated namespace provides several benefits:</p>
<ul>
<li>Resource isolation</li>
<li>Simplified RBAC (Role-Based Access Control)</li>
<li>Clear resource organization</li>
<li>Ability to set resource quotas per namespace</li>
</ul>
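<p>As an example of the last point, a <code>ResourceQuota</code> can cap what the namespace is allowed to consume (illustrative numbers only, sized for the two-node cluster above):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: ResourceQuota
metadata:
  name: n8n-quota
  namespace: n8n
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
</code></pre>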
<h2 id="heading-network-architecture">Network Architecture</h2>
<h3 id="heading-network-considerations">Network Considerations</h3>
<p>In our design, we addressed several network requirements:</p>
<ol>
<li><strong>External Access</strong>: The n8n UI must be accessible externally via HTTPS</li>
<li><strong>Inter-Service Communication</strong>: Components need to communicate within the cluster</li>
<li><strong>Security</strong>: Network policies to restrict unnecessary communication</li>
</ol>
<h3 id="heading-implementing-the-ingress-controller">Implementing the Ingress Controller</h3>
<p>For external access, we installed the NGINX Ingress Controller:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Add the Helm repository</span>
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

<span class="hljs-comment"># Install NGINX Ingress Controller</span>
helm install nginx-ingress ingress-nginx/ingress-nginx \
  --namespace default \
  --<span class="hljs-built_in">set</span> controller.replicaCount=2 \
  --<span class="hljs-built_in">set</span> controller.nodeSelector.<span class="hljs-string">"kubernetes\.io/os"</span>=linux \
  --<span class="hljs-built_in">set</span> defaultBackend.nodeSelector.<span class="hljs-string">"kubernetes\.io/os"</span>=linux
</code></pre>
<p>After installation, we verified the ingress controller's external IP:</p>
<pre><code class="lang-bash">kubectl get service nginx-ingress-ingress-nginx-controller
</code></pre>
<p>This returned our external IP (74.179.239.172) which we later used to configure DNS.</p>
<h2 id="heading-storage-classes-and-persistent-volumes">Storage Classes and Persistent Volumes</h2>
<h3 id="heading-storage-architecture">Storage Architecture</h3>
<p>For our n8n deployment, we needed persistent storage for:</p>
<ul>
<li>PostgreSQL database</li>
<li>Redis data (if needed for persistence)</li>
</ul>
<p>AKS provides default storage classes that use Azure Disk or Azure File. We used the default storage class (<code>managed-premium</code>) which creates Azure Premium Managed Disks.</p>
<h3 id="heading-verifying-storage-classes">Verifying Storage Classes</h3>
<p>We checked available storage classes with:</p>
<pre><code class="lang-bash">kubectl get storageclass
</code></pre>
<p>The output confirmed the default storage class:</p>
<pre><code>NAME                   PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
managed-premium (<span class="hljs-keyword">default</span>)   disk.csi.azure.com   Delete          Immediate           <span class="hljs-literal">true</span>                   <span class="hljs-number">10</span>m
managed-csi               disk.csi.azure.com   Delete          Immediate           <span class="hljs-literal">true</span>                   <span class="hljs-number">10</span>m
</code></pre><h3 id="heading-creating-persistent-volume-claims">Creating Persistent Volume Claims</h3>
<p>For PostgreSQL, we created a Persistent Volume Claim (PVC) to ensure data persistence:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">PersistentVolumeClaim</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-data-claim</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">n8n</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">accessModes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">ReadWriteOnce</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">storage:</span> <span class="hljs-string">64Gi</span>
  <span class="hljs-attr">storageClassName:</span> <span class="hljs-string">managed-premium</span>
</code></pre>
<p>We saved this as <code>postgres-pvc.yaml</code> and applied it:</p>
<pre><code class="lang-bash">kubectl apply -f postgres-pvc.yaml
</code></pre>
<p>This PVC would be used by the PostgreSQL StatefulSet to store database files.</p>
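<p>Before deploying the StatefulSet, it is worth confirming the claim actually bound (with <code>Immediate</code> binding mode, provisioning starts right away). A quick check, sketched below — the <code>pvc_bound</code> helper name is ours; the claim name and namespace match the manifest above:</p>

```shell
#!/usr/bin/env bash
# Confirm the PVC reached the Bound phase before deploying PostgreSQL.
pvc_bound() {
  # Succeeds only when the reported phase is exactly "Bound".
  [ "$1" = "Bound" ]
}

if command -v kubectl >/dev/null 2>&1; then
  phase=$(kubectl get pvc postgres-data-claim -n n8n \
    -o jsonpath='{.status.phase}')
  if pvc_bound "$phase"; then
    echo "PVC is Bound - safe to deploy PostgreSQL"
  else
    echo "PVC not ready (phase: ${phase:-unknown})" >&2
  fi
fi
```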
<h2 id="heading-deployment-process">Deployment Process</h2>
<p>The overall deployment process follows this sequence:</p>
<pre><code class="lang-mermaid">flowchart TB
    start([Start Deployment]) --&gt; prereq[Prerequisites Check]
    prereq --&gt; aks[Create AKS Cluster]
    aks --&gt; namespace[Create n8n Namespace]

    namespace --&gt; storage[Configure Storage]
    storage --&gt; secrets[Apply Secrets]

    secrets --&gt; dbPath[Database Path]
    secrets --&gt; redisPath[Redis Path]
    secrets --&gt; n8nPath[n8n Path]

    dbPath --&gt; postgres[Deploy PostgreSQL]
    redisPath --&gt; redis[Deploy Redis]

    postgres --&gt; dbInit[Initialize Database]
    redis --&gt; queueInit[Initialize Queue]

    dbInit --&gt; n8n[Deploy n8n Main]
    queueInit --&gt; n8n

    n8n --&gt; workers[Deploy n8n Workers]
    n8n --&gt; ingress[Configure Ingress] 

    ingress --&gt; certmgr[Install Cert-Manager]
    certmgr --&gt; issuer[Configure ClusterIssuer]
    issuer --&gt; cert[Obtain SSL Certificate]

    cert --&gt; validation[Validation Tests]
    workers --&gt; validation

    validation --&gt; complete([Deployment Complete])

    classDef setup fill:#f9f,stroke:#333,stroke-width:1px
    classDef deployment fill:#bbf,stroke:#333,stroke-width:1px
    classDef config fill:#bfb,stroke:#333,stroke-width:1px
    classDef validation fill:#fbf,stroke:#333,stroke-width:1px

    class start,prereq,aks,namespace setup
    class postgres,redis,n8n,workers deployment
    class storage,secrets,ingress,certmgr,issuer,cert config
    class dbInit,queueInit,validation validation
</code></pre>
<p>This workflow ensures that dependencies are deployed in the correct order, with each component building upon the previous ones. In the next section, we'll set up the foundational data layer with PostgreSQL and Redis.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>With our AKS cluster provisioned and configured, we now have a solid foundation for our n8n deployment. We've set up proper namespaces, configured networking, and prepared our persistent storage requirements.</p>
<p>In the next article, we'll implement the data layer by deploying PostgreSQL and Redis with proper security configurations and persistence. [Continue to Part 3: Data Layer Implementation]</p>
<hr />
<p>What challenges have you faced when setting up Kubernetes clusters for stateful applications? Share your experiences in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[Building a Production-Ready n8n Workflow Automation Platform on AKS: Introduction & Architecture]]></title><description><![CDATA[This is Part 1 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. View the complete series here.
Introduction
In today's fast-paced digital landscape, workflow automation has become essential for...]]></description><link>https://nikhilmishra.xyz/building-a-production-ready-n8n-workflow-automation-platform-on-aks-introduction-and-architecture</link><guid isPermaLink="true">https://nikhilmishra.xyz/building-a-production-ready-n8n-workflow-automation-platform-on-aks-introduction-and-architecture</guid><category><![CDATA[n8n]]></category><category><![CDATA[Kubernetes deployments]]></category><category><![CDATA[Workflow Automation]]></category><category><![CDATA[AKS,Azure kubernetes services]]></category><category><![CDATA[workflow automation software]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Sun, 02 Mar 2025 08:26:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740903845702/18087b4d-75c6-466d-882e-f006d36926fd.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is Part 1 of the "Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service" series. <a class="post-section-overview" href="https://nikhilmishra.live/series/n8n-azure-k8s">View the complete series here</a>.</em></p>
<h2 id="heading-introduction">Introduction</h2>
<p>In today's fast-paced digital landscape, workflow automation has become essential for businesses to operate efficiently. As organizations grow, the need for robust, scalable automation solutions becomes increasingly crucial. This blog post details our journey of implementing a production-grade <a target="_blank" href="https://n8n.io/">n8n</a> workflow automation platform on Azure Kubernetes Service (AKS).</p>
<h3 id="heading-what-is-n8n">What is n8n?</h3>
<p>n8n (pronounced "n-eight-n") is an open-source workflow automation tool that allows you to connect different services and build automated workflows without writing code. It's like Zapier or Integromat, but with the advantage of being self-hosted, giving you complete control over your data and workflows.</p>
<p>Some key features of n8n include:</p>
<ul>
<li>Visual workflow editor</li>
<li>200+ built-in integrations</li>
<li>Webhooks for real-time triggers</li>
<li>Custom JavaScript functions</li>
<li>Self-hosting capability</li>
<li>API access</li>
</ul>
<h3 id="heading-why-kubernetes-for-n8n">Why Kubernetes for n8n?</h3>
<p>While n8n can run on a simple VM or even locally, a production deployment demands more:</p>
<ul>
<li><strong>High Availability</strong>: Ensure your automation workflows run 24/7</li>
<li><strong>Scalability</strong>: Handle growing workflow volume as your needs increase</li>
<li><strong>Resource Efficiency</strong>: Optimize resource allocation based on actual demand</li>
<li><strong>Disaster Recovery</strong>: Protect against data loss and service interruptions</li>
<li><strong>Simplified Management</strong>: Standardize deployment, updates, and monitoring</li>
</ul>
<p>Kubernetes addresses these requirements by providing a container orchestration platform that can manage complex applications with multiple components.</p>
<h3 id="heading-why-azure-kubernetes-service">Why Azure Kubernetes Service?</h3>
<p>Azure Kubernetes Service (AKS) offers several advantages for hosting n8n:</p>
<ul>
<li><strong>Managed Control Plane</strong>: Focus on your application rather than managing Kubernetes infrastructure</li>
<li><strong>Integrated Security</strong>: Azure Active Directory integration, network security policies, and RBAC</li>
<li><strong>Simple Scaling</strong>: Easy horizontal and vertical scaling of nodes</li>
<li><strong>Azure Integration</strong>: Seamless integration with other Azure services like Storage, Monitoring, and Networking</li>
<li><strong>Cost Optimization</strong>: Pay only for the worker nodes, as the control plane is free</li>
</ul>
<h3 id="heading-first-principles-approach">First Principles Approach</h3>
<p>In this article, we'll take a first principles approach, understanding the fundamental requirements of a production workflow system before implementing our solution. This means:</p>
<ol>
<li>Starting with the core needs (data persistence, execution reliability, security)</li>
<li>Breaking down the system into logical components</li>
<li>Understanding the interactions between components</li>
<li>Building up a robust architecture that meets all requirements</li>
<li>Implementing with best practices for production environments</li>
</ol>
<p>Let's dive into the architecture from first principles.</p>
<h2 id="heading-understanding-the-architecture-first-principles">Understanding the Architecture (First Principles)</h2>
<p>When designing a production workflow system, we need to consider several fundamental requirements:</p>
<h3 id="heading-1-data-persistence">1. Data Persistence</h3>
<p>Workflows and their execution data must be stored reliably. This requires:</p>
<ul>
<li>A database for storing workflow definitions and execution records</li>
<li>Persistent storage that survives container restarts</li>
<li>Backup capabilities for disaster recovery</li>
</ul>
<h3 id="heading-2-execution-reliability">2. Execution Reliability</h3>
<p>Workflow executions must be reliable, even under high load:</p>
<ul>
<li>Queue-based processing to handle spikes in workflow triggers</li>
<li>Worker redundancy to prevent single points of failure</li>
<li>Graceful handling of errors and retries</li>
</ul>
<h3 id="heading-3-security">3. Security</h3>
<p>Sensitive data and external access must be secured:</p>
<ul>
<li>Encryption for data at rest and in transit</li>
<li>Secure storage of credentials and API keys</li>
<li>Authentication and authorization controls</li>
<li>Network security for external access</li>
</ul>
<h3 id="heading-4-scalability">4. Scalability</h3>
<p>The system must scale as workflow needs grow:</p>
<ul>
<li>Horizontal scaling for workers under load</li>
<li>Database performance optimization</li>
<li>Resource allocation efficiency</li>
</ul>
<h3 id="heading-5-maintainability">5. Maintainability</h3>
<p>The deployment must be easy to maintain over time:</p>
<ul>
<li>Monitoring and logging</li>
<li>Simple update processes</li>
<li>Documentation for operational procedures</li>
</ul>
<p>With these first principles in mind, we can design our n8n deployment architecture.</p>
<h2 id="heading-architecture-overview">Architecture Overview</h2>
<p>Our n8n on AKS architecture consists of the following components:</p>
<pre><code class="lang-mermaid">flowchart TD
    subgraph "Azure AKS Cluster"
        subgraph "External Access Layer"
            ingress["NGINX Ingress Controller"]
            cert["Cert-Manager"]
        end

        subgraph "Application Layer"
            n8n["n8n Main\n(UI/API)"]
            worker1["n8n Worker 1"]
            worker2["n8n Worker 2"]
            workern["n8n Worker n\n(Auto-scaled)"]
        end

        subgraph "Data Layer"
            postgres[("PostgreSQL\nDatabase")]
            redis[("Redis\nQueue")]
        end

        ingress &lt;--&gt; cert
        ingress --&gt; n8n
        n8n &lt;--&gt; postgres
        n8n &lt;--&gt; redis
        worker1 &lt;--&gt; postgres
        worker1 &lt;--&gt; redis
        worker2 &lt;--&gt; postgres
        worker2 &lt;--&gt; redis
        workern &lt;--&gt; postgres
        workern &lt;--&gt; redis
    end

    client[("External\nClient")] &lt;--&gt; ingress
</code></pre>
<h3 id="heading-layer-breakdown">Layer Breakdown</h3>
<h4 id="heading-1-data-layer">1. Data Layer</h4>
<ul>
<li><p><strong>PostgreSQL</strong>: Stores workflow definitions, credentials, and execution records</p>
<ul>
<li>Uses persistent volume for data durability</li>
<li>Configured with appropriate resources for performance</li>
<li>Non-root user for n8n database access</li>
</ul>
</li>
<li><p><strong>Redis</strong>: Manages workflow execution queue</p>
<ul>
<li>Enables distributed execution across workers</li>
<li>Tracks execution state and enables retries</li>
<li>Provides inter-process communication</li>
</ul>
</li>
</ul>
<h4 id="heading-2-application-layer">2. Application Layer</h4>
<ul>
<li><p><strong>n8n Main</strong>: Serves the web UI and API</p>
<ul>
<li>Handles workflow editing and management</li>
<li>Processes webhook triggers</li>
<li>Enqueues workflows for execution</li>
</ul>
</li>
<li><p><strong>n8n Workers</strong>: Execute workflow tasks</p>
<ul>
<li>Horizontally scalable based on load</li>
<li>Pull jobs from Redis queue</li>
<li>Report execution results back to the database</li>
</ul>
</li>
</ul>
<h4 id="heading-3-external-access-layer">3. External Access Layer</h4>
<ul>
<li><p><strong>NGINX Ingress Controller</strong>: Routes external traffic</p>
<ul>
<li>Terminates SSL/TLS</li>
<li>Handles HTTP routing rules</li>
<li>Load balances incoming requests</li>
</ul>
</li>
<li><p><strong>Cert-Manager</strong>: Manages SSL/TLS certificates</p>
<ul>
<li>Automates certificate issuance from Let's Encrypt</li>
<li>Handles certificate renewal</li>
<li>Configures HTTPS security</li>
</ul>
</li>
</ul>
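<p>To make the cert-manager role concrete: certificate issuance is typically driven by a <code>ClusterIssuer</code> resource. A minimal sketch is shown below — the issuer name and contact email are placeholders, not values from this series:</p>

```yaml
# Minimal Let's Encrypt issuer for cert-manager (placeholder name/email).
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com        # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod        # secret storing the ACME account key
    solvers:
      - http01:
          ingress:
            class: nginx            # challenges solved via the NGINX ingress
```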
<h3 id="heading-data-flow">Data Flow</h3>
<ol>
<li>External clients connect to the n8n UI/API via the ingress controller</li>
<li>The n8n main service handles UI interactions and API requests</li>
<li>When a workflow is triggered, it's added to the Redis queue</li>
<li>Worker nodes pick up workflow executions from the queue</li>
<li>Workers execute the workflow and store results in PostgreSQL</li>
<li>The n8n main service displays execution results to users</li>
</ol>
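<p>This flow is wired together by n8n's queue-mode settings. A minimal sketch of the environment variables involved is below — the <code>postgres</code> and <code>redis</code> hostnames are assumed in-cluster service names, not values taken from this article:</p>

```yaml
# Env entries shared by the n8n main pod and the workers (queue mode).
# Hostnames below are assumed Kubernetes service names.
- name: EXECUTIONS_MODE
  value: "queue"              # hand executions to the Redis-backed queue
- name: QUEUE_BULL_REDIS_HOST
  value: "redis"
- name: DB_TYPE
  value: "postgresdb"
- name: DB_POSTGRESDB_HOST
  value: "postgres"
# Worker pods run the same image with the `n8n worker` command.
```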
<p>This architecture provides:</p>
<ul>
<li>Clear separation of concerns</li>
<li>Scalability at each layer</li>
<li>High availability through redundancy</li>
<li>Security through proper isolation</li>
</ul>
<p>In the next sections, we'll dive into the implementation details of each component, starting with the AKS cluster setup.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this architecture overview, we've laid out the foundation for a robust n8n deployment on Azure Kubernetes Service. This architecture addresses the core requirements for a production deployment: high availability, scalability, security, and maintainability.</p>
<p>By separating our architecture into distinct layers—data, application, and external access—we've created a modular design that's easier to maintain and troubleshoot. Each component has a specific responsibility, with clear interfaces between them.</p>
<p>Along the way, we've seen why n8n is a powerful choice for workflow automation and how Kubernetes provides the ideal platform for a scalable, reliable implementation.</p>
<p>In the next article, we'll turn this architecture into reality by setting up our Azure Kubernetes Service cluster, configuring networking, and preparing the persistent storage foundation. [Continue to Part 2: Setting Up the Foundation]</p>
<hr />
<p>Have you deployed n8n or similar workflow tools in Kubernetes? Share your experience in the comments!</p>
<p><em>This is the end of Part 1. Continue to [Part 2: Setting Up the Foundation] to learn how to set up your Azure Kubernetes Service cluster and prepare the foundation for your n8n deployment.</em></p>
]]></content:encoded></item><item><title><![CDATA[Series Introduction: Building a Production-Ready n8n Workflow Automation Platform on Azure Kubernetes Service]]></title><description><![CDATA[Welcome to This Series!
In this comprehensive 8-part series, I'll take you through the complete journey of deploying n8n workflow automation platform on Azure Kubernetes Service (AKS) using a first principles approach. Rather than just providing conf...]]></description><link>https://nikhilmishra.xyz/series-introduction-building-a-production-ready-n8n-workflow-automation-platform-on-azure-kubernetes-service</link><guid isPermaLink="true">https://nikhilmishra.xyz/series-introduction-building-a-production-ready-n8n-workflow-automation-platform-on-azure-kubernetes-service</guid><category><![CDATA[n8n]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[k8s]]></category><category><![CDATA[Azure]]></category><category><![CDATA[Devops]]></category><category><![CDATA[automation]]></category><category><![CDATA[self-hosted]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Sun, 02 Mar 2025 08:04:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740902312840/94c2ed6c-7b5d-46bb-a3b3-ccefd9f998f3.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-welcome-to-this-series">Welcome to This Series!</h2>
<p>In this comprehensive 8-part series, I'll take you through the complete journey of deploying <a target="_blank" href="https://n8n.io/">n8n</a> workflow automation platform on Azure Kubernetes Service (AKS) using a first principles approach. Rather than just providing configuration files, we'll explore the reasoning behind each design decision and build up a production-grade system step by step.</p>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<p>By the end of this series, you'll understand:</p>
<ul>
<li>How to design a robust workflow automation architecture from first principles</li>
<li>Best practices for deploying stateful applications on Kubernetes</li>
<li>Implementation of queue-based processing for reliable workflow execution</li>
<li>Security hardening for production deployments</li>
<li>Monitoring, maintenance, and troubleshooting techniques specific to n8n and AKS</li>
<li>Performance and cost optimization strategies</li>
</ul>
<h2 id="heading-who-this-series-is-for">Who This Series Is For</h2>
<p>This series is designed for:</p>
<ul>
<li>DevOps engineers looking to deploy workflow automation tools</li>
<li>Kubernetes administrators seeking stateful application examples</li>
<li>n8n users wanting to scale beyond basic deployments</li>
<li>Cloud architects interested in production-grade Azure implementations</li>
<li>Automation specialists exploring enterprise-ready platforms</li>
</ul>
<p>While some familiarity with Kubernetes concepts and Azure is helpful, I'll explain key concepts along the way to make this accessible to those newer to these technologies.</p>
<h2 id="heading-why-n8n-on-kubernetes">Why n8n on Kubernetes?</h2>
<p>n8n is a powerful workflow automation tool similar to Zapier or Integromat, but with the advantage of being self-hosted. This means you maintain complete control over your data and workflows.</p>
<p>Running n8n on Kubernetes provides several key benefits:</p>
<ul>
<li><strong>High availability</strong>: Ensure your automation workflows run 24/7</li>
<li><strong>Scalability</strong>: Handle growing workflow volume with automatic scaling</li>
<li><strong>Resource efficiency</strong>: Optimize resource allocation based on actual demand</li>
<li><strong>Simplified management</strong>: Standardize deployment, updates, and monitoring</li>
</ul>
<h2 id="heading-series-overview">Series Overview</h2>
<p>Here's what we'll cover in this 8-part journey:</p>
<p><strong>Part 1: Introduction and Architecture Overview</strong>
Understanding the core principles behind a production workflow system and designing a robust architecture.</p>
<p><strong>Part 2: Setting Up the Foundation</strong>
Creating the AKS cluster, configuring namespaces, networking, and persistent storage.</p>
<p><strong>Part 3: Data Layer Implementation</strong>
Deploying PostgreSQL and Redis with security best practices and proper persistence.</p>
<p><strong>Part 4: Application Layer</strong>
Implementing the n8n main service and worker nodes with queue-based processing.</p>
<p><strong>Part 5: External Access and Security</strong>
Configuring ingress, SSL/TLS encryption, and secure access patterns.</p>
<p><strong>Part 6: Monitoring and Optimization</strong>
Setting up monitoring, maintenance procedures, and performance optimization.</p>
<p><strong>Part 7: Troubleshooting Guide</strong>
Comprehensive approach to diagnosing and resolving common issues.</p>
<p><strong>Part 8: Conclusion and Next Steps</strong>
Summarizing our implementation, reviewing benefits, and exploring advanced enhancements.</p>
<h2 id="heading-lets-get-started">Let's Get Started!</h2>
<p>Join me on this journey as we build a production-grade n8n deployment from the ground up. Each article in the series will build upon the previous ones, creating a complete system that is scalable, secure, and maintainable.</p>
]]></content:encoded></item><item><title><![CDATA[Building a Production-Ready Static Website with AWS EC2, Nginx, and Cloudflare]]></title><description><![CDATA[Introduction
In today's digital landscape, deploying static websites efficiently, securely, and cost-effectively is a fundamental skill for developers. In this technical deep dive, I'll walk you through creating a production-grade static website host...]]></description><link>https://nikhilmishra.xyz/building-a-production-ready-static-website-with-aws-ec2-nginx-and-cloudflare</link><guid isPermaLink="true">https://nikhilmishra.xyz/building-a-production-ready-static-website-with-aws-ec2-nginx-and-cloudflare</guid><category><![CDATA[AWS]]></category><category><![CDATA[nginx]]></category><category><![CDATA[cloudflare]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Web Development]]></category><category><![CDATA[ec2]]></category><category><![CDATA[Let's Encrypt]]></category><category><![CDATA[deployment]]></category><category><![CDATA[Security]]></category><category><![CDATA[Performance Optimization]]></category><category><![CDATA[Linux]]></category><category><![CDATA[Bash]]></category><category><![CDATA[bash script]]></category><category><![CDATA[infrastructure]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Fri, 28 Feb 2025 17:38:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740763989118/e2d168b0-7fa5-4077-b801-091be27191ee.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In today's digital landscape, deploying static websites efficiently, securely, and cost-effectively is a fundamental skill for developers. In this technical deep dive, I'll walk you through creating a production-grade static website hosting solution using AWS EC2, Nginx, Cloudflare, and Let's Encrypt. This project showcases a robust deployment pipeline that ensures reliability, security, and performance.</p>
<p><img src="https://images.unsplash.com/photo-1517694712202-14dd9538aa97?ixlib=rb-1.2.1&amp;auto=format&amp;fit=crop&amp;w=1350&amp;q=80" alt="Static Website Server Banner" /></p>
<h2 id="heading-project-overview">Project Overview</h2>
<p>We're building a complete static site hosting solution with these core components:</p>
<ul>
<li><strong>AWS EC2 Instance</strong> (Amazon Linux 2023) - Our cloud server</li>
<li><strong>Nginx</strong> - Our high-performance web server</li>
<li><strong>Let's Encrypt</strong> - For free, automated SSL/TLS certificates</li>
<li><strong>Cloudflare</strong> - For DNS management, CDN, and additional security layers</li>
<li><strong>Custom Bash Deployment Script</strong> - For automated, secure deployments</li>
</ul>
<h2 id="heading-technical-architecture">Technical Architecture</h2>
<p>Let's begin by understanding the system architecture:</p>
<pre><code class="lang-mermaid">graph TD
    A[Client Browser] --&gt;|HTTPS Request| B[Cloudflare DNS]
    B --&gt;|HTTP Request| C[AWS EC2 Instance]
    C --&gt;|Serves| D[Nginx Web Server]
    D --&gt;|Hosts| E[Static Website Files]
    F[Local Development Environment] --&gt;|Deploy via SCP| C

    subgraph "AWS Cloud"
        C
        D
        E
    end

    subgraph "Cloudflare"
        B --&gt;|SSL Termination| B1[Edge Server]
        B1 --&gt;|Cache| B2[CDN]
    end

    classDef aws fill:#FF9900,stroke:#232F3E,color:white;
    classDef cloudflare fill:#F6821F,stroke:#232F3E,color:white;
    classDef nginx fill:#009639,stroke:#232F3E,color:white;
    class C,E aws;
    class B,B1,B2 cloudflare;
    class D nginx;
</code></pre>
<p>This diagram illustrates how client requests flow through our infrastructure:</p>
<ol>
<li>The client's browser sends an HTTPS request to our domain</li>
<li>Cloudflare handles DNS resolution and SSL termination at its edge servers</li>
<li>The request is forwarded to our AWS EC2 instance</li>
<li>Nginx processes the request and serves the static files</li>
<li>The response flows back through the same path to the client</li>
</ol>
<p>The architecture leverages Cloudflare's global CDN for improved performance and DDoS protection, while keeping our server setup lean and focused.</p>
<h2 id="heading-automated-deployment-pipeline">Automated Deployment Pipeline</h2>
<p>For consistent and reliable deployments, we've created a robust bash script that uses SCP to securely transfer files from your local environment to the server:</p>
<pre><code class="lang-mermaid">flowchart TD
    A[Start Deployment] --&gt; B{SSH Key Exists?}
    B --&gt;|No| C[Error: SSH Key Not Found]
    B --&gt;|Yes| D[Test SSH Connection]
    D --&gt;|Failed| E[Error: SSH Connection Failed]
    D --&gt;|Success| F[Create Temporary Directory on Server]
    F --&gt;|Success| G[Copy Files to Temporary Directory]
    G --&gt;|Failed| H[Clean Up &amp; Exit]
    G --&gt;|Success| I[Move Files to Final Location]
    I --&gt;|Failed| J[Clean Up &amp; Exit]
    I --&gt;|Success| K[Deployment Complete]

    style A fill:#4CAF50,stroke:#006400,color:white
    style K fill:#4CAF50,stroke:#006400,color:white
    style C fill:#FF5252,stroke:#B71C1C,color:white
    style E fill:#FF5252,stroke:#B71C1C,color:white
    style H fill:#FF5252,stroke:#B71C1C,color:white
    style J fill:#FF5252,stroke:#B71C1C,color:white
</code></pre>
<p>Here's the deployment script:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>

<span class="hljs-comment">########</span>
<span class="hljs-comment"># Author: Your Name</span>
<span class="hljs-comment"># Date: 2025-02-28</span>
<span class="hljs-comment">#</span>
<span class="hljs-comment"># Version: v1.2</span>
<span class="hljs-comment">#</span>
<span class="hljs-comment"># Static Site Server Deployment Script</span>
<span class="hljs-comment">#</span>
<span class="hljs-comment"># This script uses scp to sync your static site from your local machine to a remote server.</span>
<span class="hljs-comment">########</span>

<span class="hljs-comment"># Enable debug mode</span>
<span class="hljs-built_in">set</span> -x

<span class="hljs-comment"># Change to script directory</span>
<span class="hljs-built_in">cd</span> <span class="hljs-string">"<span class="hljs-subst">$(dirname <span class="hljs-string">"<span class="hljs-variable">$0</span>"</span>)</span>"</span> || <span class="hljs-built_in">exit</span>

<span class="hljs-comment"># Remote server details</span>
REMOTE_USER=<span class="hljs-string">"ec2-user"</span>
REMOTE_HOST=<span class="hljs-string">"your-ec2-ip-address"</span>
REMOTE_DIR=<span class="hljs-string">"/usr/share/nginx/html"</span>

<span class="hljs-comment"># SSH key path</span>
SSH_KEY=<span class="hljs-string">"<span class="hljs-variable">$HOME</span>/.ssh/your_key.pem"</span>

<span class="hljs-comment"># Check if SSH key exists</span>
<span class="hljs-keyword">if</span> [ ! -f <span class="hljs-string">"<span class="hljs-variable">$SSH_KEY</span>"</span> ]; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Error: SSH key not found at <span class="hljs-variable">$SSH_KEY</span>"</span>
    <span class="hljs-built_in">exit</span> 1
<span class="hljs-keyword">fi</span>

<span class="hljs-comment"># Test SSH connection first</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Testing SSH connection..."</span>
<span class="hljs-keyword">if</span> ! ssh -i <span class="hljs-string">"<span class="hljs-variable">$SSH_KEY</span>"</span> -o BatchMode=yes -o ConnectTimeout=5 <span class="hljs-string">"<span class="hljs-variable">$REMOTE_USER</span>@<span class="hljs-variable">$REMOTE_HOST</span>"</span> <span class="hljs-built_in">echo</span> <span class="hljs-string">"SSH connection successful"</span>; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Error: SSH connection failed. Please check your SSH key and server configuration."</span>
    <span class="hljs-built_in">exit</span> 1
<span class="hljs-keyword">fi</span>

<span class="hljs-comment"># Create a temporary directory on the remote server</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Creating temporary directory on remote server..."</span>
TEMP_DIR=<span class="hljs-string">"/tmp/static-site-<span class="hljs-subst">$(date +%s)</span>"</span>
<span class="hljs-keyword">if</span> ! ssh -i <span class="hljs-string">"<span class="hljs-variable">$SSH_KEY</span>"</span> <span class="hljs-string">"<span class="hljs-variable">$REMOTE_USER</span>@<span class="hljs-variable">$REMOTE_HOST</span>"</span> <span class="hljs-string">"mkdir -p <span class="hljs-variable">$TEMP_DIR</span>"</span>; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Error: Failed to create temporary directory"</span>
    <span class="hljs-built_in">exit</span> 1
<span class="hljs-keyword">fi</span>

<span class="hljs-comment"># Copy files to temporary directory</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Copying files to remote server..."</span>
<span class="hljs-keyword">if</span> ! scp -i <span class="hljs-string">"<span class="hljs-variable">$SSH_KEY</span>"</span> -r ./* <span class="hljs-string">"<span class="hljs-variable">$REMOTE_USER</span>@<span class="hljs-variable">$REMOTE_HOST</span>:<span class="hljs-variable">$TEMP_DIR</span>/"</span>; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Error: Failed to copy files"</span>
    ssh -i <span class="hljs-string">"<span class="hljs-variable">$SSH_KEY</span>"</span> <span class="hljs-string">"<span class="hljs-variable">$REMOTE_USER</span>@<span class="hljs-variable">$REMOTE_HOST</span>"</span> <span class="hljs-string">"rm -rf <span class="hljs-variable">$TEMP_DIR</span>"</span>
    <span class="hljs-built_in">exit</span> 1
<span class="hljs-keyword">fi</span>

<span class="hljs-comment"># Move files to final location</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Moving files to final location..."</span>
<span class="hljs-keyword">if</span> ! ssh -i <span class="hljs-string">"<span class="hljs-variable">$SSH_KEY</span>"</span> <span class="hljs-string">"<span class="hljs-variable">$REMOTE_USER</span>@<span class="hljs-variable">$REMOTE_HOST</span>"</span> <span class="hljs-string">"sudo rm -rf <span class="hljs-variable">$REMOTE_DIR</span>/* &amp;&amp; sudo cp -r <span class="hljs-variable">$TEMP_DIR</span>/* <span class="hljs-variable">$REMOTE_DIR</span>/ &amp;&amp; sudo chown -R nginx:nginx <span class="hljs-variable">$REMOTE_DIR</span> &amp;&amp; sudo chmod -R 755 <span class="hljs-variable">$REMOTE_DIR</span> &amp;&amp; rm -rf <span class="hljs-variable">$TEMP_DIR</span>"</span>; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Error: Failed to move files to final location"</span>
    ssh -i <span class="hljs-string">"<span class="hljs-variable">$SSH_KEY</span>"</span> <span class="hljs-string">"<span class="hljs-variable">$REMOTE_USER</span>@<span class="hljs-variable">$REMOTE_HOST</span>"</span> <span class="hljs-string">"rm -rf <span class="hljs-variable">$TEMP_DIR</span>"</span>
    <span class="hljs-built_in">exit</span> 1
<span class="hljs-keyword">fi</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Deployment completed successfully!"</span>
</code></pre>
<p>This script includes several best practices:</p>
<ul>
<li>SSH connection validation before attempting deployment</li>
<li>Temporary directory usage for atomic deployments</li>
<li>Proper error handling with cleanup on failure</li>
<li>Appropriate file permissions for security</li>
</ul>
<h2 id="heading-cloudflare-integration-dns-and-security">Cloudflare Integration: DNS and Security</h2>
<p>Cloudflare provides an additional layer of protection and performance optimization:</p>
<pre><code class="lang-mermaid">sequenceDiagram
    participant User as User
    participant Browser as Browser
    participant Cloudflare as Cloudflare
    participant EC2 as EC2 Instance
    participant Nginx as Nginx Server

    User-&gt;&gt;Browser: Enter your-domain.com
    Browser-&gt;&gt;Cloudflare: DNS Resolution
    Cloudflare--&gt;&gt;Browser: IP Address (EC2)
    Browser-&gt;&gt;Cloudflare: HTTPS Request
    Note over Cloudflare: SSL Termination
    Cloudflare-&gt;&gt;EC2: HTTP Request
    EC2-&gt;&gt;Nginx: Forward Request
    Nginx-&gt;&gt;Nginx: Process Request
    Note over Nginx: Find Static Files
    Nginx--&gt;&gt;EC2: Serve HTML/CSS/JS
    EC2--&gt;&gt;Cloudflare: HTTP Response
    Cloudflare--&gt;&gt;Browser: HTTPS Response
    Browser--&gt;&gt;User: Display Content
</code></pre>
<p>To set up Cloudflare:</p>
<ol>
<li>Add your domain to Cloudflare and update nameservers</li>
<li>Create an A record pointing to your EC2 instance's IP address</li>
<li>Configure SSL/TLS settings:<ul>
<li>For maximum security: Full (strict) mode (requires valid SSL cert on server)</li>
<li>For simpler setup: Full mode (works with self-signed certs)</li>
</ul>
</li>
<li>Enable additional security features:<ul>
<li>Always Use HTTPS</li>
<li>HSTS (HTTP Strict Transport Security)</li>
<li>Browser Integrity Check</li>
</ul>
</li>
</ol>
<h2 id="heading-performance-optimization">Performance Optimization</h2>
<p>To ensure optimal performance, we implemented several optimizations:</p>
<ol>
<li><p><strong>Nginx Configuration Tuning</strong>:</p>
<ul>
<li>Gzip compression for reduced bandwidth usage</li>
<li>Optimized worker processes based on CPU cores</li>
<li>File cache settings for frequently accessed content</li>
</ul>
</li>
<li><p><strong>Cloudflare Performance Settings</strong>:</p>
<ul>
<li>Auto Minify for HTML, CSS, and JavaScript</li>
<li>Brotli compression (more efficient than gzip)</li>
<li>Rocket Loader for asynchronous JavaScript loading</li>
</ul>
</li>
<li><p><strong>Static Asset Optimization</strong>:</p>
<ul>
<li>WebP image format for better compression</li>
<li>Defer loading of non-critical resources</li>
<li>Cache control headers for optimal browser caching</li>
</ul>
</li>
</ol>
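<p>The Nginx side of these optimizations can be sketched as a config fragment. This is an illustrative starting point rather than our exact production file; the cache sizes and timeouts are placeholder values to tune for your own traffic:</p>
<pre><code class="lang-nginx"># In the main context of nginx.conf: one worker per CPU core
worker_processes auto;

# In the http block: gzip compression for text-based assets
gzip on;
gzip_types text/css application/javascript application/json image/svg+xml;
gzip_min_length 1024;

# Cache open file descriptors for frequently accessed content
open_file_cache max=1000 inactive=20s;
open_file_cache_valid 30s;
open_file_cache_min_uses 2;
</code></pre>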
<h2 id="heading-monitoring-and-maintenance">Monitoring and Maintenance</h2>
<p>For ongoing maintenance, we set up:</p>
<ol>
<li><p><strong>Log Rotation</strong>:</p>
<pre><code class="lang-bash"># -d runs logrotate in debug mode: it reports what would be rotated without changing anything
sudo logrotate -d /etc/logrotate.d/nginx
</code></pre>
</li>
<li><p><strong>Simple Health Check Script</strong>:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
<span class="hljs-comment"># health_check.sh</span>
HTTP_STATUS=$(curl -s -o /dev/null -w <span class="hljs-string">"%{http_code}"</span> https://your-domain.com)
<span class="hljs-keyword">if</span> [ <span class="hljs-string">"<span class="hljs-variable">$HTTP_STATUS</span>"</span> -ne 200 ]; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Site is down! HTTP Status: <span class="hljs-variable">$HTTP_STATUS</span>"</span>
    <span class="hljs-comment"># Add notification logic here (email, SMS, etc.)</span>
<span class="hljs-keyword">fi</span>
</code></pre>
</li>
<li><p><strong>Basic Performance Monitoring</strong>:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Monitor Nginx processes and resource usage every 5 seconds</span>
<span class="hljs-comment"># (the [n] in [n]ginx stops grep from matching its own process)</span>
watch -n 5 <span class="hljs-string">"ps aux | grep '[n]ginx'"</span>
</code></pre>
</li>
</ol>
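<p>The alert logic in the health check can be factored into a small function, which makes it easy to test offline and to extend with notifications later. A minimal sketch, assuming any non-200 status should alert; <code>classify_status</code> is a name chosen here for illustration:</p>
<pre><code class="lang-bash">#!/bin/bash
# classify_status: print an alert line for any non-200 HTTP status; stay silent otherwise
classify_status() {
    local status="$1"
    if [ "$status" -ne 200 ]; then
        echo "Site is down! HTTP Status: $status"
    fi
}

# In health_check.sh, feed it the live status:
#   classify_status "$(curl -s -o /dev/null -w '%{http_code}' https://your-domain.com)"

classify_status 503   # prints the alert line
classify_status 200   # prints nothing
</code></pre>
<p>A crontab entry such as <code>*/5 * * * * /usr/local/bin/health_check.sh</code> would run the check every five minutes.</p>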
<h2 id="heading-challenges-and-solutions">Challenges and Solutions</h2>
<h3 id="heading-challenge-1-atomic-deployments">Challenge 1: Atomic Deployments</h3>
<p><strong>Problem</strong>: How to update the site without downtime or showing partial updates?</p>
<p><strong>Solution</strong>: Our deployment script uses a temporary directory approach, only replacing the files after a complete copy is successful. This ensures users never see a partially updated site.</p>
<h3 id="heading-challenge-2-ssl-certificate-management">Challenge 2: SSL Certificate Management</h3>
<p><strong>Problem</strong>: Manual SSL certificate renewal is error-prone and can lead to outages.</p>
<p><strong>Solution</strong>: Automated certificate renewal through certbot's cron job:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-string">"0 3 * * * root certbot renew --quiet"</span> | sudo tee -a /etc/crontab
<span class="hljs-comment"># Confirm renewal works end to end without touching live certificates:</span>
sudo certbot renew --dry-run
</code></pre>
<h3 id="heading-challenge-3-security-hardening">Challenge 3: Security Hardening</h3>
<p><strong>Problem</strong>: Default configurations are often not secure enough for production.</p>
<p><strong>Solution</strong>: Implemented multiple security layers:</p>
<ul>
<li>Strong Nginx security headers</li>
<li>Cloudflare WAF (Web Application Firewall)</li>
<li>Regular security patches via automatic updates</li>
<li>Limited SSH access to specific IP addresses</li>
</ul>
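<p>The "strong Nginx security headers" layer can be illustrated with a server-block fragment. Treat this as a starting point rather than our exact configuration; HSTS in particular should only be enabled once you are sure every subdomain serves HTTPS:</p>
<pre><code class="lang-nginx"># Inside the server block
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
</code></pre>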
<h2 id="heading-future-enhancements">Future Enhancements</h2>
<p>Looking ahead, several enhancements could further improve this setup:</p>
<ol>
<li><p><strong>CI/CD Pipeline Integration</strong>: Connecting with GitHub Actions or similar CI/CD tools for automated testing and deployment.</p>
</li>
<li><p><strong>Infrastructure as Code</strong>: Converting the manual setup to Terraform or CloudFormation templates.</p>
</li>
<li><p><strong>Advanced Monitoring</strong>: Implementing more comprehensive monitoring with tools like Prometheus and Grafana.</p>
</li>
<li><p><strong>Content Versioning</strong>: Implementing a blue-green deployment strategy for zero-downtime updates with rollback capability.</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This project demonstrates how to build a robust, secure, and performant static website hosting infrastructure using AWS EC2, Nginx, Let's Encrypt, and Cloudflare. The architecture provides multiple layers of security, optimized performance, and a streamlined deployment process.</p>
<p>By following these steps, you can create a production-grade hosting environment for static websites that strikes an excellent balance between cost, performance, and security. The modular approach also makes it easy to scale or modify specific components as your needs evolve.</p>
<hr />
<p><em>Want to see the full code? Check out the <a target="_blank" href="https://github.com/kaalpanikh/static-site-server">project repository</a> on GitHub!</em></p>
<p><em>Published on February 28, 2025</em></p>
]]></content:encoded></item><item><title><![CDATA[🔄 Custom Subdomain Forwarding with Cloudflare]]></title><description><![CDATA[🎯 What We're Building




Subdomain → Forwards To

iam.nikhilmishra.live → bento.me/kaalpanikh

links.nikhilmishra.live → linktr.ee/kaalpanikh


🔄 How It Works
%%{init: {"theme": "default"}}%%
graph LR
    A[User] -->|Visits| B[iam.nikhilmishra.live]
    ...]]></description><link>https://nikhilmishra.xyz/custom-subdomain-forwarding-with-cloudflare</link><guid isPermaLink="true">https://nikhilmishra.xyz/custom-subdomain-forwarding-with-cloudflare</guid><category><![CDATA[domain forwarding]]></category><category><![CDATA[pagerules]]></category><category><![CDATA[cloudflare]]></category><category><![CDATA[subdomains]]></category><category><![CDATA[bento]]></category><category><![CDATA[linktree]]></category><category><![CDATA[dns]]></category><category><![CDATA[Link In Bio]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Wed, 19 Feb 2025 09:22:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1739956626514/03787f93-f5d5-4d83-9157-08714866ad67.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-what-were-building">🎯 What We're Building</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Subdomain</td><td>Forwards To</td></tr>
</thead>
<tbody>
<tr>
<td><a target="_blank" href="https://iam.nikhilmishra.live/">iam.nikhilmishra.live</a></td><td><a target="_blank" href="https://bento.me/kaalpanikh">bento.me/kaalpanikh</a></td></tr>
<tr>
<td><a target="_blank" href="https://links.nikhilmishra.live/">links.nikhilmishra.live</a></td><td><a target="_blank" href="https://linktr.ee/kaalpanikh">linktr.ee/kaalpanikh</a></td></tr>
</tbody>
</table>
</div><h2 id="heading-how-it-works">🔄 How It Works</h2>
<pre><code class="lang-mermaid">%%{init: {"theme": "default"}}%%
graph LR
    A[User] --&gt;|Visits| B[iam.nikhilmishra.live]
    A --&gt;|Visits| C[links.nikhilmishra.live]

    subgraph Cloudflare
        B --&gt;|DNS Record| D[CNAME to @]
        C --&gt;|DNS Record| E[CNAME to @]
        D --&gt;|Page Rule| F[301 Redirect]
        E --&gt;|Page Rule| G[301 Redirect]
    end

    F --&gt;|Forwards to| H[bento.me/kaalpanikh]
    G --&gt;|Forwards to| I[linktr.ee/kaalpanikh]

    style Cloudflare fill:#F6821F,stroke:#F6821F,stroke-width:2px
    style H fill:#5D45F9,stroke:#5D45F9,stroke-width:2px
    style I fill:#39E09B,stroke:#39E09B,stroke-width:2px
</code></pre>
<blockquote>
<p><strong>Note:</strong> This setup uses Cloudflare as a workaround since free Bento and Linktree plans don't support custom domains.</p>
</blockquote>
<h2 id="heading-step-by-step-guide">📝 Step-by-Step Guide</h2>
<h3 id="heading-1-create-dns-records">1️⃣ Create DNS Records</h3>
<ol>
<li><p>Log in to your <strong>Cloudflare Dashboard</strong></p>
</li>
<li><p>Select your domain: <code>nikhilmishra.live</code></p>
</li>
<li><p>Go to <strong>DNS</strong> → <strong>Records</strong></p>
</li>
<li><p>Add the following records:</p>
</li>
</ol>
<h4 id="heading-for-bento-profile">For Bento Profile</h4>
<pre><code class="lang-plaintext">Type:   CNAME
Name:   iam
Target: @
Proxy:  ✅ Enabled (Orange Cloud)
</code></pre>
<h4 id="heading-for-linktree-profile">For Linktree Profile</h4>
<pre><code class="lang-plaintext">Type:   CNAME
Name:   links
Target: @
Proxy:  ✅ Enabled (Orange Cloud)
</code></pre>
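<p>If you prefer the API over the dashboard, the same proxied CNAME records can be created through Cloudflare's DNS records endpoint. A sketch; <code>$ZONE_ID</code> and <code>$CF_API_TOKEN</code> are placeholders for your zone ID and an API token with DNS edit permission:</p>
<pre><code class="lang-bash"># Create the proxied CNAME record for the "iam" subdomain
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"type":"CNAME","name":"iam","content":"nikhilmishra.live","proxied":true}'
</code></pre>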
<h3 id="heading-2-set-up-page-rules">2️⃣ Set Up Page Rules</h3>
<ol>
<li><p>Navigate to <strong>Rules</strong> → <strong>Page Rules</strong></p>
</li>
<li><p>Create two page rules:</p>
</li>
</ol>
<h4 id="heading-bento-redirect">Bento Redirect</h4>
<pre><code class="lang-plaintext">URL Pattern: https://iam.nikhilmishra.live/*
Forward to: https://bento.me/kaalpanikh
Status:     301 (Permanent Redirect)
</code></pre>
<h4 id="heading-linktree-redirect">Linktree Redirect</h4>
<pre><code class="lang-plaintext">URL Pattern: https://links.nikhilmishra.live/*
Forward to: https://linktr.ee/kaalpanikh
Status:     301 (Permanent Redirect)
</code></pre>
<h3 id="heading-3-verify-setup">3️⃣ Verify Setup</h3>
<p>After DNS propagation (usually 5-10 minutes):</p>
<ol>
<li><p>Visit <a target="_blank" href="https://iam.nikhilmishra.live/">iam.nikhilmishra.live</a></p>
<ul>
<li>Should redirect to your Bento profile.</li>
</ul>
</li>
<li><p>Visit <a target="_blank" href="https://links.nikhilmishra.live/">links.nikhilmishra.live</a></p>
<ul>
<li>Should redirect to your Linktree profile.</li>
</ul>
</li>
</ol>
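<p>You can also confirm the redirects from the command line. The helper below extracts the <code>Location</code> header from response headers on stdin; the function name is ours, chosen for illustration:</p>
<pre><code class="lang-bash"># Print the Location header value from HTTP response headers on stdin
extract_location() {
    grep -i '^location:' | awk '{print $2}' | tr -d '\r'
}

# Live check (after DNS propagation):
#   curl -sI https://iam.nikhilmishra.live | extract_location
#   should print https://bento.me/kaalpanikh

# Offline demonstration with a sample 301 response:
printf 'HTTP/2 301\r\nlocation: https://bento.me/kaalpanikh\r\n' | extract_location
</code></pre>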
<h2 id="heading-troubleshooting">⚠️ Troubleshooting</h2>
<p>If redirects aren't working:</p>
<ol>
<li><p>Check if Cloudflare proxy is enabled (orange cloud).</p>
</li>
<li><p>Verify page rules are in the correct order.</p>
</li>
<li><p>Clear your browser cache.</p>
</li>
<li><p>Wait a few more minutes for DNS propagation.</p>
</li>
</ol>
<h2 id="heading-useful-links">🔗 Useful Links</h2>
<ul>
<li><p><a target="_blank" href="https://dash.cloudflare.com/">Cloudflare Dashboard</a></p>
</li>
<li><p><a target="_blank" href="https://bento.me/kaalpanikh">Bento Profile</a></p>
</li>
<li><p><a target="_blank" href="https://linktr.ee/kaalpanikh">Linktree Profile</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[I escalated my app to the cloud !]]></title><description><![CDATA[Context :
I deployed a web app locally, and now I want to move it to the cloud for scalability, reliability, ease of management, and automation.
Here, I am using IaaS.
Major Services used will be :

ELB for Nginx

EC2 instead of local vms

Route 53 f...]]></description><link>https://nikhilmishra.xyz/i-escalated-my-app-to-the-cloud</link><guid isPermaLink="true">https://nikhilmishra.xyz/i-escalated-my-app-to-the-cloud</guid><category><![CDATA[AWS]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[ec2]]></category><category><![CDATA[ACM]]></category><category><![CDATA[route53]]></category><category><![CDATA[godaddy]]></category><category><![CDATA[Load Balancer]]></category><category><![CDATA[autoscaling]]></category><category><![CDATA[i am]]></category><dc:creator><![CDATA[Nikhil Mishra]]></dc:creator><pubDate>Thu, 11 Jul 2024 12:45:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1720694526119/54e02dab-6a1c-4d48-ac42-7385cffa3bfe.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-context">Context :</h2>
<p>I deployed a web app locally, and now I want to move it to the cloud for <mark>scalability, reliability, ease of management, and automation.</mark></p>
<p>Here, I am using <mark>IaaS.</mark></p>
<h3 id="heading-major-services-used-will-be"><strong>Major services used</strong>:</h3>
<ul>
<li><p>ELB for Nginx</p>
</li>
<li><p>EC2 instead of local VMs</p>
</li>
<li><p>Route 53 for DNS</p>
</li>
<li><p>S3 for artifact storage</p>
</li>
<li><p>Auto Scaling Group</p>
</li>
<li><p>IAM</p>
</li>
<li><p>ACM</p>
</li>
</ul>
<h2 id="heading-the-architecture-changes">The Architecture Changes:</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720697331498/10cf0e6e-af3f-4f30-b144-1eb6c312ac02.png" alt="The Architecture" class="image--center mx-auto" /></p>
<blockquote>
<ol>
<li><p>The user will log in through an endpoint hosted on GoDaddy.</p>
</li>
<li><p>They will access the endpoint via HTTPS, with the certificate managed by ACM.</p>
</li>
<li><p>The user will connect to the ELB endpoint, which only permits HTTPS traffic.</p>
</li>
<li><p>The ELB will then route the user to the application server.</p>
</li>
<li><p>The application server consists of Tomcat instances, managed by an autoscaling group that adjusts based on traffic and only allows traffic on port 8080.</p>
</li>
<li><p>The application server requires access to backend servers, managed by a Route 53 private hosted zone.</p>
</li>
<li><p>These backend servers, which include MySQL, RabbitMQ, and Memcache, are in a separate security group and will only allow traffic on their specific ports.</p>
</li>
</ol>
</blockquote>
<p><strong>I have a certificate ready in <mark>ACM</mark> issued by <mark>GoDaddy</mark>, which I obtained by requesting and adding a CNAME record.</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720698065731/1a3161a9-1a60-463b-818b-c0b936bf0f55.png" alt="godaddy cname record" class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720698097088/d9bd9beb-1715-4443-8a9e-fc8a41f483d8.png" alt="acm" class="image--center mx-auto" /></p>
<h2 id="heading-now-i-made-3-security-groups"><strong>Now I made 3 security groups:</strong></h2>
<p>For the load balancer, allowing HTTP and HTTPS traffic.</p>
<p>For the app server, allowing port 8080 from the load balancer's security group. Also added SSH and port 8080 access from my IP.</p>
<p>For the backend, allowing port 3306 for MySQL, 11211 for Memcached, and 5672 for RabbitMQ from the app server's security group. Also allowed all traffic within the group and SSH for validation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720698219004/d092cbde-577e-4d9a-a9a4-63fcf8f2ff75.png" alt="security group" class="image--center mx-auto" /></p>
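<p>For reference, the app-server rule above can be expressed with the AWS CLI. The group IDs below are placeholders; the key idea is that the source of the rule is the load balancer's security group rather than a CIDR range:</p>
<pre><code class="lang-bash"># Allow port 8080 on the app server only from the ELB's security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-APPSERVER \
  --protocol tcp \
  --port 8080 \
  --source-group sg-LOADBALANCER
</code></pre>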
<h2 id="heading-i-also-have-my-key-ready"><strong>I also have my key ready</strong>:</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720698346197/8d84c2b3-7541-4716-bc76-269249aaf0b5.png" alt="key" class="image--center mx-auto" /></p>
<p>Cloned the <mark>repo</mark> and got the scripts for launching instances.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720698495703/782c120d-d642-47b2-ab81-0e990c624d7f.png" alt="cloned repo" class="image--center mx-auto" /></p>
<h2 id="heading-instances">Instances:</h2>
<p>Launched the DB, Memcached, and RabbitMQ instances with Amazon Linux.</p>
<p>Launched the app server with Ubuntu.</p>
<p>Validated that all services and scripts are working.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720698573988/82c5253e-10fb-4af2-8e91-b9d08fe1ae47.png" alt="ec2 instances" class="image--center mx-auto" /></p>
<h3 id="heading-created-a-hosted-zone-with-simple-routing-rules-by-adding-a-records-for-backend-servers-using-their-private-ips">Created a <strong><mark>hosted zone</mark></strong> with simple routing rules by adding A records for backend servers using their private IPs.</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720699137254/3f8068d2-efe7-4927-91bd-f3bc36fb6870.png" alt="hosted zone" class="image--center mx-auto" /></p>
<h2 id="heading-now-you-need-to-build-and-upload-the-artifact"><mark>Now, you need to build and upload the artifact.</mark></h2>
<p>Update the <code>application.properties</code> file with the correct server routes, then build the artifact locally using JDK and Maven.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720699319360/208425a7-66ba-4a28-8b4e-cdbea7a96fe9.png" alt="build" class="image--center mx-auto" /></p>
<p>Create an <strong><mark>IAM user</mark></strong> for uploading the artifact, then create an S3 bucket and push the artifact to it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720700421655/df459b1b-9405-4227-8529-00a245f53943.png" alt="iam user" class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720700519405/cb23bb6b-ef21-478b-a37c-394a2f18e1db.png" alt="aws config" class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720700644074/1628fe5f-d70e-457f-9476-6e7b4775adbb.png" alt="s3" class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720700703141/9d0ee4f8-856c-4110-9f74-af26f495f895.png" alt="artifact" class="image--center mx-auto" /></p>
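<p>The build-and-upload step boils down to two commands. A sketch; the bucket name is a placeholder, and the CLI must first be configured with the IAM user's credentials via <code>aws configure</code>:</p>
<pre><code class="lang-bash"># Build the WAR with Maven
mvn clean install

# Push the artifact to the S3 bucket
aws s3 cp target/*.war s3://my-artifact-bucket/
</code></pre>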
<p>Created an <strong><mark>IAM role</mark></strong> and gave access to the app server to download the artifact and start the Tomcat service for our app.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720700759157/43a930e4-63a1-4b02-b6e0-92bcd45de177.png" alt="i am role" class="image--center mx-auto" /></p>
<h2 id="heading-load-balancer">Load Balancer:</h2>
<p>Now, for setting up a <strong><mark>load balancer</mark></strong>, I created a <strong><mark>target group</mark></strong> and added our instance that listens on port 8080 as the target. I also set up a health check at <code>/login</code> on port 8080.</p>
<p>I then created an internet-facing application load balancer that is available on all subnets. It listens for both HTTP and HTTPS traffic by adding the target group and certificate.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720701012436/e3cfc9c6-a3cb-46e2-a9a6-ab09c197c882.png" alt="load balancer" class="image--center mx-auto" /></p>
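<p>The same target group can be created from the CLI. A sketch with a placeholder VPC ID, matching the <code>/login</code> health check on port 8080 described above:</p>
<pre><code class="lang-bash"># Target group for the Tomcat app servers with a /login health check
aws elbv2 create-target-group \
  --name app-tg \
  --protocol HTTP \
  --port 8080 \
  --vpc-id vpc-PLACEHOLDER \
  --health-check-path /login \
  --health-check-port 8080
</code></pre>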
<p>Now we have our application up and running once we add the ELB endpoint to DNS as a CNAME record.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720701065528/7d8c0ef9-de33-4669-86a2-332bbdad7798.png" alt="loadbalancer endpoint" class="image--center mx-auto" /></p>
<h2 id="heading-heres-our-working-app-with-all-services-validated"><strong><mark>Here's our working app with all services validated</mark></strong></h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/H1aIZtZ50sE">https://youtu.be/H1aIZtZ50sE</a></div>
<p> </p>
<h2 id="heading-auto-scaling">Auto Scaling:</h2>
<p>I wanted to make my app ready to scale up based on traffic. To add an auto-scaling group, I first created an <strong><mark>AMI</mark></strong> of my app instance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720701315267/ee081137-c8b8-4ba0-a210-bb49142312bc.png" alt="ami" class="image--center mx-auto" /></p>
<p>Created a <mark>launch template</mark> using the <mark>AMI</mark>.</p>
<p>Then, I created an <strong>auto-scaling group</strong> with desired triggers and added alarms using <mark>SNS</mark>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720701501670/3f9f6a6e-2dc1-4b49-a0d1-8bb432f3d00d.png" alt="auto scaling group" class="image--center mx-auto" /></p>
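<p>As a sketch, a traffic-based trigger can be attached with a target-tracking policy via the CLI. The names are placeholders, and 50% average CPU is an example threshold, not necessarily the value we used:</p>
<pre><code class="lang-bash"># Keep the group's average CPU utilization around 50%
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name app-asg \
  --policy-name cpu-target-50 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 50.0
  }'
</code></pre>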
<h2 id="heading-and-we-are-done-here-is-our-webapp-working-on-cloud"><strong><em><mark>And we are done. Here is our web app working on the cloud:</mark></em></strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720701577904/1fb0a39a-0d4d-497a-a0a8-4437e47c98ac.png" alt="completed setup" class="image--center mx-auto" /></p>
<blockquote>
<p>If you like what I'm working on, leave some feedback in the comments; a like would be great too. Also, subscribe to my newsletter to get more blogs like this delivered straight to your inbox.</p>
</blockquote>
<p>Thank you!</p>
]]></content:encoded></item></channel></rss>