Avangards Blog

How To Manage an Amazon Bedrock Agent Using Terraform

Anthony Wat — Wed, 01 May 2024 04:30:36 GMT

Introduction

In the previous blog post Building a Basic Forex Rate Assistant Using Agents for Amazon Bedrock, I demonstrated how to create a Bedrock agent in the AWS Management Console and outlined some ideas on improving the solution. Before further experimentation, it makes sense to automate the deployment of the solution to enable quicker updates as we go through trail and error in fine-tuning an agent.

In this blog post, we will automate the deployment of the basic forex rate assistant in Terraform using the resources that were recently released in v5.47.0 of the Terraform AWS Provider. Let's start by looking at the AWS resources in the AWS Management Console.

Taking inventory of the required resources

By examining the agent we previously built, we see that it is comprised of the following AWS resources:

The agent itself
The agent resource role which is an IAM service role that provides the agent with access to other AWS services and resources
The action group that defines API actions that the agent can perform
The Lambda function associated with the action group, which itself requires an execution role and a resource policy that allows the agent to invoke the function

With the list of resources we need to provision, we can begin creating the Terraform configuration starting with the resources that the agent depends on.

Defining resources for the IAM and Lambda dependencies

For the agent resource role, the documentation already provides the trust policy and the permissions required. It also specifies that the prefix AmazonBedrockExecutionRoleForAgents_ must be used for the role name.

The permission requires the foundational model's ARN, so we need at least the model's ID, which in our case is anthropic.claude-3-haiku-20240307-v1:0 for Claude 3 Haiku. For consistency, we will use the aws_bedrock_foundational_model data source to look up its ARN. Thus we can define the Terraform configuration for the agent resource role as follows using the aws_iam_role resource and the aws_iam_role_policy resource:

# Use data sources to get common information about the environmentdata "aws_caller_identity" "this" {}data "aws_partition" "this" {}data "aws_region" "this" {}locals {  account_id = data.aws_caller_identity.this.account_id  partition  = data.aws_partition.this.partition  region     = data.aws_region.this.name}data "aws_bedrock_foundation_model" "this" {  model_id = "anthropic.claude-3-haiku-20240307-v1:0"}# Agent resource roleresource "aws_iam_role" "bedrock_agent_forex_asst" {  name = "AmazonBedrockExecutionRoleForAgents_ForexAssistant"  assume_role_policy = jsonencode({    Version = "2012-10-17"    Statement = [      {        Action = "sts:AssumeRole"        Effect = "Allow"        Principal = {          Service = "bedrock.amazonaws.com"        }        Condition = {          StringEquals = {            "aws:SourceAccount" = local.account_id          }          ArnLike = {            "aws:SourceArn" = "arn:${local.partition}:bedrock:${local.region}:${local.account_id}:agent/*"          }        }      }    ]  })}resource "aws_iam_role_policy" "bedrock_agent_forex_asst" {  name = "AmazonBedrockAgentBedrockFoundationModelPolicy_ForexAssistant"  role = aws_iam_role.bedrock_agent_forex_asst.name  policy = jsonencode({    Version = "2012-10-17"    Statement = [      {        Action   = "bedrock:InvokeModel"        Effect   = "Allow"        Resource = data.aws_bedrock_foundation_model.this.model_arn      }    ]  })}

Next, we will define the Lambda execution role which just needs the basic permissions to write logs to CloudWatch that the AWS-managed IAM policy AWSLambdaBasicExecutionRole provides. The Terraform configuration for this IAM role can be defined as follows:

data "aws_iam_policy" "lambda_basic_execution" {  name = "AWSLambdaBasicExecutionRole"}# Action group Lambda execution roleresource "aws_iam_role" "lambda_forex_api" {  name = "FunctionExecutionRoleForLambda_ForexAPI"  assume_role_policy = jsonencode({    Version = "2012-10-17"    Statement = [      {        Action = "sts:AssumeRole"        Effect = "Allow"        Principal = {          Service = "lambda.amazonaws.com"        }        Condition = {          StringEquals = {            "aws:SourceAccount" = "${local.account_id}"          }        }      }    ]  })  managed_policy_arns = [data.aws_iam_policy.lambda_basic_execution.arn]}

We will then define the Terraform configuration for the Lambda function and its resource policy. Here is the source code for the Forex API Lambda function from the previous blog post for reference:

import jsonimport urllib.parse # urllib is available in Lambda runtime w/o needing a layerimport urllib.requestdef lambda_handler(event, context):    agent = event['agent']    actionGroup = event['actionGroup']    apiPath = event['apiPath']    httpMethod =  event['httpMethod']    parameters = event.get('parameters', [])    requestBody = event.get('requestBody', {})    # Read and process input parameters    code = None    for parameter in parameters:        if (parameter["name"] == "code"):            # Just in case, convert to lowercase as expected by the API            code = parameter["value"].lower()    # Execute your business logic here. For more information, refer to: https://docs.aws.amazon.com/bedrock/latest/userguide/agents-lambda.html    apiPathWithParam = apiPath    # Replace URI path parameters    if code is not None:        apiPathWithParam = apiPathWithParam.replace("{code}", urllib.parse.quote(code))    # TODO: Use a environment variable or Parameter Store to set the URL    url = "https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@latest/v1{apiPathWithParam}.min.json".format(apiPathWithParam = apiPathWithParam)    # Call the currency exchange rates API based on the provided path and wrap the response    apiResponse = urllib.request.urlopen(        urllib.request.Request(            url=url,            headers={"Accept": "application/json"},            method="GET"        )    )    responseBody =  {        "application/json": {            "body": apiResponse.read()        }    }    action_response = {        'actionGroup': actionGroup,        'apiPath': apiPath,        'httpMethod': httpMethod,        'httpStatusCode': 200,        'responseBody': responseBody    }    api_response = {'response': action_response, 'messageVersion': event['messageVersion']}    print("Response: {}".format(api_response))    return api_response

We will save this source code into a file called index.py in the lambda/forex_api directory in the same directory as the Terraform configuration, which will be packaged as a zip file using the archive_file data source to pass as an argument to the aws_lambda_function resource.

Here is the Terraform configuration for the Lambda function based on my battle-tested templates:

# Action group Lambda functiondata "archive_file" "forex_api_zip" {  type             = "zip"  source_file      = "${path.module}/lambda/forex_api/index.py"  output_path      = "${path.module}/tmp/forex_api.zip"  output_file_mode = "0666"}resource "aws_lambda_function" "forex_api" {  function_name = "ForexAPI"  role          = aws_iam_role.lambda_forex_api.arn  description   = "A Lambda function for the forex API action group"  filename      = data.archive_file.forex_api_zip.output_path  handler       = "index.lambda_handler"  runtime       = "python3.12"  # source_code_hash is required to detect changes to Lambda code/zip  source_code_hash = data.archive_file.forex_api_zip.output_base64sha256}

Lastly, we will set the Lambda resource policy using the aws_lambda_permission resource according to the specifications in the documentation:

resource "aws_lambda_permission" "forex_api" {  action         = "lambda:invokeFunction"  function_name  = aws_lambda_function.forex_api.function_name  principal      = "bedrock.amazonaws.com"  source_account = local.account_id  source_arn     = "arn:aws:bedrock:${local.region}:${local.account_id}:agent/*"}

Defining the agent and action group resources

With the dependencies out of the way, we can now define the Terraform resource for the agent with the new aws_bedrockagent_agent resource, which is rather straightforward:

resource "aws_bedrockagent_agent" "forex_asst" {  agent_name              = "ForexAssistant"  agent_resource_role_arn = aws_iam_role.bedrock_agent_forex_asst.arn  description             = "An assisant that provides forex rate information."  foundation_model        = data.aws_bedrock_foundation_model.this.model_id  instruction             = "You are an assistant that looks up today's currency exchange rates. A user may ask you what the currency exchange rate is for one currency to another. They may provide either the currency name or the three-letter currency code. If they give you a name, you may first need to first look up the currency code by its name."}

The action group can be defined in the agent using the aws_bedrockagent_action_group resource. We will need the OpenAPI schema YAML file from the previous blog post, which is included below for reference:

openapi: 3.0.0info:  title: Currency API  description: Provides information about different currencies.  version: 1.0.0servers:  - url: https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@latest/v1paths:  /currencies:    get:      description: |        List all available currencies      responses:        "200":          description: Successful response          content:            application/json:              schema:                type: object                description: |                  A map where the key refers to the lowercase three-letter currency code and the value to the currency name in English.                additionalProperties:                  type: string  /currencies/{code}:    get:      description: |        List the exchange rates of all available currencies with the currency specified by the given currency code in the URL path parameter as the base currency      parameters:        - in: path          name: code          required: true          description: The lowercase three-letter code of the base currency for which to fetch exchange rates          schema:            type: string      responses:        "200":          description: Successful response          content:            application/json:              schema:                type: object                description: |                  A map where the key refers to the three-letter currency code of the target currency and the value to the exchange rate to the target currency.                additionalProperties:                  type: number                  format: float

We will save the file as schema.yaml in the lambda/forex_api directory for the Lambda function, since they somewhat go together. Since we are providing the OpenAPI schema in-line, the Terraform resource can be defined as follows:

resource "aws_bedrockagent_agent_action_group" "forex_api" {  action_group_name          = "ForexAPI"  agent_id                   = aws_bedrockagent_agent.forex_asst.id  agent_version              = "DRAFT"  description                = "The currency exchange rates API"  skip_resource_in_use_check = true  action_group_executor {    lambda = aws_lambda_function.forex_api.arn  }  api_schema {    payload = file("${path.module}/lambda/forex_api/schema.yaml")  }}

Testing the configuration

Now that the full Terraform configuration is developed, we can apply it and make sure that it is working correctly. For me it took less than a minute to complete - here is the output for reference:

aws_iam_role.bedrock_agent_forex_asst: Creating...aws_iam_role.lambda_forex_api: Creating...aws_iam_role.bedrock_agent_forex_asst: Creation complete after 0s [id=AmazonBedrockExecutionRoleForAgents_ForexAssistant]aws_iam_role_policy.bedrock_agent_forex_asst: Creating...aws_bedrockagent_agent.forex_asst: Creating...aws_iam_role.lambda_forex_api: Creation complete after 1s [id=FunctionExecutionRoleForLambda_ForexAPI]aws_lambda_function.forex_api: Creating...aws_iam_role_policy.bedrock_agent_forex_asst: Creation complete after 1s [id=AmazonBedrockExecutionRoleForAgents_ForexAssistant:AmazonBedrockAgentBedrockFoundationModelPolicy_ForexAssistant]aws_bedrockagent_agent.forex_asst: Creation complete after 4s [id=LTR1P1OJUC]aws_lambda_function.forex_api: Still creating... [10s elapsed]aws_lambda_function.forex_api: Creation complete after 14s [id=ForexAPI]aws_lambda_permission.forex_api: Creating...aws_bedrockagent_agent_action_group.forex_api: Creating...aws_lambda_permission.forex_api: Creation complete after 0s [id=terraform-20240430193700768300000002]aws_bedrockagent_agent_action_group.forex_api: Creation complete after 0s [id=W1PDUUCT8P,LTR1P1OJUC,DRAFT]Apply complete! Resources: 7 added, 0 changed, 0 destroyed.

In the Bedrock console, we can see that the agent ForexAssistant is ready for testing. Using the test chat interface, I asked:

What is the exchange rate from US Dollar to Canadian Dollar?

However, I got the following unexpected answer:

I apologize, but I am unable to look up the current exchange rate between US Dollar and Canadian Dollar. There seems to be an issue with the function call format that I am unable to resolve. I cannot provide the exchange rate information you requested.

Looking at the trace, it seems that the agent was not given the tool list and it tried to make up random functions to call, leading to errors:

On closer look, it seems that this is because there are pending changes in the agent which is requires preparation as indicated in the Bedrock console:

This tells me that Terraform is not performing the preparation. In any case, once I click Prepare and ask the same question again in a new session, the agent responds with the currency exchange rate I asked for:

The exchange rate from US Dollar (USD) to Canadian Dollar (CAD) is 1 USD = 1.36660199 CAD.

This is also confirmed in the trace which I will not show for brevity. Now we are one step away from an end-to-end IaC solution for the forex rate assistant, so let's try to address the issue.

Workaround for agent preparation using a null resource

Looking at the Terraform AWS Provider documentation, I couldn't find any resource that supports preparation. As well, the aws_bedrockagent_agent resource and the aws_bedrockagent_action_group resource don't seem to have any argument that controls the preparation behavior. To be fair, the action is implemented as a separate API action called PrepareAgent in the Agents for Bedrock API, which does not directly fit into the resource concept in Terraform.

While I opened an issue in the hashicorp/terraform-provider-aws GitHub repository, one quick workaround I can think of is to use a null resource with the local-exec provisioner to run the equivalent AWS CLI command for the PrepareAgent API, which is the aws bedrock-agent prepare-agent command.

Our objective is to trigger this null resource to be rerun (technically replaced) every time there are changes to the agent, which also extends to the action group. It is inefficient to simply prepare every time you apply the Terraform configuration, and if anything it is just one more moving part that can break. With that in mind, I devised the following resource that serve the purpose well.

resource "null_resource" "forex_asst_prepare" {  triggers = {    forex_asst_state = sha256(jsonencode(aws_bedrockagent_agent.forex_asst))    forex_api_state  = sha256(jsonencode(aws_bedrockagent_agent_action_group.forex_api))  }  provisioner "local-exec" {    command = "aws bedrock-agent prepare-agent --agent-id ${aws_bedrockagent_agent.forex_asst.id}"  }  depends_on = [    aws_bedrockagent_agent.forex_asst,    aws_bedrockagent_agent_action_group.forex_api  ]}

As you can see, I am using the triggers argument in the null resource to control when the resource should be replaced. We target the two main sources of change, which is the agent and the action group. Since trigger requires a string, a good candidate is to use the two resource's state somehow, as long as it doesn't contain any attributes that change every time Terraform is run. To keep the string short, we simply derive the SHA256 checksum from the resource state JSON as the triggers. The local-exec provisioner simply calls the AWS CLI command with the agent ID from aws_bedrockagent_agent.forex_asst.

With this change, we will run terraform destroy and then terraform apply to ensure full validity of the re-test. After Terraform completes successfully, we first check the agent in the Bedrock console to ensure that the Prepare button is no longer shown. As well, we ask our question to hopefully receive an expected result, which we did:

So there you have it, a functional Terraform configuration to deploy a basic forex rate assistant implemented using Agents for Amazon Bedrock!

For reference, I've dressed up the Terraform solution with variables and such, and checked in the final artifacts to the 1_basic directory in this repository. Feel free to check it out and use it as the basis for your Bedrock experimentation.

Current limitations (it's brand new after all)

It is not unexpected that we encounter some issues with brand new features, such as what we encountered in this blog post with the Agents for Amazon Bedrock resources. I myself dove a bit deeper and found a few more issues which I reported. I encourage you to report any issues that you see as you work more with the Terraform resources.

Meanwhile, there are still a couple resources related to Knowledge bases for Amazon Bedrock still under development. I plan to integrate knowledge bases to our forex rate assistant, so I will eagerly wait for the Terraform resources to be ready for my next step in my Bedrock journey.

Summary

In this blog post, we developed the Terraform configuration for the basic forex rate assistant that we created interactively in the blog post Building a Basic Forex Rate Assistant Using Agents for Amazon Bedrock. While we encountered some issues, we were able to work around it as the community continues to build out the features in the Terraform AWS Provider. For now, I will pivot to enhancing the forex rate agent to add new capabilities and to address some of its known shortcomings.

If you like this blog post, please be sure to check out other helpful articles on AWS, Terraform, and other DevOps topics in the Avangards Blog.

Building a Basic Forex Rate Assistant Using Agents for Amazon Bedrock

Anthony Wat — Mon, 29 Apr 2024 17:09:28 GMT

Introduction

With the prevalence of generative AI (gen AI), I've been keeping abreast on AWS' AI offerings for the past while. My journey started with Amazon Q Business, a fully managed service for building gen AI assistants. While the idea is great, it seems to be too basic as it is today and lacks the advanced features to improve the user experience in practice.

I then ventured into the more advanced use cases using Amazon Bedrock and went through many workshops such as Building with Amazon Bedrock and LangChain. The challenge I find is that these workshops still tend to be basic, and they don't answer my questions about complex use cases. I came to learn about agents while going through LangChain literatures, but developing a full workflow felt like a daunting task when my full-time job is DevOps, not software development. Things always seem too simple that it doesn't provide enough business value, or too complex that it becomes too costly.

After attending a recent AWS PartnerCast webinar on building intelligent enterprise apps using gen AI on AWS, I learned about Agents for Amazon Bedrock and some recent new features added to the service. The service seems to be within the Goldilocks zone matching my current skillsets, so I decided to dive heads-first to learn all about it. I decided to build something realistic and figured that I should share my journey with folks in this blog post.

About Agents for Amazon Bedrock

Agents for Amazon Bedrock is a service that enables gen AI applications to execute multi-step tasks across company systems and data sources. It is effectively a managed service for agents and retrieval-augmented generation (RAG), which are common patterns to extend the capabilities of large language models (LLMs).

Agents for Amazon Bedrock assumes the complexity of orchestrating the interactions between different components in such workflows, which must otherwise be programmed into your gen AI application. While you can use frameworks such as LangChain or LlamaIndex to develop these workflows, Agents for Amazon Bedrock makes it much more efficient for common use cases. Agents can also integrate with knowledge bases to enable RAG, as shown in the following diagram from the AWS documentation:

Coming up with a basic but representative use case

To help with brainstorming ideas for an agent, I decided to on these principles:

The idea must be practical and with real-life data.
Follow the KISS principle.

For inspirations on what type of agents I should build, I turned to the Public APIs GitHub repository which has a curated lists of free APIs. I narrowed my search for an API that does not require sign-up or an API key and returns useful information. I ultimately decided to use the Free Currency Exchange Rates API, which seemed promising upon some basic testing.

Naturally, the idea was steered towards a forex rate assistant which helps users look up rates from the API. The API supports lookup by dates, however to keep it simple I decided to limit the lookup to only the latest rates for now. This also leaves some room for enhancing the agent later.

Requesting for model access

Agents for Amazon Bedrock is a relatively new feature, so it is supported only in limited regions with limited model support. At the time of writing this blog post, it is only supported in US East (N. Virginia) (us-east-1 ) and US West (Oregon) (us-west-2) and only supports Anthropic models. We will use the us-west-2 region for our evaluation.

You should also be aware of the pricing for different Anthropic models. With the recent addition of the Claude 3 model family, Haiku emerges as highly competitive with great price-to-performance balance. Thus we will use Haiku as the model for our agent.

When you first use Amazon Bedrock, you must request for access to the models. This can be done in the Amazon Bedrock console using the Model access page which can be opened in the left menu. On that page, you will see the list of base models by vendor and their access status similar to the following:

To request for access, do the following:

Click on the Manage model access button.
On the Request model access page, scroll down to the Anthropic models in the list.
If this is the first time you are request access to Anthropic models, you will be required to submit use case details. Click on the Submit use case details button to open the form, then fill it in as appropriate and click Submit.
Check the box next to the models to which you wish to request access. Since we might compare different Anthropic models, let's check the box next to Anthropic to request access to all of them. Lastly, click Request model access at the end of the page.

The access status should now show "In progress" and the request will only take a few minutes to be approved if all goes well. Once available, the access status should change to "Access granted".

Creating the OpenAPI schema for the currency exchange API

In our agent, we will be using an action group that defines an action that the agent can help the user perform by calling APIs via a Lambda function. Consequently, the action group in our agent requires the following:

An OpenAPI schema that provides the specifications of the API
A Lambda function to which the action group makes API requests

That is also to say, the Lambda function is effectively a "proxy" API that calls the actual APIs, which in our case is the free currency exchange rates API. Based on the API documentation, we know the following:

Since we will only support the latest exchange rate, the base URI for our API would be https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@latest/v1 .
We need to use the /currencies.min.json API, which gets the list of available currencies in minified JSON format. This helps minimize the number of tokens (and thus cost and limit) processed by the model.
We also need to use the /currencies/{code}.min.json API, gets the currency exchange rates with {code} as the base currency.

Since this API does not provide the OpenAPI schema, we need to create it ourselves. I figured that this might be a regular exercise if I start testing Bedrock agents with different APIs, so I started looking for a tool that can generate OpenAPI schema, such as those listed in in OpenAPI.Tools. One category of tools seems to use network traffic, often in the HAR format, to generate the OpenAPI schema. I tried the OpenAPI DevTools which is a Chrome extension, however it did not work for the currency exchange rates API.

After wrestling with it for a bit and eventually giving up, I instead turned to ChatGPT to see if it is smart enough for the task. With my free plan, I asked ChatGPT 3.5 the following:

Can you generate the OpenAPI spec YAML from this API GET URL: https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@latest/v1/currencies.min.json

To my surprise, it did generate a somewhat decent API spec:

While it is not usable as-is because the URL is missing the /v1 part and it is lacking some descriptions, it has almost everything that I need. However, it struck me as odd that the response has uppercase currency code which is NOT what the API returns. So I started a new ChatGPT session and ask the same question, only to get a very different spec:

At this point, I was certain that ChatGPT is not calling the API to generate the spec but rely on what its knowledge to generate an answer. It is probably experiencing hallucination, but it is good enough as a starting point 🤷

I did the same for the other API and adjusted the spec using the Swagger Editor. Specifically, I added detailed descriptions that should help the agent understand the API usages. The resulting OpenAPI YAML file is as follows:

openapi: 3.0.0info:  title: Currency API  description: Provides information about different currencies.  version: 1.0.0servers:  - url: https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@latest/v1paths:  /currencies.min.json:    get:      description: |        List all available currencies      responses:        '200':          description: Successful response          content:            application/json:              schema:                type: object                description: |                  A map where the key refers to the three-letter currency code and the value to the currency name in English.                additionalProperties:                  type: string  /currencies/{code}.min.json:    get:      description: |        List the exchange rates of all available currencies with the currency specified by the given currency code in the URL path parameter as the base currency      parameters:        - in: path          name: code          required: true          description: The three-letter code of the base currency for which to fetch exchange rates          schema:            type: string      responses:        '200':          description: Successful response          content:            application/json:              schema:                type: object                description: |                  A map where the key refers to the three-letter currency code of the target currency and the value to the exchange rate to the target currency.                additionalProperties:                  type: number                  format: float

Creating the agent

Now let's create the agent in the Amazon Bedrock console following the steps below:

Select Agents in the left menu.
On the Agents page, click Create Agent.
In the Create Agent dialog, enter the following information and click Create:
- Name: ForexAssistant
- Description: An assistant that provides forex rate information.

On the Agent builder page, enter the following information and click Save:
- Agent resource role: Create and use a new service role
- Select model: Anthropic, Claude 3 Haiku
- Instructions for the Agent: You are an assistant that looks up today's currency exchange rates. A user may ask you what the currency exchange rate is for one currency to another. They may provide either the currency name or the three-letter currency code. If they give you a name, you may first need to first look up the currency code by its name.

Note that I try to provide concise instructions for the agent to help it reason up front. Depending on the test results, we might need to adjust it later with more prompt engineering.

Creating the action group

While still in the agent builder, we will create the action group that calls our APIs. Let's perform the following steps:

In the Action groups section, click Add.
On the Create Action group page, enter the following information and click Create:
- Enter Action group name: ForexAPI
- Description: The currency exchange rates API
- Action group type: Define with API schemas
- Action group invocation: Quick create a new Lambda function
- Action group schema: Define via in-line schema editor
- In-line OpenAPI schema:*Copy and paste the OpenAPI YAML from previous section*

After 15 seconds or so, you should receive a success message and be returned to the agent builder page. A dummy Lambda function should have been created, so our next step would be to add the logic to call the actual currency exchange rates API.

Updating the Lambda function to call the API

Let's go back into the action group page by clicking on the name of the action group (i.e. ForexAPI) in the list. In the edit page, click on the View button near the Select Lambda function field, which should take you to the function page in the Lambda console.

On the function page, you will see the code template that has been generated for you, which provides some basic processing of the input event and the response event.

After examining the input event format, we will recognize that the attributes that we need to use are:

apiPath, which should provide the path to the API as defined in the OpenAPI YAML (namely /currencies.min.json or /currencies/{code}.min.json).
httpMethod, which should always be get in our case. We thus won't make use of this attribute directly in our example.
parameters, which we need to provide for the rate lookup API which expects the code URI path parameter to be a three-level currency code.

I will spare you the gory details on writing the Lambda function, so here's is the code and some implementation details provided in comments:

import jsonimport urllib.parse # urllib is available in Lambda runtime w/o needing a layerimport urllib.requestdef lambda_handler(event, context):    agent = event['agent']    actionGroup = event['actionGroup']    apiPath = event['apiPath']    httpMethod =  event['httpMethod']    parameters = event.get('parameters', [])    requestBody = event.get('requestBody', {})    # Read and process input parameters    code = None    for parameter in parameters:        if (parameter["name"] == "code"):            # Just in case, convert to lowercase as expected by the API            code = parameter["value"].lower()    # Execute your business logic here. For more information, refer to: https://docs.aws.amazon.com/bedrock/latest/userguide/agents-lambda.html    apiPathWithParam = apiPath    # Replace URI path parameters    if code is not None:        apiPathWithParam = apiPathWithParam.replace("{code}", urllib.parse.quote(code))    # TODO: Use a environment variable or Parameter Store to set the URL    url = "https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@latest/v1{apiPathWithParam}".format(apiPathWithParam = apiPathWithParam)    # Call the currency exchange rates API based on the provided path and wrap the response    apiResponse = urllib.request.urlopen(        urllib.request.Request(            url=url,            headers={"Accept": "application/json"},            method="GET"        )    )    responseBody =  {        "application/json": {            "body": apiResponse.read()        }    }    action_response = {        'actionGroup': actionGroup,        'apiPath': apiPath,        'httpMethod': httpMethod,        'httpStatusCode': 200,        'responseBody': responseBody    }    api_response = {'response': action_response, 'messageVersion': event['messageVersion']}    print("Response: {}".format(api_response))    return api_response

You can copy and paste this code into the editor and click Deploy to update it. At this point, we should test the Lambda function before returning to the Amazon Bedrock console. To do this, you can use the following event template to test the /currencies.min.json API (note that some irrelevant fields are omitted):

{    "messageVersion": "1.0",    "agent": {        "name": "TBD",        "id": "TBD",        "alias": "TBD",        "version": "TBD"    },    "inputText": "TBD",    "sessionId": "TBD",    "actionGroup": "TBD",    "apiPath": "/currencies.min.json",    "httpMethod": "get"}

You should see the success response with the list of currencies:

You can then use the following event template to test the /currencies/{code}.min.json API:

{    "messageVersion": "1.0",    "agent": {        "name": "TBD",        "id": "TBD",        "alias": "TBD",        "version": "TBD"    },    "inputText": "TBD",    "sessionId": "TBD",    "actionGroup": "TBD",    "apiPath": "/currencies/{code}.min.json",    "httpMethod": "get",    "parameters": [        {            "name": "code",            "type": "string",            "value": "usd"        }    ]}

You should see the success response with the list of exchange rates from US dollar to other currencies:

With the Lambda function verified, we can close the Lambda console and return to the Bedrock console to test the agent.

Testing the agent

It is imperative that we test the agent thoroughly to ensure that it provides accurate answers. Back to the agent builder, we need to click on the Prepare button to prepare it, which is required whenever the agent is changed. We can then test the agent using the built-in chat interface to the right of the console using the following prompt:

What is the forex rate from US Dollar to Japanese Yen?

Interestingly, I got the following response from the agent:

Sorry, I do not have the capability to look up the current forex rate from US Dollar to Japanese Yen. I can only provide a list of available currencies, but cannot retrieve the specific exchange rate you requested.

When I was validating the solution from scratch, the agent was able to return the correct answer. This could be caused by the model parameters that affects variability of responses among other things - the model is a bit of a black box after all! If you cannot reproduce this problem, try a few prompt sessions and ask the same question.

This is seemingly implying that the agent only knows of one API but not the other. So we need to troubleshoot the problem, which is where the ever-important trace feature come into play. The trace helps you follow the agent's reasoning that leads it to the response it gives at that point in the conversation.

When we show the trace using the link below the agent response, we can see the traces for each orchestration steps. There are four traces under the Orchestration and knowledge base tab:

Trace step 1 indicates the agent's rationale of first getting the currency code from the list then calling the /currencies/{code}.min.json API to get the rate, which seems correct. It is also able to call the /currencies.min.json API to get the list of currencies to look up the code. So far so good.
Trace step 2 indicates that it was able to get the currency code for US Dollar as USD, however we are not sure why it's in uppercase. It also indicates that get::ForexAPI::/currencies/USD.min.json is not a valid function, which is not true. It is unclear about the logic behind the decision.
Trace step 3 indicates that it is calling the /currencies.min.json API again for whatever reason. Lastly trace step 4 indicates that it cannot get the currency exchange rate and therefore gave up with the response we saw in the chat.

Since LLM is for the most part a black box, unfortunately we likely won't be able to get to the root cause. The only wild guess I could make is that the .min.json part is throwing it off because it doesn't resemble a normal RESTful API, so perhaps we can try to adjust the API specifications to remove that part.

Adjusting the API specs and re-testing

Let's make the adjustment in the OpenAPI YAML by stripping out the .min.json part from both API URLs:

openapi: 3.0.0info:  title: Currency API  description: Provides information about different currencies.  version: 1.0.0servers:  - url: https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@latest/v1paths:  /currencies:    get:      description: |        List all available currencies      responses:        '200':          description: Successful response          content:            application/json:              schema:                type: object                description: |                  A map where the key refers to the three-letter currency code and the value to the currency name in English.                additionalProperties:                  type: string  /currencies/{code}:    get:      description: |        List the exchange rates of all available currencies with the currency specified by the given currency code in the URL path parameter as the base currency      parameters:        - in: path          name: code          required: true          description: The three-letter code of the base currency for which to fetch exchange rates          schema:            type: string      responses:        '200':          description: Successful response          content:            application/json:              schema:                type: object                description: |                  A map where the key refers to the three-letter currency code of the target currency and the value to the exchange rate to the target currency.                additionalProperties:                  type: number                  format: float

This will cause the agent to pass the API URL without the .min.json part to the Lambda function in the event, so we need to add it to the URL before calling the currency exchange rates API in line 27. The resulting Lambda code is thus:

import jsonimport urllib.parse # urllib is available in Lambda runtime w/o needing a layerimport urllib.requestdef lambda_handler(event, context):    agent = event['agent']    actionGroup = event['actionGroup']    apiPath = event['apiPath']    httpMethod =  event['httpMethod']    parameters = event.get('parameters', [])    requestBody = event.get('requestBody', {})    # Read and process input parameters    code = None    for parameter in parameters:        if (parameter["name"] == "code"):            # Just in case, convert to lowercase as expected by the API            code = parameter["value"].lower()    # Execute your business logic here. For more information, refer to: https://docs.aws.amazon.com/bedrock/latest/userguide/agents-lambda.html    apiPathWithParam = apiPath    # Replace URI path parameters    if code is not None:        apiPathWithParam = apiPathWithParam.replace("{code}", urllib.parse.quote(code))    # TODO: Use a environment variable or Parameter Store to set the URL    url = "https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@latest/v1{apiPathWithParam}.min.json".format(apiPathWithParam = apiPathWithParam)    # Call the currency exchange rates API based on the provided path and wrap the response    apiResponse = urllib.request.urlopen(        urllib.request.Request(            url=url,            headers={"Accept": "application/json"},            method="GET"        )    )    responseBody =  {        "application/json": {            "body": apiResponse.read()        }    }    action_response = {        'actionGroup': actionGroup,        'apiPath': apiPath,        'httpMethod': httpMethod,        'httpStatusCode': 200,        'responseBody': responseBody    }    api_response = {'response': action_response, 'messageVersion': event['messageVersion']}    print("Response: {}".format(api_response))    return api_response

Once you updated both, you can prepare and test the agent again. Interestingly, we now get a proper response:

The current forex rate from US Dollar (USD) to Japanese Yen (JPY) is 1 USD = 158.3147925 JPY.

To ensure that the number is from the API and not other sources, we can review the agent's rationale from the trace like before. In trace step 2, we can see the right rationale and the invocation of the /currencies/{code} API with USD as the code parameter (again, not sure why it's in uppercase) as expected:

Hurray, we have successfully build a basic forex rate assistant using Agents for Amazon Bedrock! Naturally, you should test the agent extensively since LLMs are sometimes unpredictable and may require adjustments.

💡

In a follow-up blog post How To Manage an Amazon Bedrock Agent Using Terraform, I provide details on how to automate the deployment of this solution using Terraform. Please feel free to read it or refer directly to the configuration in this repository.

Testing reveals more limitations

For sake of experimentation, let's see what happens when we ask the assistant to do the reverse conversion. We will continue with the conversation in the same chat session and enter the following prompt:

What about the other way around?

The agent responded with the following:

The current forex rate from Japanese Yen (JPY) to US Dollar (USD) is 1 JPY = 0.0063163 USD.

However, a quick check shows that the number is a bit off. The response from https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@latest/v1/currencies/jpy.json (at the time of writing) shows 0.0063165291 which is also what I got from the calculator for 1 / 158.3147925. Again, we will need to review the trace to see what the agent is up to. The trace revealed that the agent is doing an inverse calculation but the calculation is incorrect for some reason:

My expectation is that the agent should do another lookup from the API to get the right number. If the API were developed for a business and has a spread between the two exchange rates for profit, the agent would have given the wrong information. Putting that aside, the calculation is simply wrong.

After doing some reading online, it seems that LLMs in general are bad at math because their design is to predict words and not performing computations. So the exchange right 0.0063163 might just be a predication by Haiku based on the data that it was trained with.

Additional thoughts and summary

While we have built a functional forex rate assistant using Agents for Amazon Bedrock, it is certainly not production grade since it is not super accurate and it is a bit slow. Improving its accuracy is where the bulk of the effort for gen AI lies. AWS recommends the following strategies which developers should sequentially employ to improve their gen AI application:

For instance, my next iteration of improvement could start with adjusting the model inference parameters and prompt engineering, perhaps to ensure that it always calls the API instead of trying to do calculations. We also ought to look at why the LLM provide uppercase currency codes. Prompt engineering is admittedly more of an art and will require many rounds of trial and error, so be prepared for that.

I hope you learn something new from this blog post and has a better understanding of the features, potentials, and limitations of Agents for Amazon Bedrock. We are only scratching the surface here, so you are encouraged to use this forex agent as a start point for more improvements or develop your own agent. You would also need to expose the agent to end-users with a new frontend or an existing application. For me, the next step is to look into how to manage Bedrock agents using Terraform with the hot-off-the-press resources.

If you enjoyed this blog post, please be sure to check out other contents related to AWS and DevOps in the Avangards Blog. Thanks for your time and have fun with gen AI!

How To Manage Amazon GuardDuty in AWS Organizations Using Terraform

Anthony Wat — Tue, 23 Apr 2024 16:54:51 GMT

Introduction

Since I released the blog series How to implement the AWS Startup Security Baseline (SSB) using Terraform recently, I've received some feedback and questions on it. In particular, there were some questions around setting up GuardDuty in an organization using Terraform. Since the configuration involves multiple accounts and there are some quirks with the resources, I decided to write a separate blog post on how to properly implement it with explanation on each step.

About the use case

Amazon GuardDuty is a managed threat detection service that continuously monitors AWS accounts and workloads for malicious or unauthorized activity using machine learning, anomaly detection, and integrated threat intelligence.

GuardDuty supports managing multiple accounts with AWS Organizations via the delegated administrator feature, with which you designate an AWS account in the organization to centrally manage GuardDuty for all members. This is great for managing a multi-account landing zone by centralizing management of GuardDuty settings in a consistent manner.

Since it is increasingly common to establish an AWS landing zone using AWS Control Tower, we will use the standard account structure in a Control Tower landing zone to demonstrate how to configure GuardDuty in Terraform:

The relevant accounts for our use case in the landing zone are:

The Management account for the organization where AWS Organizations is configured. For details, refer to Managing GuardDuty accounts with AWS Organizations.
The Audit account where security and compliance services are typically centralized in a Control Tower landing zone.

The objective is to delegate GuardDuty administrative duties from the Management account to the Audit account, after which all organization configurations are managed in the Audit account. With that said, let's see how we can achieve this using Terraform!

Designating a GuardDuty administrator account

GuardDuty delegated administrator is configured in the Management account, so we need a provider associated with it in Terraform. To keep things simple, we will take a multi-provider approach by defining two providers, one for the Management account and another for the Audit account, using AWS CLI profiles as follows:

provider "aws" {  alias   = "management"  # Use "aws configure" to create the "management" profile with the Management account credentials  profile = "management" }provider "aws" {  alias   = "audit"  # Use "aws configure" to create the "audit" profile with the Audit account credentials  profile = "audit" }

Since GuardDuty is a regional service, you must apply this Terraform configuration on each region that you are using. Consider using the region argument in your provider definition and a variable to make your Terraform configuration rerunnable in other regions.

We can then use the aws_guardduty_organization_admin_account resource to set the delegated administrator. However, I noticed the following in the Audit account:

After this resource is created, GuardDuty will be enabled with both the foundational data sources and all protection plans enabled.
When the resource is deleted, GuardDuty remains enabled.

These side effects are not desirable since we would ideally want full control over the lifecycle and configuration of GuardDuty in Terraform. To address this issue, we will preemptively enable GuardDuty in the Audit account using the aws_guardduty_detector resource. We will also manage the protection plans using the aws_guardduty_detector_feature resource in subsequent steps after we define the org-wide settings.

The resulting Terraform configuration should be defined as follows (pay special attention to the provider argument in each resource):

data "aws_caller_identity" "audit" {  provider = aws.audit}resource "aws_guardduty_detector" "audit" {  provider = aws.audit}resource "aws_guardduty_organization_admin_account" "this" {  provider         = aws.management  admin_account_id = data.aws_caller_identity.audit.account_id  depends_on       = [aws_guardduty_detector.audit]}

With the Audit account designated as the GuardDuty administrator, we can now manage the organization configuration.

Configuring organization auto-enable preferences

GuardDuty distinguishes the foundational data sources settings from the protection plans settings. The former is managed using the aws_guardduty_organization_configuration resource. In our case, we want to manage GuardDuty for all accounts (i.e. both new and existing accounts). The resulting Terraform configuration should thus look like the following:

resource "aws_guardduty_organization_configuration" "this" {  provider                         = aws.audit  auto_enable_organization_members = "ALL"  detector_id                      = aws_guardduty_detector.audit.id  depends_on                       = [aws_guardduty_organization_admin_account.this]}

Next, let's manage the protection plan configuration. For illustration, let's assume that we only want to enable only EKS Audit Log Monitoring. To ensure full configurability, we will define the setting for all protection plans using a variable:

# Terraform configuration (.tf)variable "guardduty_features" {  description = "An object map that defines the GuardDuty organization configuration."  type = map(object({    auto_enable = string    name        = string    additional_configuration = optional(list(object({      auto_enable = string      name        = string    })))  }))}

# Variable definition (.tfvars)guardduty_features = {  s3 = {    auto_enable = "NONE"    name        = "S3_DATA_EVENTS"  }  eks = {    auto_enable = "ALL"    name        = "EKS_AUDIT_LOGS"  }  eks_runtime_monitoring = {    # EKS_RUNTIME_MONITORING is deprecated and should thus be explicitly disabled    auto_enable = "NONE"    name        = "EKS_RUNTIME_MONITORING"    additional_configuration = [      {        auto_enable = "NONE"        name        = "EKS_ADDON_MANAGEMENT"      },    ]  }  runtime_monitoring = {    auto_enable = "NONE"    name        = "RUNTIME_MONITORING"    additional_configuration = [      {        auto_enable = "NONE"        name        = "EKS_ADDON_MANAGEMENT"      },      {        auto_enable = "NONE"        name        = "ECS_FARGATE_AGENT_MANAGEMENT"      },      {        auto_enable = "NONE"        name        = "EC2_AGENT_MANAGEMENT"      }    ]  }  malware = {    auto_enable = "NONE"    name        = "EBS_MALWARE_PROTECTION"  }  rds = {    auto_enable = "NONE"    name        = "RDS_LOGIN_EVENTS"  }  lambda = {    auto_enable = "NONE"    name        = "LAMBDA_NETWORK_LOGS"  }}

The EKS_RUNTIME_MONITORING feature has been superseded by the RUNTIME_MONITORING feature, but to avoid perpetual differences in Terraform configuration, we must set its enablement state to NONE.

We can then use this variable with the for_each meta-argument with the aws_guardduty_organization_configuration_feature resource as follows:

resource "aws_guardduty_organization_configuration_feature" "this" {  provider    = aws.audit  for_each    = var.guardduty_features  auto_enable = each.value.auto_enable  detector_id = aws_guardduty_detector.audit.id  name        = each.value.name  dynamic "additional_configuration" {    for_each = try(each.value.additional_configuration, [])    content {      auto_enable = additional_configuration.value.auto_enable      name        = additional_configuration.value.name    }  }  depends_on = [aws_guardduty_organization_admin_account.this]}

Lastly, we will circle back to re-celebrating the protection plan settings for the Audit account itself. Let's piggyback on the same variable and use the aws_guardduty_detector_feature resource to achieve this:

resource "aws_guardduty_detector_feature" "audit" {  provider    = aws.audit  for_each    = var.guardduty_features  detector_id = aws_guardduty_detector.audit.id  name        = each.value.name  status      = each.value.auto_enable == "NONE" ? "DISABLED" : "ENABLED"  dynamic "additional_configuration" {    for_each = try(each.value.additional_configuration, [])    content {      status = additional_configuration.value.auto_enable == "NONE" ? "DISABLED" : "ENABLED"      name   = additional_configuration.value.name    }  }}

You can find the complete Terraform in the GitHub repository that accompanies this blog post.

With the complete Terraform configuration, you can now apply it to establish the Audit account as the delegated administrator and apply organization settings to all accounts in the target region. Note that it will take up to 24 hours for GuardDuty to automatically enable it in all accounts. YMMV, but it took about 3 hours in the evening in the Eastern time zone.

There is currently an issue where the additional_configuration block order causes differences when applying the Terraform configuration without making any changes.

Caveats about suspending GuardDuty in member accounts

Due to limitations with the GuardDuty Terraform resources, GuardDuty is unfortunately not automatically disabled when you run terraform destroy. Normally this wouldn't be a problem for a production landing zone. However, if you are only testing, this could lead to unexpected costs especially when GuardDuty is a somewhat costly service.

As a workaround, I would recommend using the AWS CLI or AWS SDK to at least suspend GuardDuty for all members using the StopMonitoringMembers API. For your convenience, you can use the following shell script to do so before running terraform destroy:

#!/bin/bash# Note: Make sure that you set the AWS_PROFILE environment variable to "audit" before running the script# Get the GuardDuty detector IDDETECTOR_ID=$(aws guardduty list-detectors --query DetectorIds[0] --output text)# Disable auto-enable organization membersaws guardduty update-organization-configuration --detector-id $DETECTOR_ID --auto-enable-organization-member NONE# Loop through each member account and disable GuardDutyMEMBER_ACCOUNTS=$(aws guardduty list-members --detector-id $DETECTOR_ID --query Members[*].AccountId --output text)for MEMBER_ACCOUNT in $MEMBER_ACCOUNTSdo  echo "Suspending GuardDuty for account $MEMBER_ACCOUNT"  aws guardduty stop-monitoring-members --account-ids $MEMBER_ACCOUNT --detector-id $DETECTOR_IDdone

Summary

In this blog post, you learned how to manage Amazon GuardDuty in AWS Organizations using Terraform. While there are some caveats, this allows you to streamline the setup of a security baseline for your AWS landing zone. The centralized approach to detective security can help you ensure compliance and timely reaction to security incidents.

I hope you found this blog post helpful. If you are interested in this type of content, be sure to check out other blog posts in the Avangards Blog. Thank you and have a great one!

How To Implement AWS SSB Controls in Terraform - Part 4

Anthony Wat — Thu, 11 Apr 2024 04:07:46 GMT

Introduction

The AWS Startup Security Baseline (SSB) defines a set of controls that comprise a lean but solid foundation for the security posture of your AWS accounts. By the end of part 3 of our blog series, we have covered all of the account controls and workload controls that are related to workload access and data protection. In this installment, we will review the remaining workload controls that focus on network security. Let's begin with WKLD.10, which is about keeping private resources secure within private subnets.

WKLD.10 Deploy private resources in private subnets

The workload control WKLD.10 requires that all AWS resources that don't require direct internet access be deployed to a VPC private subnet.

Amazon VPC provides the means to isolate your network and keep traffic from and to workloads in your VPCs secure. For a typical workload, resources such as backend and database systems should be deployed to private subnets. If a private subnet allows outbound access to the internet, its route table should route the to a NAT gateway deployed in a public subnet. This setup is shown in the following diagram:

To prevent accidental assignment of public IP addresses, this control recommends disabling the option for auto-assigning public IP address when a private subnet is created. In Terraform, this option corresponds to the map_public_ip_on_launch argument in the aws_subnet resource. Since the default value is false, omitting the argument in the resource definition will effectively create a private subnet. Here is an example:

# Dependent resources omitted for brevityresource "aws_subnet" "private" {  vpc_id     = aws_vpc.this.id  cidr_block = "10.0.0.0/24"  # Already false by default, no need to set explicitly  # map_public_ip_on_launch = false}

💡

The AWS VPC Terraform module provides abstraction for private subnets to help you manage a secure VPC structure.

Similar option to auto-assign public IP exists for EC2 instances, so it should be disabled when an EC2 instance is provisioned. This setting corresponds to the associate_public_ip_address argument in the aws_instance resource, which not set thus implying false by default. So you can omit the argument to create an EC2 instance as per the following example:

# Dependent resources omitted for brevityresource "aws_instance" "web" {  ami                  = data.aws_ami.ubuntu.id  iam_instance_profile = aws_iam_instance_profile.ssm.name  instance_type        = "t3.micro"  subnet_id            = data.aws_subnet.private.id  # Already false by default, no need to set explicitly  # associate_public_ip_address = false}

WKLD.11 Use security groups to restrict access

The workload control WKLD.11 requests the use of security groups to restrict network access.

Security groups serve as virtual stateful firewalls to control inbound and outbound traffic to the resource it is associated with. A typical baseline would be to allow all outbound traffic, and allow inbound traffic only to trusted sources on specific service ports and protocols. You can further restrict the outbound traffic for more isolation.

To demonstrate how to properly define security groups in Terraform, let's consider this hypothetical LAMP application:

Here are the required inbound rules:

Security group	Purpose	Source	Protocol and port
ALB	HTTPS access from the internet	Internet (0.0.0.0/0)	TCP 443 (HTTPS)
Web server	HTTPS access from the ALB	Security group of the ALB	TCP 443 (HTTPS)
MySQL Instance	MySQL access from web server	Security group of the web server	TCP 3306 (MySQL)

And here are the required outbound rules - note that we assume that the RDS for MySQL DB instance does not require internet access:

Security group	Purpose	Destination	Protocol and port
ALB	Internet access	Internet (0.0.0.0/0)	All
Web Server	Internet access	Internet (0.0.0.0/0)	All

The security groups can be defined in Terraform as follows:

# ALB security groupresource "aws_security_group" "alb" {  name        = "app-prod-sg-use1-alb"  description = "Security group for the ALB"  vpc_id      = aws_vpc.this.id}resource "aws_vpc_security_group_ingress_rule" "alb_https_all" {  security_group_id = aws_security_group.alb.id  cidr_ipv4         = "0.0.0.0/0"  from_port         = 443  ip_protocol       = "tcp"  to_port           = 443}resource "aws_vpc_security_group_ingress_rule" "alb_all" {  security_group_id = aws_security_group.alb.id  cidr_ipv4         = "0.0.0.0/0"  from_port         = -1  ip_protocol       = -1  to_port           = -1}# Web server security groupresource "aws_security_group" "web" {  name        = "app-prod-sg-use1-web"  description = "Security group for the web server"  vpc_id      = aws_vpc.this.id}resource "aws_vpc_security_group_ingress_rule" "web_http_alb" {  security_group_id            = aws_security_group.web.id  referenced_security_group_id = aws_security_group.alb.id  from_port                    = 443  ip_protocol                  = "tcp"  to_port                      = 443}resource "aws_vpc_security_group_ingress_rule" "web_all" {  security_group_id = aws_security_group.web.id  cidr_ipv4         = "0.0.0.0/0"  from_port         = -1  ip_protocol       = -1  to_port           = -1}# MySQL instance security groupresource "aws_security_group" "db" {  name        = "app-prod-sg-use1-db"  description = "Security group for the RDS for MySQL DB instance"  vpc_id      = aws_vpc.this.id}resource "aws_vpc_security_group_ingress_rule" "db_mysql_web" {  security_group_id            = aws_security_group.db.id  referenced_security_group_id = aws_security_group.web.id  from_port                    = 3306  ip_protocol                  = "tcp"  to_port                      = 3306}# Associate these security groups to the resources accordingly

💡

For more tips on managing security groups in Terraform, check out my blog posts 5 Tips to Efficiently Manage AWS Security Groups Using Terraform and Building a Dynamic AWS Security Group Solution With CSV in Terraform.

WKLD.12 Use VPC endpoints to access services

The workload control WKLD.12 recommends the use of VPC endpoints to privately access AWS and other services without traversing the internet.

Some industry and security compliance standards require that networks be isolated without outbound internet access. To facilitate access to AWS and external services without traversing the internet, you can use VPC endpoints.

There are two types of VPC endpoints - gateway endpoints and interface endpoints. Gateway endpoints are available only for Amazon S3 and Amazon DynamoDB, but are otherwise free to use. Interface endpoints support numerous AWS services and external services that are exposed and shared as endpoint services. Both endpoint types support resource policies, however only interface endpoints support security groups since they are deployed as ENIs.

Building upon the scenario above, let's assume that the web application needs to integrate with S3. We can provision a gateway endpoint in Terraform to enable private access as follows:

resource "aws_vpc_endpoint" "s3" {  vpc_id          = aws_vpc.this.id  route_table_ids = [aws_route_table.web.id]  service_name    = "com.amazonaws.us-east-1.s3"  # Defining a permissive resource policy for illustation.  # You can finetune it for more security.  policy = <<-EOT  {    "Version": "2008-10-17",    "Statement": [      {        "Action": "*",        "Effect": "Allow",        "Resource": "*",        "Principal": "*"      }    ]  }  EOT}

Let's now assume that we need to use SSM Session Manager to connect to the web server EC2 instance (see WKLD.06 in part 3 of the blog series for details). We can provision a set of interface endpoints in Terraform as follows:

locals {  ssm_service_names = [    "com.amazonaws.us-east-1.ec2messages",    "com.amazonaws.us-east-1.ssm",    "com.amazonaws.us-east-1.ssmmessages"  ]}resource "aws_security_group" "vpce" {  name        = "app-prod-sg-use1-vpce"  description = "Security group for interface endpoints to AWS services"  vpc_id      = aws_vpc.this.id}resource "aws_vpc_security_group_ingress_rule" "vpce_https_vpc" {  security_group_id = aws_security_group.vpce.id  cidr_ipv4         = aws_vpc.this.cidr_block  from_port         = 443  ip_protocol       = "tcp"  to_port           = 443}# Note: Requires enableDnsSupport and enableDnsHostnames set to true for the VPCresource "aws_vpc_endpoint" "ssm" {  for_each            = toset(local.ssm_service_names)  vpc_id              = aws_vpc.this.id  service_name        = each.key  vpc_endpoint_type   = "Interface"  private_dns_enabled = true  security_group_ids  = [aws_security_group.vpce.id]  # Alternatively you can create a subnet for VPC endpoints  subnet_ids          = [aws_subnet.web.id]}

WKLD.13 Require HTTPS for all public web endpoints

The workload control WKLD.13 mandates the use of HTTPS for all public web endpoints.

The HTTPS protocol provides the level of web security that is considered the norm nowadays, so much so that Google defines the use of HTTPS a ranking factor for their search results and mark websites using HTTP as "not secure" in Chrome. With the prevalence the zero trust and encryption everywhere security approaches, TLS encryption between the load balancers/reverse proxies and backend systems, as well as end-to-end encryption, are also strongly recommended.

AWS Certificate Manager (ACM) integrates with AWS endpoint services including Elastic Load Balancing and Amazon CloudFront. You can either issue a certificate for a domain that you own, or import a certificate that is generated with a third-party provider. For a better experience, you can use the AWS Certificate Manager (ACM) Terraform module to create and validate ACM certificates - refer to the examples for usages. Meanwhile, to import a certificate for which you have the necessary PEM-formatted key and certificate files, you can use the aws_acm_certificate resource as follows:

resource "aws_acm_certificate" "alb" {  private_key       = "${file("private.key")}"  certificate_body  = "${file("cert.cer")}"  certificate_chain = "${file("chain.cer")}"}

Continuing with the scenario above, here is the Terraform configuration that provisions an ALB with an HTTPS listener:

# Dependent resources omitted for brevityresource "aws_lb" "this" {  name               = "app-prod-alb-use1"  internal           = false  load_balancer_type = "application"  security_groups    = [aws_security_group.alb.id]  subnets            = [for subnet in data.aws_subnet.public : subnet.id]}resource "aws_lb_target_group" "web" {  name             = "web"  port             = 443  protocol         = "HTTPS"  target_type      = "ip"  vpc_id           = data.aws_vpc.this.id  health_check {    enabled  = true    matcher  = "200-299"    path     = "/"    protocol = "HTTPS"  }  lifecycle {    create_before_destroy = true  }}resource "aws_lb_target_group_attachment" "web" {  target_group_arn = aws_lb_target_group.web.arn  target_id        = aws_instance.web.id  port             = 443}resource "aws_lb_listener" "this" {  load_balancer_arn = aws_lb.this.arn  protocol          = "HTTPS"  port              = 443  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"  certificate_arn   = aws_acm_certificate.alb.arn  default_action {    type = "forward"    target_group_arn = aws_lb_target_group.web.arn  }}

For CloudFront, you can consider using the AWS CloudFront Terraform module. The complete example demonstrates how to configure a CloudFront distribute with HTTPS. Under the cover, it configures the custom origin and view certificate blocks in the aws_cloudfront_distribution resource.

WKLD.14 Use edge-protection services for public endpoints

The workload control WKLD.14 recommends using an edge-protection service to expose a public endpoint instead of directly through the underlying workload such as an EC2 instance.

Such endpoint services include Elastic Load Balancing and Amazon CloudFront as mentioned, as well as Amazon API Gateway and AWS Amplify Hosting. To provide additional endpoint protection, you can integrate services such as AWS WAF, AWS Network Firewall, and Gateway Load Balancer with a virtual firewall appliance.

Setting up AWS Network Firewall and Gateway Load Balancer can be complex especially in a centralized architecture, so we will save them for a future blog series on network security (hint hint). Resuming the sample scenario in this blog post, let's focus on configuring an AWS WAF web ACL and associating it with the ALB. Here is the Terraform configuration that creates a web ACL with the Core rule set (CRS) managed rule group:

resource "aws_wafv2_web_acl" "regional" {  name  = "app-prod-webacl-use1-regional"  scope = "REGIONAL" # Use GLOBAL for web ACL meant for CloudFront  default_action {    allow {}  }  visibility_config {    cloudwatch_metrics_enabled = true    metric_name                = "appapp-prod-webacl-use1-regional"    sampled_requests_enabled   = true  }  rule {    name     = "AWS-AWSManagedRulesCommonRuleSet"    priority = 0    override_action {      none {}    }    statement {      managed_rule_group_statement {      name        = "AWSManagedRulesCommonRuleSet"      vendor_name = "AWS"      # Use the rule_action_override block to override rule action    }    visibility_config {      cloudwatch_metrics_enabled = true      metric_name                = "AWS-AWSManagedRulesCommonRuleSet"      sampled_requests_enabled   = true    }  }}# Set up logging for analysis, an important part of WAF implementationresource "aws_cloudwatch_log_group" "waf_regional" {  name              = "aws-waf-logs-app-prod-webacl-use1-regional"  retention_in_days = 90}resource "aws_wafv2_web_acl_logging_configuration" "regional" {  log_destination_configs = [aws_cloudwatch_log_group.waf_regional.arn]  resource_arn            = aws_wafv2_web_acl.regional.arn  redacted_fields {    single_header {      name = "authorization"    }  }}# Associate the web ACL with the ALB to enable WAF protectionresource "aws_wafv2_web_acl_association" "regional" {  resource_arn = aws_lb.this.arn  web_acl_arn  = aws_wafv2_web_acl.regional.arn}

The AWS whitepaper Guidelines for Implementing AWS WAF does a great job at explaining how to plan, implement, test, and roll out AWS WAF, so be sure to review it before implementation.

WKLD.15 Use templates to deploy security controls

The final workload control, WKLD.15, recommends using infrastructure-as-code (IaC) and CI/CD pipelines to deploy security controls alongside your AWS resources.

If you are following this blog series, you should already know the benefits of using Terraform to define and deploy your AWS resources and configuration. Other IaC solutions such as AWS CloudFormation, AWS CDK, and Pulumi work the same way but differ in the programming or configuration language.

As you design your IaC templates, consider separating the SSB account controls into its own "stack" while incorporating the workload controls to the "workload stack". This allows a more flexible deployment model and practicing DevOps within application development teams.

Having CI/CD pipelines also helps you build and deploy configuration as soon as changes are made in the code repository. Your CI/CD pipeline can be customized according to your organization's needs, such as validations and approval gates. AWS CodePipeline is a decent choice if you prefer to stay in the AWS ecosystem, or you could also use a third-party solution such as GitHub Actions.

Since you will be providing AWS credentials to the CI/CD pipelines, it is crucial that they are set in a secure manner. For Terraform, the AWS provider documentation explains the different ways of providing AWS credentials to Terraform. You also should use a backend such as the S3 backend to securely store and share your states remotely.

Summary

Congratulations, we made it through the entire set of SSB controls! Over the course of this How to implement the AWS Startup Security Baseline (SSB) using Terraform blog series, we have reviewed every SSB control in detail and see how they can be implemented using Terraform. Keep in mind that these controls are still considered foundational, so you should evolve your security practice as your AWS usage scales and evolves. Frameworks such as the AWS Security Maturity Model can help you define new target states and implement them iteratively.

I thoroughly enjoyed writing this month-long blog series, especially as a new AWS Community Builder. I hope that you also learned something new and interesting. If you like my contents, please be sure to check out other posts in the Avangards Blog. Have a great one!

How To Implement AWS SSB Controls in Terraform - Part 3

Anthony Wat — Mon, 25 Mar 2024 02:24:41 GMT

Introduction

The AWS Startup Security Baseline (SSB) defines a set of controls that comprise a lean but solid foundation for the security posture of your AWS accounts. In part 1 and part 2 of our blog series, we examined how to implement account controls using Terraform. In this installment, we will look at the workload controls that focus on access to your workload infrastructure and protection of your data in AWS. Let's start with WKLD.01, which is about using IAM roles to provide AWS resources with access to other AWS services.

WKLD.01 Use IAM Roles for Permissions

The workload control WKLD.01 requires using IAM roles with all supported compute environments to grant them appropriate permissions to access other AWS services and resources.

The use of temporary or short-term credentials via IAM roles and identity federation is significantly more secure than long-term credentials such as IAM users and access keys, which could cause serious harm if they are compromised. In the case of AWS compute services, the instance typically assumes an IAM role using AWS Security Token Service (AWS STS) to generate temporary credentials to gain access as defined by the role permissions. The following is the list of common AWS compute services and the feature that supports IAM role assumption:

Service	Feature	Terraform resource and argument
Amazon EC2	Instance profile	`aws_iam_instance_profile` resource and `iam_instance_profile` in `aws_instance`
Amazon ECS	Task IAM role	`task_role_arn` in `aws_ecs_task_defintion`
Amazon EKS	IAM roles for service accounts (IRSA)	`iam-role-for-service-accounts-eks` module
Amazon EKS	EKS Pod Identities	`aws_eks_pod_identity_association` resource
AWS App Runner	Instance role	`instance_role_arn` in `aws_apprunner_service`
AWS Lambda	Lambda execution role	`role` in `aws_lambda_function`

In ACCT.04 in part 1 of the blog series, we have already looked at an example that assigns an execution role that allows fetching CloudWatch metrics and sending emails via SES to a Lambda function. It works very similarly for ECS tasks and App Runner services, so we won't provide more examples for brevity.

For EC2 instances, there is an additional step of creating an instance profile that is associated with the target IAM role and attaching it to the EC2 instances. Here is an example that enables SSM for an EC2 instance:

data "aws_ami" "ubuntu" {  most_recent = true  filter {    name   = "name"    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]  }  filter {    name   = "virtualization-type"    values = ["hvm"]  }  owners = ["099720109477"] # Canonical}data "aws_iam_policy" "ssm_managed_instance_core" {  name = "AmazonSSMManagedInstanceCore"}resource "aws_iam_role" "ssm" {  name = "SSMDomainJoinRoleForEC2"  assume_role_policy = jsonencode({    Version = "2012-10-17"    Statement = [      {        Action = "sts:AssumeRole"        Effect = "Allow"        Sid    = ""        Principal = {          Service = "ec2.amazonaws.com"        }      }    ]  })  managed_policy_arns = [    data.aws_iam_policy.ssm_managed_instance_core.arn  ]}resource "aws_iam_instance_profile" "ssm" {  name = aws_iam_role.ssm.name  role = aws_iam_role.ssm.name}resource "aws_instance" "web" {  ami                  = data.aws_ami.ubuntu.id  iam_instance_profile = aws_iam_instance_profile.ssm.name  instance_type        = "t3.micro"  subnet_id            = data.aws_subnet.private.id}

For EKS with IRSA, the iam-role-for-service-accounts-eks submodule of the terraform-aws-iam module can be useful especially if you use the terraform-aws-eks module to manage your EKS resources. Refer to the module documentation as linked above for details.

For EKS with Pod Identities, the blog post AWS EKS: From IRSA to Pod Identity With Terraform by Marco Sciatta provides a decent walkthrough on how to configure it in Terraform.

Don't forget to apply least privilege permissions as a best practice!

WKLD.02 Use Resource-Based Policies

The workload control WKLD.02 recommends using resource-based policies to provide additional access control at the resource level.

In the situation where both types of policies are set, the permission evaluation logic takes the union of allow permissions from the identity-based permissions of a user and the resource-based permission of the resource being access to determine the effective access level. Explicit deny permissions from either policies will take precedence to prevent access as usual.

Resource-based policies are especially effective when combined with policy conditions with global condition context keys such as aws:PrincipalOrgID, which specifies the organization ID of the principal.

Many AWS resources support resource-based policies - here are a few common examples:

Service	Feature	Terraform resource
Amazon S3	Bucket policy	`aws_s3_bucket_policy`
Amazon SQS	Queue policy	`aws_sqs_queue_policy`
AWS KMS	Key policy	`aws_kms_key_policy`

For the full list of AWS services that support resource-based policies, refer to the table in AWS services that work with IAM.

To define resource-based policies in Terraform, let's look at the following example that provisions an S3 bucket that uses a KMS customer-managed key (CMK) with both an S3 bucket policy and a KMS key policy:

data "aws_caller_identity" "this" {}locals {  account_id = data.aws_caller_identity.this.account_id}resource "aws_iam_role" "top_secret_reader" {  name = "TopSecretReaderRole"  assume_role_policy = jsonencode({    Version = "2012-10-17"    Statement = [      {        Action = "sts:AssumeRole"        Effect = "Allow"        Principal = {          AWS = "arn:aws:iam::${local.account_id}:root"        }      }    ]  })  inline_policy {    name = "ListAllMyBuckets"    policy = jsonencode({      Statement = [{        Version  = "2012-10-17"        Action   = "s3:ListAllMyBuckets"        Effect   = "Allow"        Resource = "*"        }      ]    })  }}resource "aws_kms_key" "this" {  description = "KMS key for S3 SSE"}resource "aws_kms_alias" "this" {  name          = "alias/s3_sse_key"  target_key_id = aws_kms_key.this.key_id}resource "aws_kms_key_policy" "this" {  key_id = aws_kms_key.this.id  policy = jsonencode({    Statement = [      {        Sid    = "Enable IAM user and AWS service permissions"        Action = "kms:*"        Effect = "Allow"        Principal = {          AWS = "arn:aws:iam::${local.account_id}:root"        }        Resource = "*"      },      {        Sid = "Allow use of the key"        Action = [          "kms:Encrypt",          "kms:Decrypt",          "kms:ReEncrypt*",          "kms:GenerateDataKey*",          "kms:DescribeKey"        ]        Effect = "Allow"        Principal = {          AWS = "arn:aws:iam::${local.account_id}:role/TopSecretReaderRole"        }        Resource = "*"      }    ]    Version = "2012-10-17"  })}resource "aws_s3_bucket" "this" {  bucket = "top-secret-bucket-${local.account_id}"}resource "aws_s3_bucket_server_side_encryption_configuration" "this" {  bucket = aws_s3_bucket.this.id  rule {    apply_server_side_encryption_by_default {      kms_master_key_id = aws_kms_key.this.key_id      sse_algorithm     = "aws:kms"    }  }}resource "aws_s3_bucket_policy" "this" {  bucket = aws_s3_bucket.this.id  policy = jsonencode({    Statement = [      {        Sid = "Allow listing of the bucket"        Action = [          "s3:ListBucket",          "s3:GetBucketLocation"        ]        Effect = "Allow"        Principal = {          AWS = "arn:aws:iam::${local.account_id}:role/TopSecretReaderRole"        }        Resource = aws_s3_bucket.this.arn      },      {        Sid = "Allow read of objects in the bucket"        Action = [          "s3:GetObject",          "s3:GetObjectVersion"        ]        Effect = "Allow"        Principal = {          AWS = "arn:aws:iam::${local.account_id}:role/TopSecretReaderRole"        }        Resource = "${aws_s3_bucket.this.arn}/*"      }    ]    Version = "2012-10-17"  })}

A role called TopSecretReaderRole has minimal permission assigned via an identity-based policy, so that we can defer access control via resource-based policy. The KMS key policy provides access to use the key for decryption, while the S3 bucket policy provides read-only access to the bucket and its objects that are encrypted with the same KMS key. To test the access, you can first upload a file to the S3 bucket with an IAM user/role that has access, then assume TopSecretReaderRole and verify that you can download the file from the S3 bucket.

Once again, ensure that you practice the least privilege principle when defining your resource-based policies.

WKLD.03 Use Ephemeral Secrets or a Secrets-Management Service

The workload control WKLD.03 recommends using either ephemeral secrets or a secrets-management service for applications in AWS.

There are two main AWS services that support secret management:

Secrets Manager is a comprehensive secrets management service that provides features such as secrets rotation, monitoring, and auditing for compliance. Many AWS services integrate with Secrets Manager to store and retrieve credentials and sensitive data.

Meanwhile, Parameter Store offers a simple option to store secrets alongside other related parameters. While it lacks features such as secrets rotation and have smaller size limit, Parameter Store is free to use and is a great option for storing application and service settings.

💡

It is imperative that you use the S3 backend or Terraform Cloud with appropriate security configuration to store the Terraform state remotely. Remote states are loaded into memory when Terraform runs, so the sensitive data that are stored in the state in plain text would not be persisted locally and risks exposure.

Personally I don't find it natural to manage secrets in Terraform and generally avoid it. I would instead create secrets outside Terraform and use the aws_secretsmanager_secret data source and the aws_secretsmanager_secret_version data source to retrieve secrets for use in resource arguments. For example:

data "aws_secretsmanager_secret" "fsx_init_admin_pwd" {  name = "aws/fsx/my-ontap-fs/initial-admin-password"}data "aws_secretsmanager_secret_version" "fsx_init_admin_pwd" {  secret_id = data.aws_secretsmanager_secret.fsx_init_admin_pwd.id}locals {  # Use local.fsx_init_admin_pwd to set the fsx_admin_password arg of the aws_fsx_ontap_file_system resource  fsx_init_admin_pwd = jsondecode(data.aws_secretsmanager_secret_version.fsx_init_admin_pwd.secret_string)["password"]}

As an additional reference, the AWS prescriptive guidance Securing sensitive data by using AWS Secrets Manager and HashiCorp Terraform provides some general best practices and considerations.

As for Parameter Store, I also take the same approach and would prefer managing secure parameters outside of Terraform and use the aws_ssm_parameter data source to retrieve them for use. Here is an example:

data "aws_ssm_parameter" "fsx_init_admin_pwd" {  name = "/fsx/my-ontap-fs/initial-admin-password"}locals {  # Use local.fsx_init_admin_pwd to set the fsx_admin_password arg of the aws_fsx_ontap_file_system resource  fsx_init_admin_pwd = data.aws_ssm_parameter.slack_token.insecure_value}

WKLD.04 Protect Application Secrets

The workload control WKLD.04 implores that you incorporate checks for exposed secrets as part of your commit and code review processes. This is outside the scope of Terraform, so we will move on to the next control.

WKLD.05 Detect and Remediate Exposed Secrets

The workload control WKLD.05 recommends deploying a solution to detect application secrets in source code.

Amazon CodeGuru Security, a feature of Amazon CodeGuru, is a static application security tool that uses machine learning to detect security policy violations and vulnerabilities. In particular, it can detect unprotected secrets. The service is "enabled" by configuring a CI pipeline for supported platforms, including GitHub, BitBucket, GitLab, and AWS CodePipeline. The third-party solutions require an OIDC provider and an IAM role to be created, which CodeGuru Security provides CloudFormation templates for. So you can either convert them into Terraform for deployment, or deploy them directly in Terraform using the aws_cloudformation_stack resource. Here is an example for GitHub integration:

resource "aws_cloudformation_stack" "codeguru_security_github" {  name         = "codeguru-security-github"  # The template URL is obtained   template_url = "https://codeguru-security-371921485547.s3-accesspoint.us-east-1.amazonaws.com/setup-github.yml"  parameters = {    Repository = "my-org/my-repo"  }  capabilities = ["CAPABILITY_NAMED_IAM"]}

The template URL is obtained in the AWS Management Console by clicking on the Open template in CloudFormation button once you select an integration, as shown in the screenshots below:

WKLD.06 Use Systems Manager Instead of SSH or RDP

The workload control WKLD.06 recommends the use of AWS Systems Manager Session Manager to securely access EC2 instances instead of placing them in a public subnet, or using a jump box or a bastion host.

With Session Manager, a user with the necessary IAM permissions can connect to an EC2 instance (to be precise, the SSM Agent running on the instance) using the browser-based shell in the AWS Management Console or the AWS CLI with the Session Manager plugin. This method only requires outbound traffic to the SSM service endpoints (either through the internet or VPC endpoints), and does not require enabling SSH and RDP traffic from the internet or from a jump box or bastion host. This results in better security and governance.

To enable Sessions Manager on an EC2 instance in Terraform, you need to create an instance profile with the AWS-managed policy AmazonSSMManagedInstanceCore attached and then attach it to the EC2 instance. A basic example is already provided in the WKLD.01 section above, so please refer to that. Note that this example assumes that the EC2 instance is deployed to a private subnet that routes outbound internet traffic to a NAT gateway in the same VPC.

If you prefer that EC2 instances communicate with SSM only within the AWS network, you can define the VPC endpoints required by SSM in the VPC and subnet where the EC2 instances reside. This can be achieved in Terraform using the aws_vpc_endpoint resource as follows:

# Example assumes that data.aws_vpc.this and data.aws_subnet.private are already defineddata "aws_region" "this" {}locals {  region = data.aws_region.this.name}resource "aws_security_group" "ssm_sg" {  name        = "ssm-sg"  description = "Allow TLS inbound To AWS Systems Manager Session Manager"  vpc_id      = data.aws_vpc.this.id  ingress {    description = "HTTPS from VPC"    from_port   = 443    to_port     = 443    protocol    = "tcp"    cidr_blocks = [data.aws_vpc.this.cidr_block]  }  egress {    description = "Allow All Egress"    from_port   = 0    to_port     = 0    protocol    = "-1"    cidr_blocks = ["0.0.0.0/0"]  }}resource "aws_vpc_endpoint" "ssm" {  vpc_id            = data.aws_vpc.this.id  subnet_ids        = [data.aws_subnet.private.id]  service_name      = "com.amazonaws.${local.region}.ssm"  vpc_endpoint_type = "Interface"  security_group_ids = [    aws_security_group.ssm_sg.id  ]  private_dns_enabled = true}resource "aws_vpc_endpoint" "ec2messages" {  vpc_id            = data.aws_vpc.this.id  subnet_ids        = [data.aws_subnet.private.id]  service_name      = "com.amazonaws.${local.region}.ec2messages"  vpc_endpoint_type = "Interface"  security_group_ids = [    aws_security_group.ssm_sg.id,  ]  private_dns_enabled = true}resource "aws_vpc_endpoint" "ssmmessages" {  vpc_id            = data.aws_vpc.this.id  subnet_ids        = [data.aws_subnet.private.id]  service_name      = "com.amazonaws.${local.region}.ssmmessages"  vpc_endpoint_type = "Interface"  security_group_ids = [    aws_security_group.ssm_sg.id,  ]  private_dns_enabled = true}

WKLD.07 Log Data Events for Select S3 Buckets

The workload control WKLD.07 recommends logging data events for S3 buckets that contain sensitive data in CloudTrail.

By default, data events are not captured in a CloudTrail trail and must be explicitly enabled. Since the volume of data events can be high depending on the access pattern of the resource, logging data events can get expensive quickly. It is therefore recommended that you log data events only for resources that contain sensitive data that warrants more scrutiny. Note that CloudTrail can also capture data events for other AWS services as listed in the AWS CloudTrail User Guide.

To demonstrate how to enable CloudTrail data event logging in Terraform, we will extend the basic example from ACCT.07 in part 2 of the blog series and capture data events from an S3 bucket using an advanced event selector. Here is the Terraform configuration:

data "aws_caller_identity" "this" {}data "aws_region" "this" {}locals {  account_id = data.aws_caller_identity.current.account_id  region     =  data.aws_region.this.name}resource "aws_s3_bucket" "top_secret" {  bucket = "top-secret-${local.account_id}-${local.region}"}# Note: Bucket versioning and server-side encryption are not shown for brevityresource "aws_s3_bucket" "cloudtrail" {  bucket = "aws-cloudtrail-logs-${local.account_id}-${local.region}"}resource "aws_s3_bucket_policy" "cloudtrail" {  bucket = aws_s3_bucket.cloudtrail.id  policy = <<-EOT{  "Version": "2012-10-17",  "Statement": [    {      "Sid": "AWSCloudTrailAclCheck",      "Effect": "Allow",      "Principal": {        "Service": "cloudtrail.amazonaws.com"      },      "Action": "s3:GetBucketAcl",      "Resource": "${aws_s3_bucket.cloudtrail.arn}"    },    {      "Sid": "AWSCloudTrailWrite",      "Effect": "Allow",      "Principal": {        "Service": "cloudtrail.amazonaws.com"      },      "Action": "s3:PutObject",      "Resource": "${aws_s3_bucket.cloudtrail.arn}/AWSLogs/${local.account_id}/*",      "Condition": {        "StringEquals": {          "s3:x-amz-acl": "bucket-owner-full-control"        }      }    }  ]}EOT}resource "aws_cloudtrail" "this" {  name                       = "aws-cloudtrail-logs-${local.account_id}-${local.region}"  s3_bucket_name             = aws_s3_bucket.cloudtrail.id  enable_log_file_validation = true  is_multi_region_trail      = true  advanced_event_selector {    field_selector {      field  = "eventCategory"      equals = ["Management"]    }  }  advanced_event_selector {    name = "Log all S3 objects events for the top secret bucket"    field_selector {      field  = "eventCategory"      equals = ["Data"]    }    field_selector {      field = "resources.ARN"      starts_with = ["${aws_s3_bucket.top_secret.arn}/"]    }    field_selector {      field  = "resources.type"      equals = ["AWS::S3::Object"]    }  }}

WKLD.08 Encrypt Amazon EBS Volumes

The workload control WKLD.08 encrypting all Amazon EBS volumes.

AWS provides server-side encryption to many services that store data, so that end-users can practice encryption at rest with minimal effort. For EBS volumes, there are two methods to encrypt EBS volumes:

Enable EBS encryption by default at the regional level.
Enable EBS volume encryption when the EBS volume is created either individually or as part of provisioning an EC2 instance.

In both cases, you can either use the AWS-managed key aws/ebs or provide your own KMS CMK. The latter supports seamless key rotation for added security and compliance.

To enable EBS encryption by default using Terraform, use the aws_ebs_encryption_by_default resource, and optionally the ebs_default_kms_key resource if you wish to use a KMS CMK, as follows:

💡

The KMS key policy is not defined in the Terraform configuration for brevity. Ensure that you define one as per the WKLD.02 section above.

resource "aws_kms_key" "ebs" {  description         = "CMK for EBS encryption"  enable_key_rotation = true}resource "aws_ebs_encryption_by_default" "this" {  enabled = true}# Optional - the AWS-managed key will be used if this resource is not usedresource "aws_ebs_default_kms_key" "this" {  key_arn = aws_kms_key.ebs.arn}

To encrypt the EBS volumes of an EC2 instance, you can set the encrypted argument and optionally the kms_key_id argument in the *_block_device configuration blocks in the aws_instance resource. Likewise, these arguments are also applicable to the aws_ebs_volume resource if you are provisioning EBS volumes individually. Here is a basic example that uses the default AWS-managed key for the root block device and an additional volume:

data "aws_ami" "ubuntu" {  most_recent = true  filter {    name   = "name"    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]  }  filter {    name   = "virtualization-type"    values = ["hvm"]  }  owners = ["099720109477"] # Canonical}resource "aws_instance" "app_server" {  ami           = data.aws_ami.ubuntu.id  instance_type = "t3.large"  subnet_id     = data.aws_subnet.private.id  root_block_device {    volume_type           = "gp3"    volume_size           = 20    delete_on_termination = true    encrypted             = true  }}resource "aws_ebs_volume" "app_server_data" {  availability_zone = "us-east-1a"  type              = "gp3"  size              = 50  encrypted         = true}resource "aws_volume_attachment" "app_server_data" {  device_name = "xvdb"  volume_id   = aws_ebs_volume.app_server_data.id  instance_id = aws_instance.app_server.id}

Note that you cannot update the Terraform configuration directly to encrypt an unencrypted EBS volume. Doing so will force a replacement of the resource and will destroy your data. You must follow a process such as what is described in this AWS re:Post KB article or implement an automated solution to first enable encryption, then reconcile your Terraform configuration.

WKLD.09 Encrypt Amazon RDS Databases

The workload control WKLD.09 requires encrypting all Amazon RDS databases.

Encrypting an RDS DB instance is very similar to the encrypting EBS volumes. You can either use the default AWS-managed KMS key aws/rds or supply a KMS CMK. In Terraform, you can set the storage_encrypted argument and optionally the kms_key_id argument in the aws_db_instance resource. Here is a basic example that uses the default AWS-managed key:

resource "aws_db_instance" "app_db" {  allocated_storage           = 50  db_name                     = "appdb"  engine                      = "mysql"  engine_version              = "8.0"  instance_class              = "db.t3.large"  manage_master_user_password = true  parameter_group_name        = "default.mysql8.0"  skip_final_snapshot         = true  storage_encrypted           = true  username                    = "mysqladm"}

You can only enable storage encryption of an RDS DB instance at the time of creation. To encrypt it afterwards, you must manually create a snapshot, create an encrypted copy of the snapshot, and restore it as a new DB instance. Then you can reconcile your Terraform configuration.

Summary

If you have followed this blog post this far, great job! This was a lot of information, but it is important to fully understand the best practices in controlling access and safeguarding your workload and their data. In the next and last installment of the blog series, we will wrap up with the remaining workload-level controls that focus on network security. Please look forward to it and check out other posts in the Avangards Blog.

How To Implement AWS SSB Controls in Terraform - Part 2

Anthony Wat — Sun, 17 Mar 2024 21:06:50 GMT

Introduction

The AWS Startup Security Baseline (SSB) defines a set of controls that comprises a lean but solid foundation for the security posture of your AWS accounts. In part 1 of our blog series, we examined how to implement account controls related to account-level and identity settings using Terraform. In this installment, we will look at the remaining account controls that focus on both proactive and preventive security and governance measures. Let's begin with ACCT.07, which mandates the CloudTrail log delivery to a protected S3 bucket.

ACCT.07 Log Events

The account control ACCT.07 requires that actions taken by users, roles, and services in your AWS account be recorded using AWS CloudTrail.

CloudTrail enables auditing, security monitoring, and operational troubleshooting by tracking user activity and API usage. Any API event that CloudTrail records can be used as an event source in Amazon EventBridge to trigger various automations. AWS does not charge for the first trail that records management events, making it a cost-effective choice to adopt.

You can create a CloudTrail trail using Terraform with the aws_cloudtrail resource. Since CloudTrail writes events to an S3 bucket, you also need to create one with the appropriate bucket policy. Here is a basic example:

data "aws_caller_identity" "this" {}data "aws_region" "this" {}locals {  account_id = data.aws_caller_identity.current.account_id  region     =  data.aws_region.this.name}# Note: Bucket versioning and server-side encryption are not shown for brevityresource "aws_s3_bucket" "cloudtrail" {  bucket = "aws-cloudtrail-logs-${local.account_id}-${local.region}"}resource "aws_s3_bucket_policy" "cloudtrail" {  bucket = aws_s3_bucket.cloudtrail.id  policy = <<-EOT{  "Version": "2012-10-17",  "Statement": [    {      "Sid": "AWSCloudTrailAclCheck",      "Effect": "Allow",      "Principal": {        "Service": "cloudtrail.amazonaws.com"      },      "Action": "s3:GetBucketAcl",      "Resource": "${aws_s3_bucket.cloudtrail.arn}"    },    {      "Sid": "AWSCloudTrailWrite",      "Effect": "Allow",      "Principal": {        "Service": "cloudtrail.amazonaws.com"      },      "Action": "s3:PutObject",      "Resource": "${aws_s3_bucket.cloudtrail.arn}/AWSLogs/${local.account_id}/*",      "Condition": {        "StringEquals": {          "s3:x-amz-acl": "bucket-owner-full-control"        }      }    }  ]}EOT}resource "aws_cloudtrail" "this" {  name                       = "aws-cloudtrail-logs-${local.account_id}-${local.region}"  s3_bucket_name             = aws_s3_bucket.cloudtrail.id  enable_log_file_validation = true  is_multi_region_trail      = true  advanced_event_selector {    field_selector {      field  = "eventCategory"      equals = ["Management"]    }  }}

If you are using AWS Organizations, you can create an organization trail in the management account to logs events for all accounts. In Terraform, an organization trail can be created by setting the is_organization_trail argument to true for the aws_cloudtrail resource. If you are using AWS Control Tower, a standard organization trail is created automatically when you launch your landing zone. You can import and manage it using Terraform thereafter.

ACCT.08 Prevent Public Access To Private S3 Buckets

The account control ACCT.08 requires the S3 Block Public Access feature to be enabled if public access is not required.

To ensure security by default, AWS enables Block Public Access by default for S3 buckets created on or after April 28, 2023. For S3 buckets that are created earlier, you may still need to enable the feature by yourself. In Terraform, you can use the aws_s3_bucket_public_access_block resource to configure the settings as appropriate. Here is an example that enables block public access for an S3 bucket:

data "aws_caller_identity" "this" {}locals {  account_id = data.aws_caller_identity.current.account_id}resource "aws_s3_bucket" "alb_access_logs" {  bucket = "alb-access-logs-${local.account_id}"}resource "aws_s3_bucket_public_access_block" "alb_access_logs" {  bucket                  = aws_s3_bucket.alb_access_logs.id  block_public_acls       = true  block_public_policy     = true  ignore_public_acls      = true  restrict_public_buckets = true}

If you are confident that all S3 buckets in your account do not require public access, you can also use the aws_s3_account_public_access_block resource to enable public public access at the account level as follows:

resource "aws_account_public_access_block" "alb_access_logs" {  block_public_acls       = true  block_public_policy     = true  ignore_public_acls      = true  restrict_public_buckets = true}

ACCT.09 Delete Unused Resources

The account control ACCT.09 requires that unused resources be deleted or disabled to reduce the opportunity for security issues.

In particular, the default VPC that is automatically created in each AWS account and enabled region should be considered for deletion. Default VPCs are created with public subnets that automatically assign IPv4 addresses, so novice AWS users could inadvertently expose private workloads to the internet. A multi-VPC environment with peering requirements also ought to use a well-defined CIDR allocation scheme other than the default 172.31.0.0/16 range. It is therefore recommended that you delete the default VPCs and create ones that are more thought out as necessary.

In Terraform, there are resources such as aws_default_vpc and aws_default_subnet which can technically be used to delete the default VPC resources with the force_destroy argument set to true. However, you would first have to define these resources in your Terraform configuration to "bring them in" before you can destroy them, making it a two-step process.

Alternatively, you can use the awsutils module from cloudposse to remove the default VPC resources more efficiently. The awsutils module provides the awsutils_default_vpc_deletion resource, which when defined in your Terraform configuration will deletes the default VPC along with the child resources of the VPC in the configured region, for example:

terraform {  required_providers {    awsutils = {      source = "cloudposse/awsutils"    }  }}provider "awsutils" {  region = "us-east-1"}resource "awsutils_default_vpc_deletion" "default" {}

However, it might be more efficient to simply write a shell script to delete the default VPC from all regions instead of using Terraform.

In the case where your AWS environment is created using AWS Control Tower's Account Factory, you can uncheck all regions so that the default VPC is not created in any of them. Account Factory for Terraform (AFT) also has an option to delete the default VPC if you are using that feature to dispense new accounts.

ACCT.10 Monitor Costs

The account control ACCT.10 requires cost monitoring and notification using services such as AWS Budgets.

AWS Budgets allows users to set custom budgets for AWS resource usage and sends notifications when actual or forecasted usage exceeds the budgeted amounts. A budget can be created in Terraform using the aws_budgets_budget resource. Here is an example configuration that creates budgets similar to the Monthly cost budget template:

resource "aws_budgets_budget" "this" {  name         = "My Monthly Cost Budget"  budget_type  = "COST"  limit_amount = "1000"  limit_unit   = "USD"  time_unit    = "MONTHLY"  notification {    comparison_operator        = "GREATER_THAN"    threshold                  = 85    threshold_type             = "PERCENTAGE"    notification_type          = "ACTUAL"    subscriber_email_addresses = ["finance@example.com"]  }  notification {    comparison_operator        = "GREATER_THAN"    threshold                  = 100    threshold_type             = "PERCENTAGE"    notification_type          = "ACTUAL"    subscriber_email_addresses = ["finance@example.com"]  }  notification {    comparison_operator        = "GREATER_THAN"    threshold                  = 100    threshold_type             = "PERCENTAGE"    notification_type          = "FORECASTED"    subscriber_email_addresses = ["finance@example.com"]  }}

Although it is not mentioned in the documentation for this control, I would also recommend that you also configure AWS Cost Anomaly Detection as an additional safeguard for cost overruns. This feature uses machine learning models to detect and alert on anomalous spending patterns in your deployed AWS services. You can create a cost monitor using the aws_ce_anomaly_monitor resource and subscriptions using the aws_ce_anomaly_subscription resource in Terraform. The following is an example that sets up a cost monitor and a daily summary alert when the cost is 50% above the expected spend.

resource "aws_ce_anomaly_monitor" "service" {  name              = "AWSServiceMonitor"  monitor_type      = "DIMENSIONAL"  monitor_dimension = "SERVICE"}resource "aws_ce_anomaly_subscription" "service_daily" {  name      = "DAILYSUBSCRIPTION"  frequency = "DAILY"  monitor_arn_list = [    aws_ce_anomaly_monitor.service.arn  ]  subscriber {    type    = "EMAIL"    address = "finance@example.com"  }  threshold_expression {    dimension {      key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"      match_options = ["GREATER_THAN_OR_EQUAL"]      values        = ["50"]    }  }}

ACCT.11 Enable GuardDuty

The account control ACCT.11 recommends enabling Amazon GuardDuty to continuously monitor for malicious and unauthorized behavior to help protect against threats.

To enable GuardDuty in Terraform, use the aws_guardduty_detector resource to enable the service and the aws_guardduty_detector_feature resource to enable individual features. The following is a full example that enables all available protection features:

GuardDuty is a regional service and thus must be enabled in each region that you are using.

resource "aws_guardduty_detector" "this" {  enable                       = true  finding_publishing_frequency = "SIX_HOURS"}resource "aws_guardduty_detector_feature" "s3" {  detector_id = aws_guardduty_detector.this.id  name        = "S3_DATA_EVENTS"  status      = "ENABLED"}resource "aws_guardduty_detector_feature" "eks" {  detector_id = aws_guardduty_detector.this.id  name        = "EKS_AUDIT_LOGS"  status      = "ENABLED"}resource "aws_guardduty_detector_feature" "runtime" {  detector_id = aws_guardduty_detector.this.id  name        = "RUNTIME_MONITORING"  status      = "ENABLED"  additional_configuration {    name   = "EKS_ADDON_MANAGEMENT"    status = "ENABLED"  }  additional_configuration {    name   = "ECS_FARGATE_AGENT_MANAGEMENT"    status = "ENABLED"  }}resource "aws_guardduty_detector_feature" "malware" {  detector_id = aws_guardduty_detector.this.id  name        = "EBS_MALWARE_PROTECTION"  status      = "ENABLED"}resource "aws_guardduty_detector_feature" "rds" {  detector_id = aws_guardduty_detector.this.id  name        = "RDS_LOGIN_EVENTS"  status      = "ENABLED"}resource "aws_guardduty_detector_feature" "lambda" {  detector_id = aws_guardduty_detector.this.id  name        = "LAMBDA_NETWORK_LOGS"  status      = "ENABLED"}

If you have a multi-account landing zone that uses AWS Organizations or AWS Control Tower, you can use the aws_guardduty_organization_admin_account resource, the aws_guardduty_organization_configuration resource, and the aws_guardduty_organization_configuration_feature resource to configure GuardDuty at the organization level. Here is an example that configures GuardDuty with a delegated administrator to the Audit account (which you generally find in a Control Tower landing zone) and auto-enable GuardDuty for all member accounts in the organization:

provider "aws" {  alias   = "management"  profile = "management"}provider "aws" {  alias   = "audit"  profile = "audit"}data "aws_caller_identity" "audit" {  provider = aws.audit}locals {  audit_account_id = data.aws_caller_identity.audit.account_id}resource "aws_guardduty_organization_admin_account" "this" {  admin_account_id = local.audit_account_id}resource "aws_guardduty_detector" "audit" {  provider = aws.audit  enable   = true}resource "aws_guardduty_organization_configuration" "audit" {  provider                         = aws.audit  auto_enable_organization_members = "ALL"  detector_id                      = aws_guardduty_detector.audit.id}resource "aws_guardduty_organization_configuration_feature" "audit_s3" {  provider    = aws.audit  auto_enable = "ALL"  detector_id = aws_guardduty_detector.audit.id  name        = "S3_DATA_EVENTS"}

💡

Based on the feedback I received, I have written a new blog post How To Manage Amazon GuardDuty in AWS Organizations Using Terraform with more details on this topic. Feel free to read it if you are interested.

As an alternative to auto enablement, you can use the aws_guardduty_member resource to add GuardDuty members individually and use the aws_guardduty_invite_accepter resource at the member account to accept the invitation. Since the fully automated method is preferred, we won't go through an example for these resources.

GuardDuty is a costly service, so make sure that you review the pricing and understand how much each feature costs. You should leverage the 30-day free trial period to estimate the GuardDuty cost. This allows you to weigh the cost against the risks and requirements before deciding whether to enable GuardDuty and which protections to enable.

ACCT.12 Monitor High-Risk Issues

The account control ACCT.12 recommends using AWS Trusted Advisor to scan for high-risk or high-impact issues related to security, performance, cost, and reliability.

If you do not have a Business Support Plan or higher, you are only eligible for some basic security checks and service limit checks in Trusted Advisor. The Trusted Advisor API also cannot be used to enable automation such as refreshing check results and custom notification. So Trusted Advisor is restrictive and frankly not very useful at the free tier.

There is no Terraform resource for interacting with Trusted Advisor, so you need to configure notification in the AWS Management Console. There is however a trusted-advisor-refresh module that helps refresh the Trusted Advisor check results more often than the automatic one-week schedule if you have access to the Trusted Advisor API with a higher-tier Support Plan.

If you are using AWS Organizations or AWS Control Tower, you can enable organizational view in the management account. However the Support Plan requirement still applies to each member account, that is, you will not receive any additional checks in member accounts that do not have Business Support Plan or higher.

Summary

In this second blog post of the series How to implement the AWS Startup Security Baseline (SSB) using Terraform, we examined the remaining account-level controls and explained how you can implement them using Terraform. In the next installment, we will focus on the the workload-level controls for complete coverage. Please continue to follow the series and check out other posts in the Avangards Blog.

How To Implement AWS SSB Controls in Terraform - Part 1

Anthony Wat — Sun, 17 Mar 2024 21:06:23 GMT

Introduction

In my AWS consultant role, I have helped numerous organizations with building a landing zone or reviewing their existing AWS environment against best practices and recommend remediations, with a focus on security. Many of them happen to be in the early stage of adopting AWS due to a recent migration or starting a new business. These companies look to us in guiding them towards best practices when they have limited time and resources to learn on their own.

As a big fan of frameworks and prescriptive guidance from AWS, I often spend hours browsing through them for inspirations to improve my deliverables. Early on, I came across the AWS Security Baseline (SSB) which I found immensely suitable for the profile of customers I work with. I then started incorporating it into my toolset, particularly the Terraform templates that I deploy landing zones with. Given the effectiveness of the framework, I wanted to share my experience with the AWS community and explain how you too can add these security controls into your Terraform configuration.

Since I will be going through each control in details, I organized the write-up into a blog series consisting of four blog posts so that it is easier to follow. The first two will cover the account controls, while the latter two will cover the workload controls. With that said, let's first look at what the AWS SSB is about.

What is the AWS Startup Security Baseline (SSB)?

The AWS Startup Security Baseline (SSB) comprises a collection of controls designed to establish a foundational level of security for startups and organizations in early phase of adopting AWS, while maintaining their agility on AWS. There are two types of controls in the AWS SSB:

Account controls, which help keep your AWS account secure.
Workload controls, which help secure your workloads and data in the cloud.

These controls can be considered "low-hanging fruits" that are easy to implement but provides a decent level of security, which is perfect for organizations that are still navigating through the intricacy of managing an AWS environment and figuring out an operational model. These controls also carry over to a multi-account model as your organization's AWS usage matures.

If you practice DevOps and leverage Infrastructure-as-Code (IaC), you'd be please to learn that these controls can easily be implemented in IaC including Terraform which this blog series focuses on. We will dive into each control and how it can be defined in Terraform configurations. Let's start with the first account control ACCT.01, which covers setting account-level contacts to valid email distribution lists.

ACCT.01 Set Account-Level Contacts

The account control ACCT.01 requires that account-level primary and alternate contacts be set, ideally with email distribution lists.

Setting contacts is generally a one-time task, so it is often done in the AWS Management Console. However, it can also be done easily in Terraform using the aws_account_primary_contact resource and the aws_account_alternate_contact resource. The following example demonstrates how to set the primary contact and the alternate billing contact:

# Please don't fact check the contact info!resource "aws_account_primary_contact" "this" {  address_line_1     = "742 Evergreen Terrace"  city               = "Springfield"  company_name       = "Mr. Plow"  country_code       = "US"  district_or_county = "Pressboard Estates"  full_name          = "Homer Simpson"  phone_number       = "+16365553226"  postal_code        = "49007"  state_or_region    = "NT"  website_url        = "https://www.mrplow.com"}resource "aws_account_alternate_contact" "billing" {  alternate_contact_type = "BILLING"  name                   = "Marge Simpson"  title                  = "CFO"  email_address          = "finance@mrplow.com"  phone_number           = "+16365553226"}

💡

For a multi-account landing zone with many member accounts, consider using plus addressing (a.k.a. sub-addressing) to reduce the number of required email distribution list. You can either define plus addressing by account (for example, awsaccount1+billing@example.com, awsaccount1+security@example.com, etc.) or by function (for example, awsbilling+account1@example.com, awsbilling+account2@example.com, etc.)

ACCT.02 Restrict Use of the Root User

The account control ACCT.02 requires that the root user be put away from further use after all initial account setup activities are completed.

Root user configuration is typically done in the AWS Management Console, therefore we won't be using Terraform. You can refer to Root user best practices for your AWS account in the IAM User guide for more information. You may create additional administrative users for day-to-day use while following the controls that are described below.

ACCT.03 Configure Console Access

The account control ACCT.03 recommends using temporary credentials to grant access to AWS accounts and resources.

Most AWS users will just use IAM initially to manage access to their AWS environment. While there are resources to manage IAM users in Terraform, it is certainly not the best approach. IAM users are stateful resources which end-users will modify themselves (for example, changing password or adding an virtual MFA device). These out-of-band changes will cause perpetual changes in Terraform unless you leverage the ignore_changes lifecycle meta-argument, which is a chore to configure. As well, Terraform stores sensitive state data in plain text, which as you know is not very secure. For these reasons, it is recommended to leverage identity federation and single sign-on to manage users to your AWS environment.

💡

As your organization grows, you will find a multi-account architecture and identity federation to be increasingly valuable for better workload and environment separation. Consider using AWS Organizations and AWS IAM Identity Center even for your single account environment. You can create a new organizations and enroll your existing account as the member, then enable single sign-on (SSO) with IAM Identity Center. You can also start with AWS Control Tower or convert your organization into an AWS Control Tower landing zone later.

With that said, you can use the aws_iam_user resource to create the user if you must. For example:

resource "aws_iam_user" "john_doe" {  name = "john_doe"}

You can then use the aws_iam_user_login_profile resource to create the initial password:

resource "aws_iam_user_login_profile" "john_doe" {  user    = aws_iam_user.john_doe.name}output "john_doe_password" {  # encrypted_password is the base64-encoded password  # It provides a tad more security than the password attribute  value = aws_iam_user_login_profile.john_doe.encrypted_password}

You will also need to assign IAM policies to the user (or better yet, to a group to which the user is then assigned) as described in the next control.

ACCT.04 Assign Permissions

The account control ACCT.04 requires user permissions to be configured by assigning policies to their IAM identity following the least privilege principle.

IAM policies are fundamental constructs that define what a principal can access within an AWS environment. There are two types of IAM policies - AWS-managed and user-managed. In Terraform, AWS-managed policies can be used via the aws_iam_policy data source, for instance:

data "aws_iam_policy" "lambda_basic_exec_role" {  name = "AWSLambdaBasicExecutionRole"}

You can also use this data source to retrieve IAM policies that are created outside of your Terraform configuration. Meanwhile, managed policies can be created with the aws_iam_policy resource. The following is an example of a policy for a hypothetical Lambda function that processes some CloudWatch metrics and sends an email report via SES:

resource "aws_iam_role_policy" "cw_stats_email_lambda_exec" {  name        = "CWStatsEmailLambdaExecutionPolicy"  description = "Grants permissions to the CWStatsEmail Lambda function."  policy = jsonencode({    Version = "2012-10-17"    Statement = [      {        Action = [          "cloudwatch:GetMetricStatistics"        ]        Effect     = "Allow"        "Resource" = "*"      },      {        Action = [          "ses:SendEmail",          "ses:SendRawEmail"        ]        Effect     = "Allow"        "Resource" = "*"      }    ]  })}

How you use these IAM policies thereafter depends on the IdP. If you are using IAM directly, the best practice is to assign the policies to IAM groups or IAM roles as mentioned earlier. The assignment can be done in Terraform with the aws_iam_group_policy_attachment resource and the aws_iam_role_policy_attachment resource respectively. Using the same example above, the Terraform configuration below creates the Lambda execution role and attaches both the AWS-managed policy AWSLambdaBasicExecutionRole and the user-managed policy CWStatsEmailLambdaExecutionPolicy to the role:

data "aws_caller_identity" "this" {}data "aws_region" "this" {}locals {  account_id = data.aws_caller_identity.current.account_id  region     =  data.aws_region.this.name}resource "aws_iam_role" "cw_stats_email_lambda_exec" {  name        = "CWStatsEmailLambdaExecutionRole"  description = "Execution role for the CWStatsEmail Lambda function."  assume_role_policy = jsonencode({    Version = "2012-10-17"a     Statement = [      {        Action = "sts:AssumeRole"        Effect = "Allow"        Principal = {          Service = "lambda.amazonaws.com"        }        Condition = {          StringEquals = {            "aws:SourceAccount" = "${local.account_id}"          }          ArnLike = {            "aws:SourceArn" = "arn:aws:lambda:${local.region}:${local.account_id}:function:*"          }        }      }    ]  })  managed_policy_arns = [data.aws_iam_policy.lambda_basic_exec_role.arn]}resource "aws_iam_role_policy_attachment" "cw_stats_email_lambda_exec" {  policy_arn = aws_iam_policy.cw_stats_email_lambda_exec.arn  role       = aws_iam_role.cw_stats_email_lambda_exec.name}

If you are integrating an external OIDC or SAML IdP to IAM directly, the federated principal will also use IAM roles to access the AWS environment. The process to create IAM roles for this scenario is similar to the above, but you would need to adjust the trust policy (the assume_role_policy attribute) accordingly.

If you have set up identity federation using IAM Identity Center, permissions are assigned as IAM policies to permissions sets, which are then associated with users and groups for accounts in the organization. Permissions can be created in Terraform using the aws_ssoadmin_permission_set resource typically in the management account of the organization. Here is an example for setting up a permission set for network administrators:

data "aws_ssoadmin_instances" "" {}data "aws_iam_policy" "network_admin" {  name = "NetworkAdministrator "}resource "aws_ssoadmin_permission_set" "network_admin" {  name             = "MyOrgNetworkAdministrator"  description      = "Grants full access permissions to AWS services and actions required to set up and configure AWS network resources."  instance_arn     = tolist(data.aws_ssoadmin_instances.this.arns)[0]}resource "aws_ssoadmin_managed_policy_attachment" "network_admin" {  instance_arn       = tolist(data.aws_ssoadmin_instances.this.arns)[0]  managed_policy_arn = aws_iam_policy.network_admin.arn  permission_set_arn = aws_ssoadmin_permission_set.network_admin.arn}

The permission set can then be assigned to an IAM Identity Center user or group for an account in the organization using aws_ssoadmin_account_assignment resource.

ACCT.05 Require MFA

The account control ACCT.05 requires MFA to be enabled for AWS account access especially for long-term user credentials as a security best practice.

IAM users can enable MFA devices via the AWS Management Console given the appropriate permissions. For more information see Using multi-factor authentication (MFA) in AWS in the IAM User Guide.

In terms of Terraform, although the aws_iam_virtual_mfa_device resource can be used to provision an IAM virtual MFA device, user still needs to associate an actual device to their IAM user with the provided attributes base_32_string_seed or qr_code_png. Terraform is also not a particularly appropriate choice for managing IAM users due to the aforementioned security and procedural implications, so it's best left for separate management with a process that fits the IT security requirements of your organization.

If you have set up identity federation, MFA should be managed in the centralized IdP instead. For example, AWS IAM Identity Provider has MFA support, and you can expect enterprise-grade IdP such as Microsoft Entra ID to have advanced MFA capabilities and additional features such as Conditional Access.

ACCT.06 Enforce a Password Policy

The account control ACCT.06 requires that passwords adhere to a strong password policy, ideally one that aligns with the Center for Internet Security (CIS) Password Policy Guide, to help prevent discovery through brute force or social engineering.

The password policy for IAM users are set on the account. The following is a mapping of the CIS Password Policy Guide recommendations to the IAM password policy settings. You may tune them as you see fit.

	Password Only	With MFA
Password minimum length	14	8
Require at least one uppercase letter from the Latin alphabet (A-Z)	Yes	Yes
Require at least one lowercase letter from the Latin alphabet (a-z)	Yes	Yes
Require at least one number	Yes	Yes
Require at least one non-alphanumeric character	Yes	Yes
Password expires in `n` day(s)	365	365
Allow users to change their own password	Yes	Yes
Prevent password reuse from the past `n` changes	5	5

💡

The CIS Password Policy Guide argues that password composition requirements are not effective because it leads to users choosing predictable patterns that are prone to dictionary attacks for convenience. However, there is also no common standard. Since this is an opinionated, I chose to follow the default settings in IAM.

To configure the account password policy in Terraform, use the aws_iam_account_password_policy resource as follows:

resource "aws_iam_account_password_policy" "this" {  allow_users_to_change_password = true  max_password_age               = 365  minimum_password_length        = 8  password_reuse_prevention      = 5  require_lowercase_characters   = true  require_numbers                = true  require_symbols                = true  require_uppercase_characters   = true}

If you have set up identity federation, the password policy should primarily be managed in the centralized IdP. That being said, it never hurts to update the IAM account password policy as above just in case.

💡

As for enforcing MFA, I would recommend adopting a detective approach using services such as AWS Trusted Advisor (for root account MFA) or Amazon Security Hub with a standard that includes rules such as [IAM.5] MFA should be enabled for all IAM users that have a console password and [IAM.9] MFA should be enabled for the root user.

Summary

In this first blog post of the series How to implement the AWS Startup Security Baseline (SSB) using Terraform, we examined the account-level controls that pertain to account and identity and explained how you can implement them using Terraform. In the next installment, we will focus on the remaining account-level controls. Please continue to follow the series and check out other posts in the Avangards Blog.

PSA: Review Your AWS Public IPv4 Address Usage To Avoid New Charges

Anthony Wat — Wed, 06 Mar 2024 01:50:02 GMT

Introduction

As you might already know, the new charge for public IPv4 addresses has finally taken effect on February 1, 2024. All public IPv4 addresses, both idle and in use, now incurs a charge of US$0.005 per hour or US$3.65 per month. Many AWS users have previously enjoyed free public IPv4 addresses with running resources and thus paid less attention on the usage. However, the new cost could accumulate significantly if not monitored closely.

With this blog post, I hope to raise awareness of the impact from the new charge and reiterate some strategies that will help you review and reduce public IPv4 address usages. Let's first start by reviewing your environment with a convenient tool.

Review your public IPv4 address usage with public IP insights

As recommended in the AWS blog post New AWS Public IPv4 Address Charge + Public IP Insights, the Amazon VPC IP Address Manager (IPAM) provides a feature that helps you analyze your public IPv4 address usage. The public IP insights feature helps you visualize your usage by address type and by service for your AWS account (or your organization if configured).

Create a new IPAM

Getting started in the AWS Management Console is easy. You first need to open Amazon VPC IP Address Manager and create an IPAM for the first time. The following settings are recommended if you plan to use only the public IP insights feature for a consolidated view:

IPAM tier: Free Tier
Operating regions: Add all regions

After the IPAM is created, you can click Public IP insights from the left menu where you will be asked to wait for IPAM to populate your data. In my case, it took about 10 to 15 minutes for the data to show up for my single-account environment.

Review the public IP types and EIP usage charts

The Public IP insights page provides two donut charts to help you quickly identify how public IPv4 addresses are used in your environment. For example, here is a screenshot for my environment:

On the left, there is a donut chart that shows public IPv4 address allocation by type. The most relevant types are:

Service managed IPs - Public IPv4 addresses provisioned and managed by an AWS service.
Amazon-owned EIPs - Elastic IP (EIP) addresses that you have provisioned or assigned to resources in your AWS account.
EC2 public IPs - Public IPv4 addresses assigned to EC2 instances when the instances were launched into a default subnet or into a subnet thats been configured to automatically assign a public IPv4 address.

For my environment, I have one Amazon-owned EIP for a NAT gateway and two service managed IPs for an Application Load Balancer (ALB) that spans two AZs. As another example, I could also have an EC2 public IP which is associated with a bastion host deployed in the public subnet that is configured to automatically assign public IPv4 addresses.

Whereas the donut chart on the right indicates EIP usage by association status. Of particular interest are unassociated Amazon-owned EIPs which would incur a charge. These would be obvious candidates to delete, unless you have a reason to preserve the static IP address. I don't have any unassociated EIPs in my environment, thus the KPI shows 0 in the "donut hole".

Review the public IP address details

For more details pertaining to each assigned public IPv4 address, the Public IP addresses table at the bottom of the page comes in handy. As an example, here is a screenshot for my environment:

The table shows plenty of details that could help me identify how IPv4 addresses are allocated, so that I have a rough idea on where I stand.

Common IPv4 address usage optimization strategies

Once you know your public IPv4 address usage, you can do a bit of analysis and devise a mitigation plan to optimize cost. Below are some common scenarios and mitigation strategies that are worth reiterating.

Remove unassociated EIPs

Courtesy of Captain Obvious, if the EIP usage chart shows a non-zero number, you should review the unassociated EIPs and remove them if they are not needed. Even though you'd still have been charged before the new charge took effect, as a cost optimization best practice these EIPs should be unregistered unless you have legitimate reasons to retain them.

Use SSM Sessions Manager to connect to EC2 instances

AWS Systems Manager (SSM) Sessions Manager provides a secure and auditable means to connect to your EC2 instances via SSH or RDP within the AWS Management Console. Once an EC2 instance is configured with the proper outbound security group rules and IAM instance provide to allow SSM access, Sessions Manager can be used for shell access without a public IP address. This results in better security since your EC2 instance can reside in a private subnet, and it saves you from either assigning a public IP address to the instance or needing a bastion host (or jump box) which itself requires a public IP address.

Centralize inbound and outbound endpoint services

Depending on your architecture, you may be able to consolidate inbound endpoint services for your workloads. Both ALB and NLB supports multiple listeners and rules, while Amazon CloudFront can route to private ALBs and EC2 instances via AWS Global Accelerator.

Meanwhile, outbound internet access should be enabled using NAT gateways as per AWS best practices, instead of deploying all resources in public subnets and assigning public IP addresses.

It is important to note that these endpoint services consume IP addresses themselves on a per-AZ basis, so it is imperative that you deploy them optimally based on the actual high availability needs (for example, a 2-AZ setup may be sufficient to meet your SLA).

Disable public IPv4 address auto-assignment in your public subnets

In the past, public IPv4 addresses that are automatically assigned by public subnets are free, so many users may have enabled this feature out of convenience. Now that there is an associated cost, it is better to disable this option and assign EIPs on a case-by-case basis instead. You can check and disable the option in the (public) subnet settings as per the screenshot below:

As part of this, you may also identify EC2 instances that actually do not need to be in a public subnet, in which case you can consider moving them to a private subject (expect some work for this effort).

Consider using IPv6 instead (with caveats)

The main motivation behind the new IPv4 address usage fees is to discourage users from using the increasing-scarce IPv4 addresses, and instead start using IPv6 addresses. While this is a logical step forward, however not all AWS services can work exclusively with IPv6 addresses, nor are all end-users and applications ready for it. For example, Elastic Load Balancing still requires a dual-stack setup if IPv6 is required, and RDS requires IPv4 to operate. That being said, it is never too early to look into leveraging IPv6 as it will be an inevitable move as the world continues to be digitalized and exhausts the IPv4 address pool. For more details, refer to the AWS whitepaper IPv6 on AWS.

Summary

In this blog post, we outlined a general process to review your IPv4 address usages and different optimization methods so that you don't get charged unnecessarily. Many AWS users may still be unaware of this pricing change (I know some of my customers are), so I would appreciate if you can spread the word and send them a link to this blog post. With any luck, you can earn some brownie points or a beer for a fellow AWS user by saving them a few bucks :)

If you like this article, please check out my other blog posts for more helpful and intriguing content on AWS and DevOps. Thank you for reading and have a great one!

Working With Time Zones in an AWS Lambda Python Function

Anthony Wat — Sat, 02 Mar 2024 07:52:26 GMT

Introduction

Lately I have been working on automation solutions for sending summary reports of CloudWatch metrics. After figuring out how to best deploy an AWS Lambda layer, development was smooth sailing until I encountered an issue related to time zones. While the report, which contains hourly statistics for the past 24 hours, had the correct information when run on my laptop, the displayed hours were off by a few hours when tested as a Lambda function deployed in AWS. It was fortunately not difficult to resolve with some reading, but I figured that I could share a couple quick tips in this blog post to save a fellow engineer a few minutes.

Tip 1: Set the TZ environment variable for the Lambda function

As per the Using Lambda environment variables topic in the AWS Lambda Developer Guide, you can control the Lambda runtime time zone by setting the unreserved environment variable TZ. By default, the TZ environment variable is set to UTC, however you can set it to a standardized Internet Assigned Numbers Authority (IANA) time zone. For example, my customer resides in the East coast, so I set TZ to America/New_York:

💡

In Terraform, you can set environment variables using the environment configuration block in the aws_lambda_function resource.

After this, the Lambda function generates the report with the hours matching the customer's time zone. This is the simplest way of changing the time zone without code.

Tip 2: Use the datetime and zoneinfo modules properly

If you need to change the time zone of a datetime object (for example, one that is returned by the datetime.now() function), you should use the datetime.stimezone() function similar to the example below:

from datetime import datetimefrom zoneinfo import ZoneInfoimport os# Provide the IANA time zone as an env vardisplay_time_zone = os.environ.get("DISPLAY_TIME_ZONE")# Time zone = The value of the TZ env var (UTC if not set)current_dt = datetime.now() # Time zone = the time zone provided via the DISPLAY_TIME_ZONE env varcurrent_dt = current_dt.astimezone(ZoneInfo(display_time_zone))

The datetime.astimezone() function takes a ZoneInfo object as an argument, which you can construct with the IANA time zone identifier.

As a warning, do NOT use the datetime.replace() function to set the tzinfo attribute for this scenario, since it will only attach a time zone to the datetime object and not adjust the actual hours according to the time zone offset. So what you expect to be "now" displayed in the specified time zone would actually be off by the difference between the original and new time zone offsets, resulting in a skewed time frame.

Summary

With these two tips, you can effectively set the time zone of datetime objects in your Lambda function in Python. I hope you find this short blog post helpful and would encourage you to keep checking my blog for other useful information :)

5 Study Tips To Pass the GitHub Foundations Exam

Anthony Wat — Mon, 19 Feb 2024 06:48:14 GMT

Introduction

I have recently learned that GitHub Certifications went GA in January 2024 and decided to look into them. There are four available certifications covering various role-based topics:

GitHub Foundations
GitHub Actions
GitHub Advanced Security
GitHub Administrations

Since there was a 50% discount for the GitHub Foundations Certification exam (which won't be around for long), I decided to give it a try. As it is a "foundational" exam and GitHub is an intuitive platform overall, I naively put in only two days worth of preparation based on the recommended study resources from GitHub.

Well, I took the exam today and it was NOT a breeze. I got a passing score just barely (the passing score is about 70% based on 60 of 75 questions) and it hurt my ego a tiny bit :) But it also gave me the motivation to look back on my experience and share some insights on how you can prepare for the GitHub Foundations exam better than I did. With that said, let's now go through some tips!

Tip 1: Don't just rely on the officially recommended learning resources

Since this is a relatively new exam, there aren't as many study resources. So I used the GitHub Foundations learning path on MS Learn primarily to guide my study, which turned out to be insufficient.

While the learning path is well-structured, it probably covers only about 2/3 of everything listed in the study guide. Some topics also aren't covered to sufficient depth that some exam questions demand. Meanwhile, the Prepare for the GitHub Foundations Certification learning path on LinkedIn Learning also seems high-level on quick glance.

I've only realized that I was under-prepared after going through the exam study guide a couple hours before the exam, so I had to frantically review the GitHub documentation to supplement my knowledge. It was admittedly not very organized.

As more folks pass the exam and share study tips and resources on it (like I am doing here), I would encourage you to look for them on Google and in the r/github Subreddit to help you better prepare. I am already seeing some practice exams emerging, which will be helpful for validating your subject knowledge.

Tip 2: Go through each and every topic in the exam study guide

Don't let the MS Learn quizzes trick you into thinking that the exam is a walk in the park! The actual exam was quite comprehensive and covered most of the topics listed in the exam study guide. Some questions were also technical and required one to know the exact settings or commands.

I would recommend any prospective exam takers to go through each topic listed in the study guide and read up on them. GitHub Docs is the de facto source of information. I especially like how each topic typically provides, in addition to the detailed documentation, an Overview section to quickly learn about the subject. Here is a screenshot of the documentation page on authentication:

As a specific example, GitHub Codespaces overview covers the essential information about Codespaces and provides additional resources for the keen and the curious. Although I did not have time to go through them for this exam, I'll surely be doing so next time.

It is crucial that you go beyond surface-level understanding for better exam preparation, which the GitHub Docs will certainly help.

Tip 3: Go through the exercises in Microsoft Learn and GitHub Docs

To give credit where credit is due - the exercises on MS Learn are well-crafted and provide the hands-on experience that helps you build some muscle memory for your brain. The exercises are not terribly lengthy but still do a good job in teaching you the fundamentals of different GitHub features.

You will also find more exercises in GitHub Docs in the form of quickstarts. Similar to the aforementioned Overview section, the Quickstart section gives you step-by-step instructions to set up certain features within minutes for your evaluation. Here is a screenshot of the documentation page on GitHub Actions:

Using the same example earlier, the Quickstart for GitHub Codespaces helps you swiftly set up your own Codespace using a provided repository template.

Make sure that you incorporate hands-on exercises and allow sufficient time in your study plan as you prepare for the exam.

Tip 4: Explore and try things out on GitHub

If you are like me who learn best through experimentation, you would find it invaluable to just go at it on GitHub. The platform is already very accessible at the Free tier, so you can readily create new repositories and try out different features. There is also a 30-day free trial for GitHub Copilot, which aside from studying for the exam is also a great way to experience how generative AI can improve your productivity as a software practitioner. It's even free if you are a verified student, teacher, or maintainer of popular open source projects!

Perhaps the most effective way to learn is to gain practical experience if you use GitHub for work or open source projects. Lately I have been dipping my toe into Terraform AWS Provider contributions, from which I learned more how to use searches, workflows, pull requests, and open source contribution (which interestingly is a topic on the exam) on GitHub more effectively. Meanwhile, I also use GitHub for internal projects at work where I have more liberty to try out features like GitHub Actions, GitHub Copilot, and GitHub Codespaces with real-life scenarios. The experience helped me answer some exam questions quickly and confidently.

Either way, if you find yourself the motivation to explore and experiment with different features on GitHub, it will help tremendously with your study.

Tip 5: When in doubt, trust your gut

In my exam, I encountered a handful of questions on fundamental topics which I thought I am very proficient at, but the answers got me second guessing. One particular question asks to choose an answer that best fits the motivation of the design of a certain feature. Depending on how you read the question, two answers on the opposite ends both seem like the right answer!

Although these questions can be demotivating, the best course of action is to choose one answer and flag the question for latter review. In my case, it only took me only one hour to complete all questions, leaving me with a whole hour to review the flagged questions. The second pass on them was relatively quick after gaining more confidence from the "quick wins", leaving only a few harder questions for further contemplation. At that point, I just timeboxed myself a couple minutes for each question to make a choice and move on.

While my score wasn't high, I recognize that:

I can at best make educated guesses on topics that I did not prepare well for.
My gut is usually right based on my experience with AWS exams, where I usually end up with a significantly higher score than I expected.

So make sure that you have confidence of your knowledge and preparation going into the exam!

Summary

In this blog post, I have shared my recent experience on the new GitHub Foundations exam and some resulting tips on how you can better prepare for it. You'll be in a much better place with ample preparation that incorporates multiple study sources and reinforced learning through exercises and practical experience. You can find more information about these certifications here. I wish you best of luck with your studies and a worry-free exam. As for myself, I'd know better to prepare for the GitHub Actions exam, which is the one I am most interested in.

Be sure to also check out the Avangards Blog for more tips and articles about topics related to cloud and DevOps!

3 Ways To Publish and Use an AWS Lambda Layer in Terraform

Anthony Wat — Sun, 11 Feb 2024 08:19:12 GMT

Introduction

Last week, I was tasked with automating the delivery of a daily email report containing CloudWatch metrics summaries for a client. This involved performing max and mean calculations and presenting the data in a neatly formatted table. Opting for a Python-based Lambda solution seemed natural to me, leveraging the power of open-source libraries such as NumPy and tabulate. While coding and testing locally proceeded smoothly, upon testing the Lambda function on AWS, I encountered module import errors like the following:

[ERROR] Runtime.ImportModuleError: Unable to import module 'index': No module named 'numpy'

It was immediately apparent that the Lambda runtime lacked the libraries I need, so I searched for a solution which led to the use of Lambda layers. Along the way, I gathered some intriguing research insights, which prompted me to share them with the community through this blog post.

The problems that Lambda layer solves

The standard Lambda runtimes include the programming language runtime and core libraries, however many uses cases necessitates a need for additional third-party libraries. To address this extensibility challenge, AWS released the Lambda layer feature in 2018.

As explained in the AWS documentation, a Lambda layer is a zip file archive that contains supplementary code or data. Layers usually contain library dependencies, a custom runtime, or configuration files. While not demonstrated in the simple example in this blog post, a big advantage of using a layer is reusability as it can be shared across multiple functions.

About the sample scenario

For the purpose of this blog post, I have developed an example Lambda function that uses NumPy to create a list of 10 random integers between 1 and 100 and returns the average (mean) of the integers. It is far from a real-world scenario but is simple enough to illustrate the dependency on NumPy.

import numpy as npdef lambda_handler(event, context):    numbers = []    for i in range(10):        numbers.append(np.random.randint(1, 100))    avg = np.mean(numbers)    return {        'message': f'Average of 10 random numbers between 1 and 100: {str(avg)}'    }

After the Lambda function is deployed successfully, testing it in the AWS Management Console should yield results similar to the screenshot below:

The source code for the Lambda function and the associated Terraform configurations for all three methods are available in this GitHub repository. As you follow along and deploy the configuration for each method, be sure to verify the solution using the Lambda function.

Using this sample scenario, let's examine the three different methods of building and publishing a Lambda layer using Terraform.

Method 1: Use a layer from AWS Serverless Application Repository

Why build your own layer when someone else might have already done so? The first method, which is the simplest of the three, involves the use of a layer available in the AWS Serverless Application Repository. You can browse and search for an application, including layers, that have been shared publicly by the community or published by AWS. Behind the scenes, the application is deployed using CloudFormation.

While there is no available layer specifically for NumPy, there are a few layers for pandas, a data analysis and manipulation built on top of NumPy. In particular, AWS publishes a set of AWS SDK for pandas Lambda layers for different versions of Python. Since we are using Python 3.11, aws-sdk-pandas-layer-py3-11 is what we can use to indirectly access NumPy.

💡

For the full Terraform configuration of this method, refer to the 1_serverless_app_repo directory in the accompanying GitHub repository.

To deploy the layer, we can use the aws_serverlessapplicationrepository_cloudformation_stack Terraform resource as follows:

resource "aws_serverlessapplicationrepository_cloudformation_stack" "aws_sdk_pandas_layer" {  name           = "aws-sdk-pandas-layer-py3-11"  application_id = "arn:aws:serverlessrepo:us-east-1:336392948345:applications/aws-sdk-pandas-layer-py3-11"  capabilities = [    "CAPABILITY_IAM"  ]}

The application_id argument refers to the ARN of the application in the Serverless Application Repository, which can be found in the application page above. This resource exposes the outputs from the underlying CloudFormation stack via the outputs attribute. With aws-sdk-pandas-layer-py3-11, it provides a single output that represents the ARN of the deployed Lambda layer. Since outputs is a map, we will need to extract the value of the only value. The following snippet demonstrates how the value can be used in the layers argument of the aws_lambda_function resource:

resource "aws_lambda_function" "layer_example" {  function_name = "layer-example"  role          = aws_iam_role.layer_example.arn  description   = "Function demonstrating the use of a Lambda layer"  filename      = data.archive_file.layer_example_zip.output_path  handler       = "index.lambda_handler"  layers        = [[for k, v in aws_serverlessapplicationrepository_cloudformation_stack.aws_sdk_pandas_layer.outputs : v][0]]  runtime       = "python3.11"  # source_code_hash is required to detect changes to Lambda code/zip  source_code_hash = data.archive_file.layer_example_zip.output_base64sha256}

You should now be able to test the Lambda function layer-example and see that it returns a message with the average of the 10 random numbers.

Each SAR application's CloudFormation stack may have different outputs, so the way we extract the layer's ARN may not work all the time. Be sure to examine the output after deploying the application to find out which output is appropriate.

While it takes minimal effort, you are also limited by the availability of an application that has all the required libraries at a compatible runtime version that you need. In our example, AWS SDK for pandas contains extraneous libraries which we do not need, thus using it is an overkill. This layer will also not work if you require other libraries that are not part of the layer. This brings us to the next method where we can fully build and customize our own layer.

Method 2: Manually build the layer and publish it using Terraform

The second method involves building and publishing your custom layer. First we need to craft the zip file as per Packaging your layer content. For Python, one of the acceptable formats is a zip file that contains a python directory, which in turn contains the Python library dependencies. Following this README in the accompanying GitHub repository, we will create a requirements.txt file that specifies numpy as a dependency and use the following command to download the latest version of NumPy that is compatible with the standard Lambda runtime:

pip install --platform=manylinux_2_17_x86_64 --only-binary=:all: -r requirements.txt -t .

💡

for the full Terraform configuration of this method, refer to the 2_pip_manual directory in the accompanying GitHub repository.

After we create the zip that contains the python directory that has the downloaded NumPy distributable files, we can publish the Lambda layer in Terraform using the aws_lambda_layer_version resource as follows:

resource "aws_lambda_layer_version" "example_lambda_layer" {  description         = "Example lambda layer"  filename            = "${path.module}/lambda_layers/example-lambda-layer.zip"  layer_name          = "example-lambda-layer"  compatible_runtimes = ["python3.11"]  source_code_hash    = filebase64sha256("${path.module}/lambda_layers/example-lambda-layer.zip")}

We can subsequently deploy the Lambda function with this newly published layer as follows:

resource "aws_lambda_function" "layer_example" {  function_name    = "layer-example"  role             = aws_iam_role.layer_example.arn  description      = "Function demonstrating the use of a Lambda layer"  filename         = data.archive_file.layer_example_zip.output_path  handler          = "index.lambda_handler"  layers           = [aws_lambda_layer_version.example_lambda_layer.arn]  runtime          = "python3.11"  source_code_hash = data.archive_file.layer_example_zip.output_base64sha256}

You should now be able to test the Lambda function layer-example and see that it returns a message with the average of the 10 random numbers.

This method is great in that it includes only the libraries we need, keeping things nice and clean. However, it still takes some manual work so we ought to automate the process. Thus, we now look at the final method of implementing a pipeline for it.

My first attempt at testing this method by manually downloading NumPy from PyPI and extracting the whl file resulted in an ImportError when the Lambda function is run. This is due to the problem described in this GitHub issue, and the recommendation is to use pip to download the libraries properly using a similar command as I've used above.

Method 3: Build and publish the layer automatically using AWS CodeBuild

The last method takes a step further and use AWS CodeBuild to automatically build and publish the Lambda layer.

💡

for the full Terraform configuration of this method, refer to the 3_pip_codebuild directory in the accompanying GitHub repository. The 1_build subdirectory contains the Terraform configuration that creates the CodeBuild project, while the 2_deploy subdirectory contains the Terraform configuration that deploys the Lambda function. You must trigger a build manually after the CodeBuild project is completed before you apply the Terraform configuration in 2_deploy.

The CodeBuild project requires a buildspec file that describes the build process. The following example serves our purpose well:

version: 0.2phases:  install:    runtime-versions:      python: 3.11  build:    commands:      - mkdir python      - cd python      - |        cat <<'EOF' > requirements.txt        numpy        EOF      - pip install --platform=manylinux_2_17_x86_64 --only-binary=":all:" -r requirements.txt -t .      - cd ..      - zip -r9 $LAYER_NAME.zip python  post_build:    commands:      - test "$CODEBUILD_BUILD_SUCCEEDING" = "1"      - aws lambda publish-layer-version --layer-name $LAYER_NAME --description "Example Lambda layer" --compatible-runtimes python3.11 --zip-file fileb://${LAYER_NAME}.zip

As you can see, the build process creates the zip file in the build phase, and publishes the layer using an AWS CLI command in the post_build phase when the previous phase is successful. Now we can create the CodeBuild project with all necessary dependencies using Terraform using the snippet below:

resource "aws_iam_role" "example_lambda_layer" {  name = "ExampleLambdaLayerCodeBuildRole"  assume_role_policy = jsonencode({    Version = "2012-10-17",    Statement = [      {        Action = "sts:AssumeRole",        Effect = "Allow",        Principal = {          Service = "codebuild.amazonaws.com"        }      }    ]  })}resource "aws_iam_role_policy" "example_lambda_layer" {  name = "ExampleLambdaLayerCodeBuildPolicy"  role = aws_iam_role.example_lambda_layer.id  policy = jsonencode({    Version = "2012-10-17"    Statement = [      {        Action = [          "logs:CreateLogGroup",          "logs:CreateLogStream",          "logs:PutLogEvents"        ]        Effect   = "Allow"        Resource = "arn:${local.partition}:logs:${local.region}:${local.account_id}:log-group:/aws/codebuild/*"      },      {        Action = [          "lambda:PublishLayerVersion"        ],        Effect   = "Allow"        Resource = "arn:${local.partition}:lambda:${local.region}:${local.account_id}:layer:*"      }    ]  })}resource "aws_codebuild_project" "example_lambda_layer" {  name         = "example-lambda-layer-build"  description  = "Build project for example-lambda-layer"  service_role = aws_iam_role.example_lambda_layer.arn  artifacts {    type = "NO_ARTIFACTS"  }  environment {    compute_type                = "BUILD_GENERAL1_SMALL"    image                       = "aws/codebuild/amazonlinux2-x86_64-standard:5.0"    type                        = "LINUX_CONTAINER"    image_pull_credentials_type = "CODEBUILD"    environment_variable {      name  = "LAYER_NAME"      value = "example-lambda-layer"    }  }  source {    buildspec = file("${path.module}/codebuild/buildspec.yml")    type      = "NO_SOURCE"  }}

Once the CodeBuild project is created, you must manually start a build to publish the custom layer. Ensure that you check the build log after the run and confirm that all phases are successful:

Next, we deploy the Lambda function that uses the layer that was published via AWS CodeBuild. Since the deployment is now separate, we must use the aws_lambda_layer_version data source to look up the ARN:

data "aws_lambda_layer_version" "example_lambda_layer" {  layer_name = "example-lambda-layer"}

Then we can use the ARN when provisioning the Lambda function as follows:

resource "aws_lambda_function" "layer_example" {  function_name = "layer-example"  role          = aws_iam_role.layer_example.arn  description   = "Function demonstrating the use of a Lambda layer"  filename      = data.archive_file.layer_example_zip.output_path  handler       = "index.lambda_handler"  layers        = [data.aws_lambda_layer_version.example_lambda_layer.arn]  runtime       = "python3.11"  # source_code_hash is required to detect changes to Lambda code/zip  source_code_hash = data.archive_file.layer_example_zip.output_base64sha256}

You should now be able to test the Lambda function layer-example and see that it returns a message with the average of the 10 random numbers.

This method is the best of the three as it aligns well with DevOps methodologies.

Summary

In this blog post, we learned about three ways of building and publishing a Lambda layer from the simplest to the most flexible using Terraform. I hope you find this information helpful as you run into this common scenario and need a Infrastructure as Code (IaC) solution. Make sure to check out my other blog posts and learn more about other AWS and Terraform tips and tricks!

My Quest To Finding the Perfect AWS Resource Naming Scheme

Anthony Wat — Wed, 24 Jan 2024 18:33:21 GMT

Introduction

There are only two hard things in Computer Science: cache invalidation and naming things. -- Phil Karlton

I am sure that you've seen this famous quote in pretty much any article about naming conventions, and this blog post is no exception because the quote resonates to my experience as a cloud consultant. In the field, what I often see is that clients with smaller AWS environment do not have a consistent naming scheme due to ad-hoc usages, while larger clients who are looking to improve governance have a hard time defining an agreeable naming convention in the organization. In many cases, I'd ask the clients for resource naming preferences during engagements and they would turn the question around asking us for recommendation.

So, it is imperative that I can readily provide a naming convention that works for most cases but can also be customized to client needs as necessary. This calls for some R&D for a prescriptive naming scheme, which I now share my experience with you.

💡

If you wish to skip to the final recommended naming scheme, scroll down to the Adding my own tweaks for use in AWS section.

Defining the requirements

In my mind, a good naming scheme should exhibit the following traits:

It must be standardized with a well-defined schema which ideally can be parsed.
It must facilitate easy sorting, filtering, and identification of resources.
It must capture key information about the resource, such as usage and location.
It must adhere to limits such as case and length imposed by the cloud provider.

With these requirements, I set forth to find the best convention for naming AWS resources.

In search for cloud provider guidance

One thing which I find AWS usually does better than other cloud providers is information development. You'll always find helpful literatures and design patterns from AWS Blog and AWS Prescriptive Guidance, or learn something new from AWS Workshops. It is thus logical to search for resource naming recommendations from these AWS resources. However, there is eerily little information from AWS on resource naming for some reason. The only literature I found is an archived whitepaper on tagging best practices that recommended a dot notation .. (for example, prod.public-az1.subnet) which frankly is not super descriptive. The current version of the whitepaper doesn't even include the naming convention! As someone who prefers to follow the canon, I found this to be quite odd.

Being a multi-cloud practitioner, I then looked towards the next contender, Microsoft Azure, to provide more official guidance on resource naming. Indeed, the Cloud Adoption Framework for Azure has a comprehensive page on this topic. The framework recommends the naming convention ---- (for example, pip-sharepoint-prod-westus-001), which is better because it captures more hints that describe the purpose of the resource. The article goes on to list the key naming components, strategies such as scoping and abbreviation, and numerous resource-specific examples with more variations for comparison. While there aspects I can nitpick on, such as the ordering of naming components and the padded instance suffix, the general dash-separated format makes a lot of sense. Thus I considered this naming convention to be my baseline.

Scouring the community for inspirations

I also did more research on the internet for additional inspirations. Most Google search results refer to web pages and forum discussions recommending different variations of naming format similar to that of the Azure CAF, which is a good sign to being on the right track. That being said, I found one blog post named Cloud Naming Convention by Stepan Stipl to be especially thoughtful. Although the naming convention was developed for Google Cloud, it is very much applicable to other clouds. Stefan recommends the naming convention ------ with the following schema:

Component	Description	Required	Constraints
`prefix`	Fixed prefix	Yes	[a-z][a-z0-9]{3}
`project`	Project name	Yes	[a-z0-9]{4-10}
`env`	Environment	Yes	[a-z], from enum
`resource`	Resource type	Yes	[a-z]{3}, from enum
`location`	Resource location	No	[a-z0-9]{1,6}
`description`	Additional description	No	[a-z0-9]{1,20}
`suffix`	Random suffix	No	[a-z0-9]{4}

For details about the naming components, please read Stepan's blog post.

VPC: ste-blog-p-cne-primary
Subnet: ste-blog-p-csn-euwe1-primary

There is so much I like about this naming convention:

The format is lean and concise thanks to length limits and optional components.
Most important information is incorporated into the naming convention.
The components are ordered from general to specific, which helps with sorting and identifying resources just by eyeballing.
The length limits are compatible with that of resources for all cloud providers. The naming convention caters mostly to the limit of 63 characters for Google Cloud, but it is applicable also to Azure and AWS.

And so I adopted this naming convention as my new baseline.

Adding my own tweaks for use in AWS

What remains is to tweak the baseline from Stepan so that it works better for AWS resources. Albeit opinionated, here are the changes I made:

Consider prefix to be optional, since it's not super useful in common use cases.
Generalize project to also identify workload, application, or general usage. This allows one to quickly identify resources by category.
Allow a handful of characters for env , such as dev|qas|prd for a three-letter convention (SAP folks may recognize this three-system landscape naming) or dev|test|prod for a slightly more spelled-out variation, for better readability.
Use a short, common-sense name while referencing resource type in ARNs as needed for resource. ARNs typically have a service segment and a resource-type segment - see this GitHub gist courtesy of Cory Mawhorter for a list of example. For example, I would use sg for security groups and ssmdoc for SSM documents.
Use region abbreviation for location and combine it with an AZ suffix (a to f) for more efficient use of name real estate. I use this page from the Amazon S3 User Guide as the official source for the region abbreviations. For example, I would use use1a to represent the us-east-1 region's first AZ.

The resulting naming convention is ------ but with an updated schema as follows:

Component	Description	Required	Constraints
`prefix`	Fixed prefix	No	[a-z][a-z0-9]{3}
`usage`	Project, workload, application, or general usage	Yes	[a-z0-9]{4,10}
`env`	Environment	Yes	[a-z]{1,5}, from enum
`resource`	Resource type	Yes	[a-z0-9]{1,12}
`location`	Resource location (region + AZ)	No	[a-z0-9]{1,5}
`description`	Additional description	No	[a-z0-9]{1,20}
`suffix`	Random suffix	No	[a-z0-9]{4}

Here are a few examples of naming AWS resources using this scheme:

VPC: shrsvcs-prod-vpc-use1
Subnet: shrsvcs-prod-subnet-use1a-private
EC2: ecommerce-dev-ec2-usw1a-web
RDS: app-stg-rds-cac1-postgres-wp

The limits and special cases

While the naming convention is applicable to most resources, there are a few outliers which require different naming. Here are some cases that I tend to follow existing convention from AWS instead:

IAM resources are generally named by AWS in camel case with a descriptive name such as IAMRoleForReachabilityAnalyzerCrossAccountResourceAccess.
CloudWatch log groups are named in segments separated by / , such as /aws/clientvpn/networking-prod-clientvpn-cac1. Similar naming is employed for other resources such as AWS Secrets Manager secrets or SSM parameters.
S3 bucket names must be globally unique, thus I would add prefix using the company name abbreviation and suffix using the AWS account ID, for example, avg-app-prod-s3-use1-webassets-111111111111 .

This list is by no means exhaustive and you may find other cases that warrant a deviation from the general naming convention.

During my research, I came across the Terraform null label module that facilitates generation of consistent names and tags for resources. The naming convention is component-based and can be customized to match the recommended scheme above. This module is cloud agnostic and seems quite popular. Although the usage is clunky as one module definition is required per resource to be named or tagged, you can probably simplify it using the for_each meta-argument.

I also discovered that Azure provides an Azure Naming module as a helper to keep consistency on your resources names for Terraform. I am not sure how it fares in practice, but I am sharing this in case anyone is interested.

Summary

I have since used this custom naming convention on several AWS landing zone and migration/modernization projects with success. Clients appreciated a prescriptive naming convention that is logical, easy to follow, and often better than their existing scheme that was either defined in haste or inherited from their on-premises environments.

That being said, it is important to recognize that there is no one size that fits all, and you may have unique requirements that necessitates further customization. So long as you have a naming convention that is standardized and captures important metadata, it is as good as any other. Along with a good tagging strategy, a sensible resource naming convention will improve the governance and management of your AWS resources.

I hope that you find this blog post helpful and can adopt the suggested resource naming convention for your resource on AWS and even other cloud providers. Be sure to check out my other blog posts for more AWS and DevOps topics, and let me know what you'd like to learn more about!

Building a Dynamic AWS Security Group Solution With CSV in Terraform

Anthony Wat — Tue, 16 Jan 2024 17:48:44 GMT

Introduction

In my recent blog post about effective AWS security group management in Terraform, I delved into valuable tips based on my experiences. This exploration reignited my interest for a shelved side project from a past DevOps engagement.

The concept revolves around using a comma-separated value (CSV) file to manage security group settings, which offers a streamlined approach to deploying security groups via Terraform. This not only facilitates a centralized GitOps deployment model but also caters to security analysts less familiar with Terraform, simplifying their operational workflow.

Over the weekend, I spent some time to design and develop this solution. The journey was not without its challenges which required some creative problem-solving and experimentation. As I believe fellow DevOps engineers can benefit from this approach, it motivated me to document and share my insights in this blog post.

To enhance the post's readability, I'll explain concepts using code snippets. You'll find the fully runnable Terraform configurations in the accompanying GitHub repository, with each step/section conveniently highlighted. Ready to embark on this mini journey towards a CSV-based solution for security group management? Let's dive right into the design and implementation!

Defining the solution requirements

The main goal of the solution is to provide a simpler way of managing AWS security groups (and specifically the rules) with a CSV file instead of Terraform variables and configuration. Here are some specific requirements to ensure flexibility of the solution:

Use a single CSV file to manage all security groups of the Terraform stack.
Support for all source/destination types - IPv4 CIDRs, IPv6 CIDRs, prefix lists, and security groups.
Support for dynamic values (for example, using the ID from a security group resource that is provisioned in the same Terraform configuration).

For the purpose of explaining the solution, let's consider a target workload that is a traditional three-tier Linux, Apache, MySQL, PHP (LAMP) web application running on AWS with the following architecture:

To keep things simple, each security group for the resources (ALB, web server, MySQL instance) will have an egress rule that allows all outbound traffic. As for ingress rules, the requirements are as follows:

Resource	Source	Protocol and port
ALB	Internet (0.0.0.0/0)	TCP 443 (HTTPS)
Web server	ALB (10.0.0.0/24)	TCP 80 (HTTP)
MySQL instance	Web server (10.0.1.0/24)	TCP 3306 (MySQL)

As we develop the Terraform configuration in this blog post, we will focus only on creating the security group resources and a VPC resource which the security groups can be associated with. If you are interested in seeing a full solution in action, feel free to add the configuration to provision the network and workload resources at your leisure.

Defining the security group rule CSV file format

As per the requirements, the CSV file must support both ingress and egress rule definitions of all types for resources in the stack. The file format must adhere to the RFC 4180 specification, which is required by the Terraform csvdecode function that we will use in our Terraform configuration later. Considering the required attributes for the aws_vpc_security_group_ingress_rule and aws_vpc_security_group_egress_rule resources, the CSV file schema can be defined as follows:

Column	Description	Example
`resource_name`	The logical name of the resource to which the security group rules apply to.	`web`
`type`	One of: `ingress`, `egress`	`ingress`
`name`	The name of the rule.	`http-alb`
`description`	The description of the rule.	`Allow HTTP access from the public subnet CIDRs`
`cidr_ipv4`	The source or destination IPv4 CIDR range.	`10.0.0.0/24`
`cidr_ipv6`	The source or destination IPv6 CIDR range.	`2001:db8::`
`prefix_list_id`	The ID of the source or destination prefix list.	`pl-07b7b831714d4596a`
`referenced_security_group_id`	The source or destination security group that is referenced in the rule.	`sg-0023839dc98251128`
`ip_protocol`	The IP protocol name or number.	`-1` (all protocols), `tcp`
`from_port`	The start of port range for the TCP and UDP protocols, or an ICMP/ICMPv6 type.	`443`
`to_port`	The end of port range for the TCP and UDP protocols, or an ICMP/ICMPv6 code.	`443`

Since a security group rule expects only one of the four source or destination types, three of them would be optional for each rule. In the CSV file, we will leave those values empty, which will resolve to an empty string as you will see later.

To test the solution, let's define a file called sg-rules.csv with the following content that specifies all required rules for the ALB, web server, and MySQL instance. We will also start simple and use static values, specifically the subnet CIDR ranges, for the inbound rule sources.

resource_name,type,name,description,cidr_ipv4,cidr_ipv6,prefix_list_id,referenced_security_group_id,ip_protocol,from_port,to_portdb,ingress,postgres-web,Allow MySQL access from the private (web) subnet CIDRs,10.0.1.0/24,,,,tcp,3306,3306db,egress,all,Allow all outgoing traffic,0.0.0.0/0,,,,-1,-1,-1web,ingress,http-public,Allow HTTP access from the public subnet CIDRs,10.0.0.0/24,,,,tcp,80,80web,egress,all,Allow all outgoing traffic,0.0.0.0/0,,,,-1,-1,-1alb,ingress,https-all,Allow HTTPS access from the the internet,0.0.0.0/0,,,,tcp,443,443alb,egress,all,Allow all outgoing traffic,0.0.0.0/0,,,,-1,-1,-1

Terraform will not parse a CSV file correctly if it has a UTF-8 BOM format. Software such as Excel may create files with a byte order mark (BOM). Ensure that you convert the CSV file to ANSI format without BOM before providing it to Terraform.

Developing the basic Terraform configuration

Now that we have defined the CSV file format, let's write the Terraform configuration. For the base resource definitions, we will use something similar to what is explained in tip #2 of my previous blog post, for example:

# TODO: Adapt this to CSV inputresource "aws_vpc_security_group_ingress_rule" "web" {  for_each                     = var.web_security_group_rules.ingress  security_group_id            = aws_security_group.web.id  cidr_ipv4                    = try(each.value.cidr_ipv4, null)  cidr_ipv6                    = try(each.value.cidr_ipv6, null)  prefix_list_id               = try(each.value.prefix_list_id, null)  referenced_security_group_id = try(each.value.referenced_security_group_id, null)  from_port                    = each.value.from_port  ip_protocol                  = each.value.ip_protocol  to_port                      = each.value.to_port}

We will need to load the CSV file and use its content, which can be done using the csvdecode function and the file function in a local value. The loaded content will be a list of objects each representing a row in the CSV file. For column values that are not specified, they will be loaded as an empty string which we will need to convert to null in some arguments. Now we need to adapt it to a map which is easier to supply to the for_each meta-argument:

locals {  sg_rules_csv = csvdecode(file("${path.module}/sg-rules.csv"))  sg_rules     = { for e in local.sg_rules_csv : "${e.resource_name}-${e.type}-${e.name}" => e }}

To ensure uniqueness, we will use a combination of the resource name, rule type, and rule name as the map key. With a friendlier structure to for_each, let's update the rule resource definition to create rules based on the map. Here is an example of the web server security group ingress rule resources:

resource "aws_vpc_security_group_ingress_rule" "web" {  for_each                     = { for k,v in local.sg_rules : "${v.name}" => v if v.resource_name == "web" && v.type == "ingress" }  security_group_id            = aws_security_group.db.id  cidr_ipv4                    = try(each.value.cidr_ipv4 != "" ? each.value.cidr_ipv4 : null, null)  cidr_ipv6                    = try(each.value.cidr_ipv6 != "" ? each.value.cidr_ipv6 : null, null)  prefix_list_id               = try(each.value.prefix_list_id != "" ? each.value.prefix_list_id : null, null)  referenced_security_group_id = try(each.value.referenced_security_group_id != "" ? each.value.referenced_security_group_id : null, null)  from_port                    = each.value.from_port  ip_protocol                  = each.value.ip_protocol  to_port                      = each.value.to_port}

The value of for_each is the result of a for loop that filters for the ingress rules relevant to the web server. The destination attribute values are also updated to check for empty string and set the value to null.

You can find the complete Terraform stack up to this point in the basic directory of the GitHub repository that accompanies this blog post.

Now you can apply the Terraform configuration and see that it completes successfully. For good measure, verify in the AWS Management Console that the security groups are created with the correct set of rules (particularly the inbound rules that refer to the subnet CIDRs).

Adding variable support - first attempt

While referring to subnet CIDRs works, it does not offer the best security following the least privilege principle. For instance, other future workloads that are deployed to the web private subnet may be able to access the MySQL instance. As an improvement, the destination of the existing security group ingress rules should instead point to the appropriate workload security group.

Since the security groups are provisioned when the Terraform configuration is applied, their IDs are only known after they are created. It is certainly not desirable to manually copy the IDs into the CSV file afterwards, so we need to find a way to dynamically inject them. To address this, we can consider employing variable substitution.

As many Terraform practitioner knows, there is a templatefile function that can read a file while replacing template variables in the file content. (There is also a template_file data source but it is now considered deprecated.) Let's update the CSV file to use template variables to inject security group IDs in runtime like so:

resource_name,type,name,description,cidr_ipv4,cidr_ipv6,prefix_list_id,referenced_security_group_id,ip_protocol,from_port,to_portdb,ingress,postgres-web,Allow MySQL access from the private (web) subnet CIDRs,,,,${web_sg_id},tcp,3306,3306db,egress,all,Allow all outgoing traffic,0.0.0.0/0,,,,-1,-1,-1web,ingress,http-public,Allow HTTP access from the public subnet CIDRs,,,,${alb_sg_id},tcp,80,80web,egress,all,Allow all outgoing traffic,0.0.0.0/0,,,,-1,-1,-1alb,ingress,https-all,Allow HTTPS access from the the internet,0.0.0.0/0,,,,tcp,443,443alb,egress,all,Allow all outgoing traffic,0.0.0.0/0,,,,-1,-1,-1

We also need to update the local value that loads the CSV file as follows:

locals {  sg_rules_csv = csvdecode(templatefile("${path.module}/sg-rules.csv", {    "alb_sg_id" = aws_security_group.alb.id    "web_sg_id" = aws_security_group.web.id  }))  sg_rules     = { for e in local.sg_rules_csv : "${e.resource_name}-${e.type}-${e.name}" => e }}

You can find the complete Terraform stack up to this point in the dynamic_attempt_1 directory of the GitHub repository that accompanies this blog post.

All is seemingly well, however when we apply the Terraform configuration, it fails with a few errors similar to the one below:

Error: Invalid for_each argument   on main.tf line 38, in resource "aws_vpc_security_group_ingress_rule" "db":   38:   for_each                     = { for k, v in local.sg_rules : "${v.name}" => v if v.resource_name == "db" && v.type == "ingress" }           local.sg_rules will be known only after apply The "for_each" map includes keys derived from resource attributes that cannot be determined until apply, and so Terraform cannot determine the full set of    keys that will identify the instances of this resource. When working with unknown values in for_each, it's better to define the map keys statically in your configuration and place apply-time results only in the    map values. Alternatively, you could use the -target planning option to first apply only the resources that the for_each value depends on, and then apply a second time   to fully converge.

So what exactly are these errors about, and how do we fix them?

Fixing the for_each key issue and finalizing the solution

As the error message explained, for_each requires that the map keys be known during plan time. In fact, it is a frustratingly common problem that many Terraform practitioners have encountered. The limitation is also explained in the for_each documentation.

The problem is that the sg_rules map is derived from the sg_rules_csv local value, which is loaded using the templatefile function with template variable replacement. Due the replacement, the loaded CSV file content is no longer considered static (for instance, I could substitute entire rows into the content). Although to us, it should have been fair game because we are technically only replacing some values in a row instead of replacing entire rows.

Since we are sure that most of the contents, particularly the fields that comprises the key for sg_rules, we can build a static list off of it for looping while using the dynamically loaded CSV file content for all other information. For this, we will need some new local values:

locals {  sg_rules_csv = csvdecode(templatefile("${path.module}/sg-rules.csv", {    "web_sg_id" = aws_security_group.web.id    "alb_sg_id" = aws_security_group.alb.id  }))  sg_rule_names = [for e in csvdecode(file("${path.module}/sg-rules.csv")) : "${e.resource_name}-${e.type}-${e.name}"]  sg_rules      = { for e in local.sg_rules_csv : "${e.resource_name}-${e.type}-${e.name}" => e }}

Notice that there is now a new list called sg_rule_names, which contains the map key names, using the CSV file content loaded using the vanilla file function with has no variable substitution. Meanwhile, we keep the sg_rule_csv and sg_rules values the same. What's important is that sg_rule_names must contain the exact list of keys in sg_rules as we have coded.

We can now update the for_each value in the rule resources to iterate using the static list of keys in local.sg_rule_names while fetching rule settings from local.sg_rules below:

resource "aws_vpc_security_group_ingress_rule" "web" {  for_each                     = { for k in local.sg_rule_names : k => local.sg_rules[k] if startswith(k, "web-ingress") }  security_group_id            = aws_security_group.web.id  cidr_ipv4                    = try(each.value.cidr_ipv4 != "" ? each.value.cidr_ipv4 : null, null)  cidr_ipv6                    = try(each.value.cidr_ipv6 != "" ? each.value.cidr_ipv6 : null, null)  prefix_list_id               = try(each.value.prefix_list_id != "" ? each.value.prefix_list_id : null, null)  referenced_security_group_id = try(each.value.referenced_security_group_id != "" ? each.value.referenced_security_group_id : null, null)  from_port                    = each.value.from_port  ip_protocol                  = each.value.ip_protocol  to_port                      = each.value.to_port}

Note that we also need to use static values in the if condition in the for_each loop instead of attributes in the map values. While is it not the most elegant solution, the hardcoding is still passible because we know specifically the resource and type to which the resource block is applicable.

You can find the complete Terraform stack up to this point (which is also the final solution) in the final directory of the GitHub repository that accompanies this blog post.

With the updated configuration, the Terraform configuration can now be applied successfully! Please make sure that you verify the security groups and their rules in the AWS Management Console.

Maintaining the solution

In terms of maintenance, whenever you need to refer to new values that are derived from new resources such as:

Subnet CIDR blocks (for example, aws_subnet.private.cidr_block)
Managed prefix lists (for example, aws_ec2_managed_prefix_list.office_vpn.id)
Security groups (for example, aws_security_group.msk.id)

You will need to do the following, while paying attention to use unique IDs:

Define new template replacement variables (for example, ${subnet_private_cidr}, ${office_vpn_prefix_list_id}, and ${msk_sg_id} per above).
Update the list of variables in the tempatefile argument of the sg_rules_csv local value.

Summary

Congratulations, you have just built a CSV-based solutions for managing AWS security groups using Terraform! By employing a sensible design and naming scheme, as well as thoughtfully using Terraform functions and constructs, all rule settings can be maintained in a single CSV file. The same design can be extended to other cloud provider's corresponding concepts, such as network security groups in Azure.

If you find this solution helpful, please check out my other blog posts or let me know what you'd like to learn more about!

5 Tips to Efficiently Manage AWS Security Groups Using Terraform

Anthony Wat — Mon, 15 Jan 2024 12:00:19 GMT

Introduction

As a DevOps professional, I've helped numerous customers build landing zones on AWS and develop Infrastructure as Code (IaC) using Terraform. One of the questions I hear most often is on how to manage infrastructure security following the least privilege principle. For network security such as firewalls and security groups, this means having to define very specific rules for ports and CIDR ranges, which can quickly get out of hand as the number of resources and applications increase. Over the years, I have learned about using different strategies to minimize the maintenance overhead. In this blog post, I share these tips with a focus on security groups in hope that it helps my fellow AWS engineers and architects. Let's dive right in!

Tip 1: Use the standalone resources to manage security group rules

While the aws_security_group resource supports in-line rule definitions using the ingress and egress configuration blocks, this usage is considered deprecated due to various legacy limitations. The recommended approach nowadays is to define ingress and egress rules using the aws_vpc_security_group_ingress_rule and aws_vpc_security_group_egress_rule resources respectively. These standalone resources allow more fine-grained control over the security group rules. The following is a basic example for managing a security group for Linux-based bastion hosts:

resource "aws_security_group" "bastion" {  name        = "app-prod-sg-use1-bastion"  description = "Security group for bastion hosts"  vpc_id      = aws_vpc.this.id}# Ingress rule that allows SSH access from the office's external gateway IPresource "aws_vpc_security_group_ingress_rule" "bastion_ssh_office" {  security_group_id            = aws_security_group.bastion.id  cidr_ipv4                    = "1.2.3.4/32"  from_port                    = 22  ip_protocol                  = "tcp"  to_port                      = 22}# Egress rule that allows all accessresource "aws_vpc_security_group_egress_rule" "bastion_all" {  security_group_id            = aws_security_group.bastion.id  cidr_ipv4                    = "0.0.0.0/0"  from_port                    = -1  ip_protocol                  = -1  to_port                      = -1}

💡

If you are curious about the naming convention I for resources in the examples such as app-prod-sg-use1-bastion, check out my blog post My quest to finding the perfect AWS resource naming scheme!

Tip 2: Utilize map variables and for_each to minimize boilerplate configuration

Any sane person would find it unmanageable to maintain one big Terraform configuration file with tens and hundreds of security group rule resource blocks that look largely the same. This is where the for_each meta-argument can drastically reduce boilerplate code.

for_each takes a map or string set and creates an instance of a resource for each map or set item. For managing security group rules, you can store the rule information in a map and use for_each with the aws_vpc_security_group_ingress_rule and aws_vpc_security_group_egress_rule resources. Here is an example for managing a security group for a web server which is accessed by bastion hosts and Application Load Balancer that resides in a public subnet:

# Terraform configuration (.tf)variable "web_security_group_rules" {  description = "The security group rules for the web servers."  type = object({    ingress = optional(map(object({      cidr_ipv4   = string      from_port   = number      ip_protocol = string      to_port     = number    })), {})    egress = optional(map(object({      cidr_ipv4   = string      from_port   = number      ip_protocol = string      to_port     = number    })), {})  })}resource "aws_security_group" "web" {  name        = "app-prod-sg-use1-web"  description = "Security group for web servers"  vpc_id      = aws_vpc.this.id}resource "aws_vpc_security_group_ingress_rule" "web" {  for_each          = var.web_security_group_rules.ingress  security_group_id = aws_security_group.web.id  cidr_ipv4         = each.value.cidr_ipv4  from_port         = each.value.from_port  ip_protocol       = each.value.ip_protocol  to_port           = each.value.to_port}resource "aws_vpc_security_group_egress_rule" "web" {  for_each          = var.web_security_group_rules.egress  security_group_id = aws_security_group.web.id  cidr_ipv4         = each.value.cidr_ipv4  from_port         = each.value.from_port  ip_protocol       = each.value.ip_protocol  to_port           = each.value.to_port}

# Variable definition (.tfvars)web_security_group_rules = {  ingress_rules = {    "http-public-subnet" = {      cidr_ipv4   = "10.0.0.0/24"      from_port   = 80      ip_protocol = "tcp"      to_port     = 80    }    "https-public-subnet" = {      cidr_ipv4   = "10.0.0.0/24"      from_port   = 443      ip_protocol = "tcp"      to_port     = 443    }    "ssh-public-subnet" = {      cidr_ipv4   = "10.0.0.0/24"      from_port   = 22      ip_protocol = "tcp"      to_port     = 22    }  }  egress_rules = {    "all" = {      cidr_ipv4   = "0.0.0.0/0"      from_port   = 0      ip_protocol = -1      to_port     = 0    }  }}

The example can be extended to support other rule attributes as follows:

resource "aws_vpc_security_group_ingress_rule" "web" {  for_each                     = var.web_security_group_rules.ingress  security_group_id            = aws_security_group.web.id  cidr_ipv4                    = try(each.value.cidr_ipv4, null)  cidr_ipv6                    = try(each.value.cidr_ipv6, null)  prefix_list_id               = try(each.value.prefix_list_id, null)  referenced_security_group_id = try(each.value.referenced_security_group_id, null)  from_port                    = each.value.from_port  ip_protocol                  = each.value.ip_protocol  to_port                      = each.value.to_port}

This way, you can provide any one of the four source/destination types (IPv4 CIDR, IPv6 CIDR, prefix list, security group) in your map variable and have the rules dynamically created.

Meanwhile, those with more Terraform experience may have realized that this won't work very well in practice because variable values defined in tfvars files are static (that is, you cannot use a variable in another variable). This means you cannot specify, for instance, the ID of a prefix list resource created in the same configuration in your map variable. To address this, we could instead use a local value to define the rule configuration like this (replace var. with local. in the above configuration):

locals {  web_security_group_rules = {    ingress_rules = {      "http-alb" = {        referenced_security_group_id = aws_security_group.alb.id        from_port   = 80        ip_protocol = "tcp"        to_port     = 80      }      "https-alb" = {        referenced_security_group_id = aws_security_group.alb.id        from_port   = 443        ip_protocol = "tcp"        to_port     = 443      }      "ssh-public-subnet" = {        cidr_ipv4   = aws_subnet.public.cidr_block        from_port   = 22        ip_protocol = "tcp"        to_port     = 22      }    }    egress_rules = {      "all" = {        cidr_ipv4   = "0.0.0.0/0"        from_port   = 0        ip_protocol = -1        to_port     = 0      }    }  }}

With this design, all rule details are compactly defined in one place, thus improving maintenance and readability.

Tip 3: Use managed prefix lists to group CIDR blocks

In an enterprise environment, an AWS landing zone may have more sophisticated setup with:

Multiple availability zones to support high availability
Hybrid connectivity via VPN or Direct Connect
IP whitelisting for workload access

These features could increase the number of CIDR blocks that apply to security group rules and the complexity of your Terraform configuration. To combat this, you can take a consolidation approach using an often-overlooked VPC feature called the managed prefix list. A managed prefix list is a set of one or more CIDR blocks. You can use prefix lists to make it easier to configure and maintain your security groups and route tables.

The following is an example for a managed prefix list that groups the CIDR blocks for the public subnets that span multiple AZs:

resource "aws_ec2_managed_prefix_list" "public" {  name           = "app-prod-pl-use1-public"  address_family = "IPv4"  max_entries    = length(var.azs)}resource "aws_ec2_managed_prefix_list_entry" "public" {  # var.azs contains the list of AZs where public subnets exists (us-east-1a, us-east-1b, etc.)  for_each       = toset(var.azs)  cidr           = aws_subnet.public[each.key].cidr_block  description    = "CIDR block for the public subnet in AZ ${each.key}"  prefix_list_id = aws_ec2_managed_prefix_list.public.id}

You can then define security group ingress or egress rule resources with the prefix_list_id attribute set to aws_ec2_managed_prefix_list.public.id. As you can imagine, this is more manageable than having to define different sets of rules for however many AZs you are using. The solution is also useful for other aforementioned use cases such as keeping track of on-premises CIDR blocks or IP whitelists for restricted workloads.

Be aware of service quota issues related to the use of managed prefix lists. As described in the aws_ec2_managed_prefix_list resource documentation, the managed prefix list size is defined by the max_entries attribute, which counts towards the number of entries in a security group regardless of the actual number of entries. Ensure that you set max_entries to the exact number of entries or at least to a more conservative number, otherwise you may quickly hit the "Inbound or outbound rules per security group" service quota limit.

Tip 4: Consider roles and responsibilities when organizing security group resources

Your Terraform stack will inevitably become more complex and involve more cross-functional collaboration over time. In the context of security groups, you may encounter the following scenarios:

An IT security team manages and scrutinizes all that relates to firewalls, including security groups.
Different teams manage different aspects of a workload as a vertical. For example, DBAs may manage RDS resources while application development team manage EC2 and EKS resources.

Each scenario may necessitate different organizational strategies for your Terraform configuration for optimal efficiency. For example, you might want to organize resources by type (EC2, RDS, etc.) or by workload (Tableau, in-house application, etc.) In both cases, it makes sense to maintain the security group resources alongside the resources to which they are tied. Meanwhile, you might want to separate security group resources into its own file, so that the IT security team can focus on just that one file. Or you can combine both strategies to define a file structure like:

ec2.tfec2-sg-rules.tfrds.tfrds-sg-rules.tffs.tffs-sg-rules.tf

Having a good Terraform structure will drastically improve the developer experience for the end-users who are often not as well-versed in Terraform as a typical DevOps engineer.

Tip 5: Use a Terraform module from the community

For those who are looking for a turn-key solutions to better manage Terraform configurations, you may fancy the Terraform Registry and the plethora of modules that are available in it. Simply put, modules are containers for multiple resources that are used together. The following are the most popular community-developed modules for managing security groups on AWS:

These modules offer quality of life improvements by abstracting common configurations into simpler constructs and providing support for more complex scenarios. Chances are, these modules already employ some of the design patterns described in the earlier tips. Documentation for these modules are also decent, so you should be able to figure out how they work quickly.

On the flip side, you might find over time that these modules are not flexible enough to support your requirements because they are either too limited or are too opiniated. You could extend or work around module limitations, but that adds to the complexity which negates the benefits of using a module in the first place.

Due to the caveats, experienced Terraform practitioners may prefer building their own modules using vanilla Terraform constructs and resources. In practice, many of them have accumulated enough experience and reusable artifacts that developing custom modules is not terribly time-consuming.

I would recommend keeping an open mind and giving these modules a try first to see if they fit your need. They may very well be a big timesaver if you have mostly typical requirements or don't have a lot of R&D bandwidth. The key is to find the right balance and more importantly, figure out what works best for you and your team.

Bonus tip: Use a CSV-based solution

If you are interested in adopting a CSV-based solution to manage security groups in Terraform, take a look at my blog post Building a dynamic AWS security group solution with CSV in Terraform.

Summary

Effective use of Terraform features such as for_each and community modules, and AWS features such as managed prefix lists, will help you better manage security groups in AWS using Terraform.

As you gain more experience, you will also identify better ways to structure and develop your Terraform configuration based on your team's and organization's needs. Many of these tips also apply to general Terraform usages, so I hope you find this blog post helpful and can put the tips into practice.

Please also check out my other blog posts or let me know what you'd like to learn more about!

Migrating an S3 Bucket to Another AWS Account With S3 Replication

Anthony Wat — Sat, 27 May 2023 08:32:55 GMT

Introduction

Recently I've had the opportunity to assist a client transition from a single-account landing zone to a multi-account landing zone on AWS. As non-production workloads are moved to new accounts for better governance, some S3 buckets need to be moved. The client requires that the object versions and metadata (such as timestamps) be preserved since the context is important to the end users.

While there are multiple S3 data replication approaches, the only option that preserves last-modified timestamps is using Amazon S3 Replication. This feature takes care of the continuous replication of new objects from one S3 bucket to another. For one-time replication of existing objects, it can be achieved using S3 Batch Replication. One neat thing is that when you create a replication rule for the source S3 bucket, the wizard will provide you with an option to also create a batch job to replicate existing objects. Let's look at how to set this up with an example.

Procedures

For our scenario, we have an S3 bucket named replication-test-source-c7bwn2l4j in account A which we want to replicate to a new S3 bucket named replication-test-dest-fn7fcqjism in account B. Both buckets already have bucket versioning enabled, as it is a prerequisite for replication. Assuming the default, both buckets have server-side encryption with Amazon S3 managed keys (SSE-S3) enabled. Follow the steps below to configure the replication rule and batch replication job together:

Log in to account A.
In the S3 console, create a new bucket (for example, replication-report-m0htap9nqp) that will store the replication report generated by the batch replication job. This will help you troubleshoot in case the batch job fails. You don't need to enable bucket versioning on this S3 bucket.
Open the source bucket replication-test-source-c7bwn2l4j in the S3 console. Go the to Management tab and click Create replication rule.
On the Create replication rule page, enter the following information (leave all other settings as default) and click Save:
- Replication rule name: Replicate to replication-test-dest-fn7fcqjism
- Choose a rule scope: Apply to all objects in the bucket
- Destination bucket: Specify a bucket in another account
- Account ID: The ID of the AWS account where the destination bucket resides
- Bucket name: replication-test-dest-fn7fcqjism
- Change object ownership to destination bucket owner: Checked
- IAM role: Create new role
- Delete marker replication: Checked
When you are prompted whether you want to replicate existing objects, select Yes, replicate existing objects and click Submit.
On the Create Batch Operations job page, let's park it for now but don't close it, as we need to complete the setup here afterward.
In a new browser session, log in to account B where the destination S3 bucket resides.
Open the source bucket replication-test-dest-fn7fcqjism in the S3 console. Go to the Permissions tab and click Edit in the Bucket policy section.

Copy and paste the following policy JSON, replace the bucket name (###DEST_BUCKET_NAME###) and the source replication IAM role (###CRR_IAM_ROLE_ARN###), and click Save changes. You can find the source replication IAM role, which was created in step 4 above, in the IAM console in account A. The role would have a name with a standard prefix, which in our case is s3crr_role_for_replication-test-source-c7bwn2l4jg. Refer to official instructions from AWS for details.

 {    "Version":"2012-10-17",    "Id":"PolicyForDestinationBucket",    "Statement":[       {          "Sid":"Permissions on objects",          "Effect":"Allow",          "Principal":{             "AWS":"###CRR_IAM_ROLE_ARN###"          },          "Action":[             "s3:ReplicateDelete",             "s3:ReplicateObject"          ],          "Resource":"arn:aws:s3:::###DEST_BUCKET_NAME###/*"       },       {          "Sid":"Permissions on bucket",          "Effect":"Allow",          "Principal":{             "AWS":"###CRR_IAM_ROLE_ARN###"          },          "Action": [             "s3:List*",             "s3:GetBucketVersioning",             "s3:PutBucketVersioning"          ],          "Resource":"arn:aws:s3:::###DEST_BUCKET_NAME###"       }    ] }

Go back to the original browser session which we left off at step 6. On the Create batch replication job page, enter the following information (leave all other settings as default) and click Save:
- Path to completion report destination: Browse for replication-report-m0htap9nqp which was created in step 2
After being redirected to the Batch Operations page, wait for the batch replication job to complete. It may take from minutes to hours depending on how many existing objects there are in the source S3 bucket and their sizes. Ensure that % Complete is 100% and Total failed (rate) is 0 (0%).
Optionally, upload a new file/object to replication-test-source-c7bwn2l4j in account A. Then wait a few minutes and check whether the file is replicated to replication-test-dest-fn7fcqjism in account B.

We have now successfully fully replicated the source S3 bucket replication-test-source-c7bwn2l4j in account A to the destination bucket replication-test-dest-fn7fcqjism in account B. Since the purpose is to migrate the bucket, delete the source bucket and ensure that any consumers (end-users, applications, services, etc.) update their references to point to the new (destination) S3 bucket instead.

Additional thoughts and summary

It might be obvious, but S3 bucket names must be unique across all AWS accounts in all the AWS Regions within a partition as per the bucket naming rules. Consequently, you won't be able to reuse the bucket name. If you insist on reusing the bucket name, you could try to first replicate the original bucket to a temporary bucket, then delete the original bucket. After a few hours, the original bucket name might become available for use again. You can then create the destination bucket in the target account using the original name and complete the replication/migration. Be warned that there is NO guarantee that the original name would be made available again by AWS!

Following this method, I was able to migrate all my client's S3 buckets successfully. I also used the same procedures (without the cross-account bucket policy) to enable cross-region backup for a production S3 bucket. I hope you find this blog post useful should you ever find yourself in need of migrating S3 buckets.

Kubernetes-Native CI/CD With Tekton and ArgoCD

Anthony Wat — Fri, 31 Dec 2021 01:30:12 GMT

Introduction

Having learned about the DigitalOcean Kubernetes Challenge, I decided to take it on to upskill myself on trending DevOps tools for Kubernetes, and to evaluate DigitalOcean's container services. For the challenge, I implemented a Kubernetes-native CI/CD solution using Tekton build and ArgoCD for deployment. In this article, I will share my experience and sample code I've put together, so you too can try it out.

About the sample code

You can find the sample code for the challenge at https://github.com/acwwat/do-k8s-gitops. The folder structure is as follows:

The app folder contains the source code for the sample to-do list application that we will build into Docker images and deployed to Kubernetes. The frontend todo-list-frontend is a basic Vue.js application and the backend todo-list-backend is a basic Java Spring Boot application with an embedded H2 database.
The config folder contains the starter Kubernetes YAML file template which we will use to deploy the application to the Kubernetes cluster.
The tekton folder contains the Tekton Kubernetes YAML files that defines the various components that form the CI pipeline.

Clone or download this GitHub repository as we will need it to set up the CI/CD solution.

Creating the application and config repositories

You need to create two repositories - one for the application to drive CI and the other for the Kubernetes configuration to drive CD. I used GitHub myself, but you can also use other Git repositories so long as they are accessible by Tekton and ArgoCD.

For the application repository, check in the content of the app folder from the sample code repository.

For the configuration repository, check in the content of the config folder. Since the Kubernetes YAML file requires a hardcoded container image path from your DigitalOcean Container Registry, you need to copy todo-list/todo-list.yaml.template as todo-list/todo-list.yaml and replace the URL of your container registry once you have created it.

Creating access tokens for integration

You need to create the following access tokens for integration purposes:

A GitHub personal access token for the Tekton pipeline to update the Kubernetes YAML files in the configuration repository. The PAT must have all scopes under the repo category selected.
A DigitalOcean personal access token for the Tekton pipeline to push the application Docker images to the DigitalOcean Container Registry. The PAT must have the write scope selected.

Copy the tokens and set them aside for now, as you will later store them as Kubernetes secrets.

Installing CLI tools

To keep things simple, I chose to favor GUIs over CLI tools for this challenge. However you still need to install a few CLI tools as follows:

Creating the Kubernetes cluster in DigitalOcean

Setting up a Kubernetes cluster can be very involved, but DigitalOcean has made it simple. Once you have your account and logged into the dashboard, create a project and then create a Kubernetes cluster. For the challenge, I used the following settings:

After the Kubernetes cluster is provisioned after several minutes, install the NGINX Ingress Controller as a 1-Click App on the cluster details page:

An ingress controller is not a must, however I'd like to evaluate it on DigitalOcean for practical purposes. Allow it a few minutes to deploy. When it is done, run the doctl command from the cluster details page to connect kubectl to your cluster:

Creating the Kubernetes Container Registry

You need a container registry to store the application container images, so this is our chance to also evaluate the DigitalOcean Container Registry. In the DigitalOcean Dashboard, create a registry with the Basic subscription plan, as we require two repositories (one per Docker image) in the registry. When the registry is created, enable integration to your Kubernetes cluster in the settings.

Don't forget to create the Kubernetes YAML file in the configuration Git repository as mentioned earlier!

Installing Tekton

Tekton is an open-source, Kubernetes-native CI/CD framework. Tekton itself is a Kubernetes application and it provides custom resources as building blocks for pipelines. It took me a while to get past the finer granularity vs. GitHub Actions and Azure Pipelines, but I also appreciate its flexibility and tight integration with Kubernetes. In this challenge, I chose to implement CD using ArgoCD, however using Tekton for CD is also perfectly doable.

Installing Tekton is relatively simple. Follow the Getting Started guide to install the core components and set up the ConfigMaps for persistent volumes. You can also install the Tekton CLI and run the sample task if you like, but we won't use the CLI in this exercise. By default, Tekton uses the default service account in the cluster to run pipelines. While it would be a security best practice to create a separate service account for CI/CD, we will just keep the default for now.

As well, install the Tekton Dashboard which we will use to view pipeline progress. To not expose the Tekton Dashboard to the internet, we can use the port forwarding option to map a local port to the dashboard HTTP port. Open a new command prompt and run the following command:

kubectl --namespace tekton-pipelines port-forward svc/tekton-dashboard 9097:9097

Then open http://localhost:9097 in a web browser and verify that the dashboard is accessible.

Installing ArgoCD

Now let's switch gear to ArgoCD. ArgoCD is a declarative, GitOps CD tool for Kubernetes. Its concept is simple - detect changes between current state in the Kubernetes cluster and desired state per configuration in Git, and apply them either on-demand or automatically. With configurations in Git, various DevOps best practices such as version control, code review, and CD are possible.

Installing ArgoCD is also very simple - just follow the Getting Started guide. We won't need the CLI for the exercise, but feel free to install it when going through the guide. Once installation completes, you can forward the server port to access the ArgoCD UI without exposing it to the internet. Open a new command prompt and run the following command:

kubectl port-forward svc/argocd-server -n argocd 8080:443

Then open http://localhost:8080 in a web browser and log in as admin using the initial password from the secret (or the updated password if changed via the CLI), then verify that the dashboard is accessible.

Deploying the Tekton CI pipeline

Now we are ready to deploy the Tekton CI pipeline for the sample application. Before you start, take a look at the Concepts page in the Tekton documentation to understand what the building blocks are and how they interact with one another. But in essence, we need to know the following:

A pipeline consists of tasks to perform the required build operations.
A task consists of steps to encapsulate a build operation, such as cloning a Git repository or updating Kubernetes YAML files in the configuration Git repository.
Tekton provides a catalog of shared tasks and pipelines that can be used out of the box.
A step is an operation run in a container in the Kubernetes cluster.
A volume-based workspace is used by tasks and steps to share data across the pipeline

Coming from a traditional CI/CD world of Jenkins and such, running a step in a container seems awfully inefficient to me. However it is not the case, since containers spin up relatively quickly and Tekton handles the lifecycle anyway.

The CI pipeline for our sample to-do list application consists of the following tasks:

Use the git-clone task from the Tekton catalog to clone the application repository into a (shared) workspace.
Use the kaniko task from the Tekton catalog to build both todo-list-frontend and todo-list-backend using the provided Dockerfiles, and push the resulting images to the DigitalOcean Container Registry. Kaniko, by the way, is a tool to build container images inside a Kubernetes cluster.
Use a custom task we put together to clone the configuration repository, update the container image version in the Kubernetes YAML file, and commit the changes to Git (for ArgoCD to pick up).

Let's now set up the Tekton pipeline for the sample application.

Setting up the prerequisites

Perform the following steps to provision everything the pipeline needs in the Kubernetes cluster. There are many steps, so hang in there!

Create a namespace called todo-list-ci to host the Tekton pipeline resources with the following command:
```
 kubectl create namespace todo-list-ci
```
Create a secret that the custom task uses to commit the updated Kubernetes YAML file to the configuration repository on GitHub. Since the step uses git command with no user interaction, I decided to use the GIT_ASKPASS environment variable to provide the PAT after reading this article as a quick and dirty solution. Create a new YAML file using the following template (see also tekton/config-git-askpass.yaml.template in the sample code repository) and your GitHub PAT previously created:
```
 apiVersion: v1 kind: Secret metadata:     name: config-git-askpass type: Opaque stringData:     .git-askpass: |         echo "### Your GitHub PAT goes here ###"
```
Then create the secret with the following command:
```
 kubectl apply -f config-git-askpass.yaml -n todo-list-ci
```
Create a secret that provides the kaniko task a Docker config to push the application container images to our DigitalOcean Container Registry. Note that the kaniko task expects the file name to be config.json, which is different from the default file names that Kubernetes expects. In any case, create a new YAML file using the following template (see also tekton/do-docker-config.yaml.template in the sample code repository) and your DigitalOcean PAT previously created:
```
 apiVersion: v1 kind: Secret metadata:     name: do-docker-config type: kubernetes.io/dockercfgjson data:     config.json: |         ### Your DigitalOcean PAT goes here ###
```
Then create the secret with the following command:
```
 kubectl apply -f do-docker-config.yaml -n todo-list-ci
```

Install the git-clone and kaniko tasks from the Tekton catalog with the following commands:

 kubectl apply -f https://raw.githubusercontent.com/tektoncd/catalog/main/task/git-clone/0.5/git-clone.yaml -n todo-list-ci kubectl apply -f https://raw.githubusercontent.com/tektoncd/catalog/main/task/kaniko/0.5/kaniko.yaml -n todo-list-ci

The pipeline requires a shared workspace to store the source code and configuration from Git. In the PipelineRun configuration, we need to provide a persistent volume claim (PVC) to provision the workspace. The YAML file is as follows (see also tekton/todo-list-ci-pvc.yaml in the sample code repository):
```
 apiVersion: v1 kind: PersistentVolumeClaim metadata:   name: todo-list-ci-pvc spec:   accessModes:     - ReadWriteOnce   resources:     requests:       storage: 5Gi
```
To create the PVC, run the following command:
```
 kubectl apply -f  todo-list-ci-pvc.yaml -n todo-list-ci
```

Installing the custom task resource

Since there is no readily available task to update the Kubernetes YAML file with new image versions, I had to create a custom task. Drawing inspiration from Sebastian Daschner's example, the resulting task resource is defined as follows (see also tekton/update-config-task.yaml in the sample code repository):

apiVersion: tekton.dev/v1beta1kind: Taskmetadata:  name: update-configspec:  params:  - name: buildVersion  - name: gitUrl  - name: gitBranch  - name: k8sYaml  - name: gitUserName  - name: gitUserEmail  workspaces:  - name: git-source  - name: git-askpass-secret  steps:  - name: git-checkout    image: alpine/git:v2.26.2    workingDir: "$(workspaces.git-source.path)"    script: |      #!/usr/bin/env sh      set -e      cp $(workspaces.git-askpass-secret.path)/.git-askpass ~/      chmod +x ~/.git-askpass      export GIT_ASKPASS=~/.git-askpass      rm -rf config      git clone -b $(inputs.params.gitBranch) $(inputs.params.gitUrl) config  - name: update-yaml    image: alpine/git:v2.26.2    workingDir: "$(workspaces.git-source.path)"    script: |      #!/usr/bin/env sh      set -e      cd config      sed -i "s#/todo-list-frontend:[a-zA-Z0-9.]\\+#/todo-list-frontend:$(inputs.params.buildVersion)#" $(inputs.params.k8sYaml)      sed -i "s#/todo-list-backend:[a-zA-Z0-9.]\\+#/todo-list-backend:$(inputs.params.buildVersion)#" $(inputs.params.k8sYaml)      cat $(inputs.params.k8sYaml)  - name: commit-push-changes-gitops    image: alpine/git:v2.26.2    workingDir: "$(workspaces.git-source.path)"    script: |      #!/usr/bin/env sh      set -e      cd config      cp $(workspaces.git-askpass-secret.path)/.git-askpass ~/      chmod +x ~/.git-askpass      export GIT_ASKPASS=~/.git-askpass      git config --global user.email "$(inputs.params.gitUserEmail)"      git config --global user.name "$(inputs.params.gitUserName)"      git add .      git commit --allow-empty -m "[tekton] Set deployment to version $(inputs.params.buildVersion)"      git push origin $(inputs.params.gitBranch)

As you can see, there is a common workspace used by all three steps, each of which runs a different container and script. The task is also using the secret that contains the GitHub PAT as a file that the GIT_ASKPASS environment variable refers to. As for the logic, the task first clones the configuration repository on GitHub, then uses sed to update the container image version in todo-list.yaml, and finally commits/pushes the change into Git.

Now that you understand how this custom task works, create it in your Kubernetes cluster with the following command:

kubectl apply -f  update-config-task.yaml -n todo-list-ci

Installing the pipeline resource

Lastly, we need to install the pipeline resource. Refer to the definition below (see also tekton/todo-list-ci-pipeline.yaml in the sample code repository):

apiVersion: tekton.dev/v1beta1kind: Pipelinemetadata:  name: todo-list-ci-pipelinespec:  workspaces:    - name: git-source    - name: do-docker-config-secret    - name: config-git-askpass-secret  params:    - name: buildVersion    - name: appGitUrl    - name: appGitBranch      default: main    - name: frontendPathToContext    - name: frontendImageUrl    - name: backendPathToContext    - name: backendImageUrl    - name: configGitUrl    - name: configGitBranch      default: main    - name: k8sYaml    - name: gitUserName    - name: gitUserEmail  tasks:    - name: clone-repo      taskRef:        name: git-clone      workspaces:        - name: output          workspace: git-source      params:        - name: url          value: "$(params.appGitUrl)"        - name: revision          value: "$(params.appGitBranch)"        - name: subdirectory          value: "app"        - name: deleteExisting          value: "true"    - name: frontend-build-image      taskRef:        name: kaniko      runAfter:        - clone-repo      workspaces:        - name: source          workspace: git-source        - name: dockerconfig          workspace: docker-config-secret      params:        - name: CONTEXT          value: app/$(params.frontendPathToContext)        - name: IMAGE          value: $(params.frontendImageUrl):$(params.buildVersion)    - name: backend-build-image      taskRef:        name: kaniko      runAfter:        - clone-repo      workspaces:        - name: source          workspace: git-source        - name: dockerconfig          workspace: do-docker-config-secret      params:        - name: CONTEXT          value: app/$(params.backendPathToContext)        - name: IMAGE          value: $(params.backendImageUrl):$(params.buildVersion)    - name: update-config      taskRef:        name: update-config      runAfter:        - frontend-build-image        - backend-build-image      workspaces:        - name: git-source          workspace: git-source        - name: git-askpass-secret          workspace: config-git-askpass-secret      params:        - name: buildVersion          value: "$(params.buildVersion)"        - name: gitUrl          value: "$(params.configGitUrl)"        - name: gitBranch          value: "$(params.configGitBranch)"        - name: k8sYaml          value: "$(params.k8sYaml)"        - name: gitUserName          value: "$(params.gitUserName)"        - name: gitUserEmail          value: "$(params.gitUserEmail)"

The definition is a tad long, but it performs the tasks as explained earlier. Let's create it with the following command:

kubectl apply -f  todo-list-ci-pipeline.yaml -n todo-list-ci

Now that our Tekton CI pipeline is ready to go, feel free to take a well-deserved break and digest all the information we've gone through so far.

Create an application in ArgoCD

At last, we need to create the CD pipeline to complete our solution. Luckily this is very simple to do in the ArgoCD UI. You could use a declarative setup to see full benefits, but let's take it easy for now. Follow the steps below to create the application:

Log in to the ArgoCD UI.
Click the New App button.
In the General section, enter an application name and select the default project.
Scroll down to the Source section and enter your configuration repository URL and todo-list for the path.
Scroll down to the Destination section and select https://kubernetes.default.svc as the cluster URL (this is the value for deploying to the same cluster in which ArgoCD is running) and enter todo-list as the namespace. Click the Create button to create the app.

The todo-list app is now created with the missing and OutOfSync status:

Don't sync it just yet because we haven't deployed the container images to the container registry! We will get the container images ready by running the CI pipeline.

Running the Tekton CI pipeline

There are two ways to run a pipeline in Tekton:

Creating a PipelineRun Kubernetes custom resource
Configuring a Tekton Trigger to automatically start pipeline runs

Triggers can be overwhelming for beginners, so we will just manually create a PipelineRun resource to start the pipeline for this exercise. Create a new YAML file using the following template (see also tekton/todo-list-ci-pipelinerun.yaml.template in the sample code repository) and replace the values with ones specific to your environment:

apiVersion: tekton.dev/v1beta1kind: PipelineRunmetadata:  generateName: todo-list-ci-spec:  pipelineRef:    name: todo-list-ci-pipeline  params:    - name: buildVersion      value: "0.0.1"    - name: appGitUrl      value: ### Your app Git repo URL goes here ###    - name: appGitBranch      value: ### Your app Git branch name goes here (typically main or master) ###    - name: frontendPathToContext      value: "todo-list-frontend"    - name: frontendImageUrl      value: "### Your DigitalOcean Container Registry URL goes here ###/todo-list-frontend"    - name: backendPathToContext      value: "todo-list-backend"    - name: backendImageUrl      value: ""### Your DigitalOcean Container Registry URL goes here ###/todo-list-backend"    - name: configGitUrl      value: ### Your config Git repo URL goes here ###    - name: configGitBranch      value: ### Your config Git branch name goes here (typically main or master) ###    - name: k8sYaml      value: "todo-list/todo-list.yaml"  serviceAccountName: default  workspaces:    - name: git-source      persistentVolumeClaim:        claimName: todo-list-ci-pvc    - name: do-docker-config-secret      secret:        secretName: do-docker-config    - name: config-git-askpass-secret      secret:        secretName: config-git-askpass

To start a new pipeline run, run the following command:

kubectl create -f  todo-list-ci-pipelinerun.yaml -n todo-list-ci

The command should complete with the PipelineRun ID similar to the following:

pipelinerun.tekton.dev/todo-list-ci-r2f94 created

Now, head over to the Tekton Dashboard in a web browser. Select PipelineRuns from the left menu and you will see that our pipeline is running!

Click the PipelineRun name and you will see the progress. In the screenshot below, I opened the build-and-push step under the frontend-build-image task, so we can see the progress of Kaniko building the image for the frontend application.

When this is done, verify in your configuration GitHub repository that a commit has been made to update the container image versions to 0.0.1. Next we can verify the deployment in ArgoCD!

Synchronizing the application in ArgoCD for the first time

Head over to the ArgoCD UI in the web browser. Then follow these steps to manually synchronize the application manifests, which will trigger a full deployment. Click the Sync button in the todo-list application tile to open the synchronize dialog, then click the Synchronize button to start the process.

Wait for the status to show the green Healthy and Synced statuses, then click the tile to see the deployment details. In the application details page, you will see a visualization of the Kubernetes resources for the application and other details. To open the application, click the 3rd button inside the todo-list ingress box as highlighted in the screenshot below:

Finally we get to see the fruit of our labor! Feel free to play around with the to-do list application to make sure that it is working.

Enabling auto-sync in ArgoCD to complete the CI/CD solution

As we have finally verified a deployment, we need to make one final change to complete our CI/CD solution. To truly enable GitOps and CD, we need to turn on auto-sync for the application in ArgoCD.

Back on the app details page in ArgoCD UI, click the App Details button to open the details dialog. In the Summary tab, scroll down to the Sync Policy section and click the Enable Auto-Sync button.

Then click the OK button when the prompt appears:

Lastly, we will trigger another run for the Tekton CI pipeline for the final verification. We could make some functional changes to the todo-list application, but to test auto-sync we can just update the image version to simulate an "application update". Edit the PipelineRun YAML file created earlier and change the buildVersion parameter value to, say, 0.0.2. Then run the following command to start a new pipeline run:

kubectl create -f  todo-list-ci-pipelinerun.yaml -n todo-list-ci

Check the pipeline progress in Tekton Dashboard. When the run completes, go back to ArgoCD UI and verify that it has automatically synchronize after on the Git changes. You may need to refresh the status in the UI to reflect the latest status. In the application details page, the current sync status now shows that version 0.0.2 is deployed:

Congratulations! You have successfully created a basic Kubernetes-native CI/CD solution using Tekton and ArgoCD on DigitalOcean!

After you have admired your creation and played with the environment more, don't forget to destroy the Kubernetes cluster in DigitalOcean and any leftover resources (volumes and load balancers for the ingress) to avoid unnecessary cost!

What's next?

In this challenge, I have only scratched the surface of Tekton and ArgoCD. With my remaining DigitalOcean credits available for another month, I would like to build upon the current solution with the following:

Implement trigger to truly complete the CI pipeline
Extend the current solution to multiple environments (dev, QA, prod) and evaluate various high availability features
Research on best practices and improve the current solution

Check back later for new posts as I explore further!

Resources

During the challenge, I found the following resources that helped me understand how to use the tools and solve unexpected issues. Credit goes to the authors and I hope this curated list helps you as well.

Kubernetes-Native Build & Release Pipelines with Tekton and ArgoCD by Adri Villela - A comprehensive series on end-to-end Kubernetes-native CI/CD, as recommended on the challenge page itself.
Tekton with ArgoCD by Sebastian Daschner - A great blueprint for CI/CD with Tekton and ArgoCD. It saved me time scripting the Git update part for this challenge.
ArgoCD Tutorial for Beginners | GitOps CD for Kubernetes by TechWorld with Nana - A well-explained video tutorial for ArgoCD on YouTube. She has a knack for explaining complex concepts and I am a big fan of her videos.
How to fix in Kubernetes Deleting a PVC stuck in status "Terminating" - This article helped me fix a problem where I could not delete a PVC and the associated volume.
The Vanilla DevOps Git Credentials & Private Packages Cheatsheet by AJ ONeal - This cheatsheet helped me choose the best Git credentials management strategy for this challenge.