
Using STNO for automating AWS Transit Gateway management

By Alex Bukharov on Sep 6, 2021

Preface

We are going to talk about automating AWS Transit Gateway routing with one of the most under-rated and under-used solutions from AWS: STNO. STNO stands for Serverless Transit Network Orchestrator, a name that should sound very descriptive to anyone who has configured AWS Transit Gateway, and especially to anyone who has tried automating that process.

In this blog we will identify the problems that come with automating Transit Gateway configuration and see why STNO is a good solution to them, although not one without a couple of shortcomings.

If you haven’t yet read the awesome blog post about Transit Gateway by Mahesh Rayas of Innablr, please do.

The problem

AWS Transit Gateway is a powerful concept for providing connectivity between networks of different types in AWS, be it a VPC, a Direct Connect gateway, a VPN gateway or anything else. AWS have done a great job of hiding the enormous complexity that comes with it under the bonnet, so you don’t have to deal with the intricacies of the underlying networking. But the things you still have to work with on a day-to-day basis are many, they are difficult to keep in your head all at once, and they are distributed all around your AWS landscape, which makes automating them a nightmare.

Let’s look at a typical TGW deployment. This diagram shows an interconnect between a Tenant VPC, a Shared Services VPC and a Direct Connect, which is what most people use TGW for (another good pattern is shared NAT, which we may talk about in my next blog).

TGW Deployment

You can see that the above diagram is a bit convoluted. Let’s go through what’s involved in letting that Tenant VPC communicate with the Shared Service and Direct Connect:

  1. Go to the VPC console in the Tenant account and attach the VPC to the TGW (assuming you’ve already accepted the RAM share of the TGW in the Tenant account). Don’t forget to select the subnets that you want on the attachment
  2. Go to the VPC console in Shared Services and put the Tenant VPC into associations on the CLIENT-FLAT TGW route table
  3. Add the Tenant VPC into propagations on the SHARED-SERVICES TGW route table
  4. Add the Tenant VPC into propagations on the DIRECT-CONNECT TGW route table
  5. Go back to the Tenant Account VPC console and include the Shared Services CIDR, as well as on-prem CIDRs into the subnet route tables for the appropriate subnets
  6. Make sure that Shared Services has the route to the Tenant VPC (you probably route Shared Services to your entire AWS IP range anyway)
  7. Make sure that Shared Services and Direct Connect are included into propagations on the CLIENT-FLAT TGW route table (this should be already done, unless you are connecting the first tenant)
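For a sense of what’s involved, steps 1 to 4 map onto a handful of EC2 API calls. Below is a rough boto3-style sketch, not a production script: the `connect_tenant` helper and the IDs are made up for illustration, and the EC2 client is passed in because in reality step 1 runs with tenant-account credentials while steps 2 to 4 run with network-account credentials.

```python
def connect_tenant(ec2, tgw_id, vpc_id, subnet_ids,
                   client_flat_rt, shared_services_rt, direct_connect_rt):
    """Hypothetical helper mirroring steps 1-4 above. In real life step 1 needs
    tenant-account credentials and steps 2-4 need network-account credentials."""
    # Step 1: attach the tenant VPC (and the chosen subnets) to the TGW
    att = ec2.create_transit_gateway_vpc_attachment(
        TransitGatewayId=tgw_id, VpcId=vpc_id, SubnetIds=subnet_ids)
    att_id = att["TransitGatewayVpcAttachment"]["TransitGatewayAttachmentId"]
    # Step 2: associate the attachment with the CLIENT-FLAT TGW route table
    ec2.associate_transit_gateway_route_table(
        TransitGatewayRouteTableId=client_flat_rt,
        TransitGatewayAttachmentId=att_id)
    # Steps 3-4: propagate the tenant VPC into the other TGW route tables
    for rt_id in (shared_services_rt, direct_connect_rt):
        ec2.enable_transit_gateway_route_table_propagation(
            TransitGatewayRouteTableId=rt_id,
            TransitGatewayAttachmentId=att_id)
    return att_id
```

Steps 5 to 7 are plain `create_route` calls on the subnet route tables and are left out. Note that even scripted, this still needs two different sets of credentials, which is exactly the cross-account pain discussed below.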

It’s a lot to take care of, isn’t it? Wait until you have to connect a hundred or two hundred tenant accounts and enjoy managing those TGW route tables. You will be desperate for some automation.

But you will soon find that it’s very difficult to properly automate this process end-to-end. The main reason is that the information about connectivity is split between the tenant and shared-services accounts. Implementing cross-account automation in AWS always poses an ideological problem around access and permissions, and the engineering poses a few interesting challenges as well: CloudFormation has poor support for sharing information across accounts, and Terraform makes the problem a bit easier to solve, but at a cost, and not by much.

Enter STNO

STNO solves this automation problem quite efficiently. It comes as two CloudFormation templates: the Hub and the Spoke. In the scenario above with Shared Services and a tenant, you would deploy the Hub into Shared Services, and turn the Spoke into a CloudFormation StackSet deployed uniformly to every tenant account. The Hub controls the TGW and the TGW route tables, and the Spoke monitors tagging events in the tenant account.

STNO subscribes to tagging of the VPC and the subnets via Amazon EventBridge, and when it detects the ‘magic’ tags being set or unset, it triggers a TGW attachment and the corresponding changes in the TGW and subnet route tables. See the diagram below (the official one from AWS; you can find more here)

STNO Architecture
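Conceptually, the spoke’s job starts with filtering tag-change events for the STNO tags. The sketch below is not STNO’s actual code, merely an illustration of the idea, assuming the shape of the EventBridge “Tag Change on Resource” event (source `aws.tag`, with a `changed-tag-keys` list in the detail).

```python
# Default STNO tag names (they are configurable at hub deployment time)
STNO_TAGS = {"Associate-with", "Propagate-to", "Attach-to-tgw"}

def is_stno_event(event):
    """True if a tag-change event touches any of the STNO 'magic' tags."""
    detail = event.get("detail", {})
    return (event.get("source") == "aws.tag"
            and bool(STNO_TAGS & set(detail.get("changed-tag-keys", []))))
```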

STNO also offers a web UI with Cognito authentication, where you can approve or deny attachments, if you so desire, and see what’s happening. It doesn’t require you to re-deploy the TGW or the route tables; you can point it at the existing ones, which lets you retrofit STNO into your existing infrastructure.

STNO is open source; you can find the code here.

How to deploy

AWS has an implementation guide that thoroughly covers multiple use-cases. You definitely need to read it to be able to design the implementation in a way that suits your needs.

Here is what I did to deploy STNO over an existing TGW installation:

  1. Download the Hub and Spoke Cloudformation templates from the Github repo above
  2. In the Hub template you will find the TGW route tables for STNO. You can either edit these route tables to match your requirements, or delete them completely if you are planning on using your existing ones
  3. Deploy the Hub into the account where your TGW is deployed. Pay attention to the following details:
    • The STNO hub lets you set IP prefixes or prefix lists, which it will put into subnet route tables. It’s one list for all tenants, so you will have to spend some time thinking about what goes in there. Alternatively, you can fall back to managing subnet route tables manually
    • You can set the names of the tags STNO will recognise (I personally don’t see any reason to; the defaults look fine)
    • You can set up manual approval for the TGW route tables you want. A use-case is when you don’t want everyone to be able to attach themselves to Direct Connect (remember that VPC tags are set in the tenant account)
  4. Set up a StackSet with the Spoke template and deploy instances into the tenant accounts
  5. You are ready to go
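The rollout above lends itself to scripting too. A rough boto3-style sketch follows; the stack and StackSet names and the `deploy_stno` helper are made up, and the CloudFormation client is passed in so you would hand it a client for the hub account.

```python
def deploy_stno(cfn, hub_template, spoke_template, tenant_accounts, region):
    """Hypothetical rollout script for the hub stack and the spoke StackSet."""
    # The Hub goes into the account that owns the TGW
    cfn.create_stack(StackName="stno-hub", TemplateBody=hub_template,
                     Capabilities=["CAPABILITY_NAMED_IAM"])
    # The Spoke becomes a StackSet with one instance per tenant account
    cfn.create_stack_set(StackSetName="stno-spoke", TemplateBody=spoke_template,
                         Capabilities=["CAPABILITY_NAMED_IAM"])
    cfn.create_stack_instances(StackSetName="stno-spoke",
                               Accounts=tenant_accounts, Regions=[region])
```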

STNO nicely integrates with various landing zone solutions, such as AWS Landing Zone, AWS Control Tower and potentially others. I personally implemented it within AWS Landing Zone and had a relatively smooth ride.

How to use

We are getting close to the most interesting part. Let’s revisit the use-case we used earlier to illustrate the complexity of the traditional TGW workflow: a tenant VPC needing access to Shared Services and Direct Connect:

  1. Open the VPC console in the Tenant account
  2. On the VPC set the following tags:
    • Associate-with: CLIENT-FLAT
    • Propagate-to: SHARED-SERVICES,DIRECT-CONNECT
  3. One-by-one open the subnets that you want to go into the attachment and on them set the following tags ("" meaning an empty value):
    • Attach-to-tgw: ""

That’s everything you need to do. STNO will take care of accepting the RAM share (if it needs accepting), the attachment, the associations and the propagations. Seriously, you don’t need to do anything else. Isn’t that an improvement?
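If you’d rather script the tagging than click through the console, the whole workflow boils down to two `create_tags` calls. A minimal sketch; the `request_attachment` helper is made up, and the tag names are the STNO defaults shown above.

```python
def request_attachment(ec2, vpc_id, subnet_ids):
    """Set the STNO 'magic' tags so the spoke triggers the TGW attachment."""
    # Tag the VPC with the association and propagation targets
    ec2.create_tags(Resources=[vpc_id], Tags=[
        {"Key": "Associate-with", "Value": "CLIENT-FLAT"},
        {"Key": "Propagate-to", "Value": "SHARED-SERVICES,DIRECT-CONNECT"},
    ])
    # An empty value is enough to pull a subnet into the attachment
    ec2.create_tags(Resources=subnet_ids,
                    Tags=[{"Key": "Attach-to-tgw", "Value": ""}])
```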

Shortcomings of STNO

Not everything is rosy, though. Over the course of using STNO I’ve run into a few shortcomings, one of which was particularly annoying:

  1. Attachments, associations and propagations all happen asynchronously in the background and are not under the control of CloudFormation (or whatever you use for IaC). That means you can’t refer to those attachments, associations or propagations from your CloudFormation stacks. When I hit this issue I ended up implementing a dirty hack that you can borrow:

    WaitTGWAttachmentCustomResourceLambda:
      Type: AWS::Lambda::Function
      Properties:
        Code:
          ZipFile: |
            const AWS = require('aws-sdk');
            const response = require('cfn-response');
            const ec2 = new AWS.EC2();

            // cfn-response ends the invocation itself, so this promise
            // deliberately never resolves
            function send(evt, ctx, status, data) {
                return new Promise(() => { response.send(evt, ctx, status, data) });
            }

            exports.handler = async function(event, context) {
                console.log('Got event: %j', event);
                const props = event.ResourceProperties;
                for (let i = 0; i < 10; i++) {
                    try {
                        console.log('Searching for TGW attachments by VPC ID %s...', props.VPCID);
                        const params = {Filters: [{Name: 'vpc-id', Values: [props.VPCID]}, {Name: 'transit-gateway-id', Values: [props.TGWID]}]};
                        const r = await ec2.describeTransitGatewayVpcAttachments(params).promise();
                        console.log('Got: %j, looking for subnets %j', r, props.SubnetIds);
                        if (props.SubnetIds.every(x => r.TransitGatewayVpcAttachments[0].SubnetIds.includes(x))) {
                            await send(event, context, response.SUCCESS);
                            return null;
                        }
                    } catch (err) {
                        console.log('Oops: %j', err);
                    }
                    await new Promise(resolve => setTimeout(resolve, 10000));
                }
                await send(event, context, response.FAILED, {errorMessage: `Attachment for VPC ${props.VPCID} subnets ${props.SubnetIds} not found`});
                return null;
            }
        Description: Waits until the STNO-created TGW attachment for the VPC appears
        Handler: index.handler
        MemorySize: 128
        Role: !GetAtt WaitTGWAttachmentCustomResourceLambdaRole.Arn
        Runtime: nodejs14.x
        Timeout: 900

    WaitTGWAttachment:
      Metadata:
        cfn-lint:
          config:
            ignore_checks:
              - W1001
      Type: Custom::WaitTGWAttachment
      Condition: IsSharedNAT
      Properties:
        ServiceToken: !GetAtt WaitTGWAttachmentCustomResourceLambda.Arn
        TGWID: !Ref TGWID
        VPCID: !Ref VPC
        SubnetIds:
          - !Ref PrivateSubnet1A
          - !Ref PrivateSubnet2A
          - !If [IsSharedNATAnd3AZ, !Ref PrivateSubnet3A, !Ref 'AWS::NoValue']
          - !If [IsSharedNATAnd4AZ, !Ref PrivateSubnet4A, !Ref 'AWS::NoValue']


    If you manage to find a better solution, please let me know; let’s make people’s lives easier together.

  2. There is no way to specify CIDRs or prefix lists individually for spoke VPCs to put into their subnet route tables; you can only configure one platform-wide CIDR list, or a list of prefix lists, for the entire system. Should you need to add extra routes to individual VPCs, you will have to rely on routing configuration outside STNO. See here

  3. STNO doesn’t support parallel operations
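For the second shortcoming, the workaround I know of is to manage the extra per-VPC routes outside STNO. A minimal sketch (the `add_extra_routes` helper name is made up):

```python
def add_extra_routes(ec2, route_table_ids, cidrs, tgw_id):
    """Point additional, per-VPC CIDRs at the TGW, outside STNO's global list."""
    for rt_id in route_table_ids:
        for cidr in cidrs:
            ec2.create_route(RouteTableId=rt_id,
                             DestinationCidrBlock=cidr,
                             TransitGatewayId=tgw_id)
```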

Conclusion

AWS Transit Gateway is a complicated device, and the way it’s implemented in AWS naturally requires a lot of competency from the engineers who deploy it. Worse, performing day-to-day tasks becomes a burden on the operator, and over time the complexity of the solution tends to spiral out of control.

STNO offers a neat way to ease that burden by hiding all this complexity under the bonnet. It’s a well-thought-out solution, even if not without a few wrinkles here and there. I’m pretty convinced that it adds a lot of value and is one of the most overlooked solutions from AWS.

If you have any questions or need a hand, please don’t hesitate to contact me on alex.bukharov@innablr.com.au.

Happy hacking, Alex.
