We are going to talk about automating AWS Transit Gateway routing using one of the most underrated and underused solutions from AWS: STNO. STNO stands for Serverless Transit Network Orchestrator, a name that should sound very descriptive to anyone who has configured AWS Transit Gateway, and especially to anyone who has tried automating that process.
In this blog we will try to identify the problems connected to automating Transit Gateway configuration and see why STNO is a good solution to those problems, although not without a couple of shortcomings.
If you haven’t yet read an awesome blog post about Transit Gateway from Mahesh Rayas of Innablr, please do.
AWS Transit Gateway is a powerful concept for providing connectivity between networks of different types in AWS, be it a VPC, a Direct Connect Gateway, a VPN Gateway or any other. AWS have done a great job of hiding the enormous complexity that comes with that under the bonnet, so you don’t have to deal with all the intricacies of the associated networking. Still, there are many things you have to work with on a day-to-day basis. They are difficult to keep in one’s head all at the same time, and they are distributed all around your AWS landscape, which makes automating them a nightmare.
Let’s look at a typical TGW deployment. This diagram shows an interconnect between a Tenant VPC, a Shared Services VPC and a Direct Connect, which is what most people use TGW for (another good pattern is shared NAT, which we may talk about in my next blog).
You can see that the above diagram is a bit convoluted. Let’s go through what’s involved in letting that Tenant VPC communicate with the Shared Service and Direct Connect:
It’s a lot to take care of, isn’t it? Wait until you have to connect up a hundred or two tenant accounts and manage those TGW route tables by hand. You will desperately want some automation.
But soon you will find that it’s very difficult to properly automate this process end-to-end. The main reason is that the information about connectivity is split between the tenant and shared services accounts. Implementing cross-account automation in AWS always poses an ideological problem about access and permissions. Engineering poses a few interesting challenges as well. CloudFormation has poor support for sharing information cross-account. Terraform makes this problem a bit easier to solve, but at a cost, and not by much.
STNO solves this automation problem in quite an efficient way. It comes as two CloudFormation templates: the Hub and the Spoke. In the above scenario with Shared Services and Tenant, you would deploy the Hub into Shared Services, and turn the Spoke into a CloudFormation StackSet that you deploy uniformly to every tenant account. The Hub controls the TGW and the TGW route tables, and the Spoke monitors tagging events in the tenant account.
STNO subscribes to tagging of the VPC and the subnets via Amazon EventBridge, and when it detects the ‘magic’ tags being set or unset, it triggers a TGW attachment and the corresponding changes in the TGW and subnet route tables. See the diagram below (official from AWS, more you can find here)
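To make the tag-driven trigger more concrete, here is a minimal sketch of the kind of filter such a subscription implies. It assumes the standard EventBridge "Tag Change on Resource" event emitted by `aws.tag`; the function name `isStnoTagEvent` is hypothetical, and the rule STNO actually deploys may match events differently.

```javascript
// Tag keys STNO reacts to (the same keys used later in this post).
const STNO_TAG_KEYS = ['Associate-with', 'Propagate-to', 'Attach-to-tgw'];

// Hypothetical helper, not part of STNO: returns true when an EventBridge
// "Tag Change on Resource" event touches an STNO tag on a VPC or a subnet.
function isStnoTagEvent(event) {
  const detail = event.detail || {};
  if (event.source !== 'aws.tag') return false;
  if (event['detail-type'] !== 'Tag Change on Resource') return false;
  if (detail.service !== 'ec2') return false;
  if (!['vpc', 'subnet'].includes(detail['resource-type'])) return false;
  const changed = detail['changed-tag-keys'] || [];
  return changed.some(key => STNO_TAG_KEYS.includes(key));
}

// Example event, trimmed to the fields the filter reads.
const sampleEvent = {
  source: 'aws.tag',
  'detail-type': 'Tag Change on Resource',
  detail: {
    service: 'ec2',
    'resource-type': 'subnet',
    'changed-tag-keys': ['Attach-to-tgw'],
  },
};
console.log(isStnoTagEvent(sampleEvent));
```

In a real deployment this filtering would normally live in the EventBridge rule pattern rather than in code; the function form is only meant to show which fields matter.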
STNO also offers a web UI with Cognito authentication, where you can approve or deny attachments, if you so desire, and see what’s happening. It doesn’t require you to re-deploy the TGW or the route tables. You can make it use the existing ones, which allows you to retrofit STNO into your existing infrastructure.
STNO is open source; you can find the code here.
AWS has an implementation guide that thoroughly covers multiple use cases. You definitely need to read it to be able to design the implementation in a way that suits your needs.
Here is what I did to deploy STNO over an existing TGW installation:
STNO nicely integrates with various landing zone solutions, such as AWS Landing Zone, AWS Control Tower and potentially others. I personally implemented it within AWS Landing Zone and had a relatively smooth ride.
We are getting close to the most interesting part. Let’s revisit the use case we used to discuss the complexity of the traditional TGW workflow, namely a tenant VPC that needs access to Shared Services and Direct Connect:
Tag the VPC:
Associate-with: CLIENT-FLAT
Propagate-to: SHARED-SERVICES,DIRECT-CONNECT
Tag the subnets to be attached ("" meaning an empty value):
Attach-to-tgw: ""
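As a rough sketch, the same tagging can be scripted. The helper below only builds the parameter objects for the EC2 `CreateTags` API; the function name `buildStnoTagRequests` and the resource IDs are made up for illustration, while the tag keys and values are the ones listed above.

```javascript
// Hypothetical helper: builds EC2 CreateTags parameters for the STNO tags.
// One request tags the VPC, the other tags the subnets to be attached.
function buildStnoTagRequests(vpcId, subnetIds, associateWith, propagateTo) {
  return [
    {
      Resources: [vpcId],
      Tags: [
        { Key: 'Associate-with', Value: associateWith },
        // STNO takes the route tables to propagate to as a comma-separated list.
        { Key: 'Propagate-to', Value: propagateTo.join(',') },
      ],
    },
    {
      Resources: subnetIds,
      // Only the presence of the key matters here, so the value stays empty.
      Tags: [{ Key: 'Attach-to-tgw', Value: '' }],
    },
  ];
}

// Placeholder IDs, for illustration only.
const requests = buildStnoTagRequests(
  'vpc-0123456789abcdef0',
  ['subnet-0aaaaaaaaaaaaaaaa', 'subnet-0bbbbbbbbbbbbbbbb'],
  'CLIENT-FLAT',
  ['SHARED-SERVICES', 'DIRECT-CONNECT']
);
console.log(JSON.stringify(requests, null, 2));
```

With the AWS SDK for JavaScript v2, each element could then be passed to `ec2.createTags(params).promise()`. In a CloudFormation-managed VPC you would more likely set the same tags directly on the `AWS::EC2::VPC` and `AWS::EC2::Subnet` resources.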
That’s everything you need to do. STNO will take care of accepting the RAM share (if it needs to be accepted), the attachments, the associations and the propagations. Seriously, you don’t need to do anything else. Isn’t it an improvement!?
Not everything is roses and petals. Over the course of using STNO I’ve run into a few shortcomings, one of which was particularly annoying:
Attachments, associations and propagations all happen asynchronously in the background and are not under the control of CloudFormation (or whatever you use for IaC). It means that you can’t refer to those attachments, associations or propagations from your CloudFormation stacks. When I hit this issue I ended up implementing a dirty hack that you can borrow:
  WaitTGWAttachmentCustomResourceLambda:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        ZipFile: |
          const AWS = require('aws-sdk');
          const response = require('cfn-response');
          const ec2 = new AWS.EC2();
          // cfn-response calls context.done() once the reply has been sent,
          // which ends the invocation, so this promise is never resolved on purpose.
          function send(evt, ctx, status, data) {
            return new Promise(() => { response.send(evt, ctx, status, data) });
          }
          exports.handler = async function(event, context) {
            console.log('Got event: %j', event);
            const props = event.ResourceProperties;
            // Nothing to wait for on delete: reply straight away so that
            // stack deletion is not blocked by a missing attachment.
            if (event.RequestType === 'Delete') {
              await send(event, context, response.SUCCESS);
              return null;
            }
            // Poll up to 10 times, 10 seconds apart.
            for (let i = 0; i < 10; i++) {
              try {
                console.log('Searching for TGW attachments by VPC ID %s...', props.VPCID);
                const params = {
                  Filters: [
                    {Name: 'vpc-id', Values: [props.VPCID]},
                    {Name: 'transit-gateway-id', Values: [props.TGWID]}
                  ]
                };
                const r = await ec2.describeTransitGatewayVpcAttachments(params).promise();
                console.log('Got: %j, looking for subnets %j', r, props.SubnetIds);
                // The list is empty until STNO has created the attachment.
                const attachment = r.TransitGatewayVpcAttachments[0];
                if (attachment && props.SubnetIds.every(x => attachment.SubnetIds.includes(x))) {
                  await send(event, context, response.SUCCESS);
                  return null;
                }
              } catch (err) {
                console.log('Oops: %j', err);
              }
              await new Promise(resolve => setTimeout(resolve, 10000));
            }
            await send(event, context, response.FAILED, {errorMessage: `Attachment for VPC ${props.VPCID} subnets ${props.SubnetIds} not found`});
            return null;
          }
      Description: Waits for the STNO-managed Transit Gateway VPC attachment
      Handler: index.handler
      MemorySize: 128
      Role: !GetAtt WaitTGWAttachmentCustomResourceLambdaRole.Arn
      # The bundled aws-sdk (v2) used above is not included in nodejs18.x and later runtimes
      Runtime: nodejs14.x
      Timeout: 900
  WaitTGWAttachment:
    Metadata:
      cfn-lint:
        config:
          ignore_checks:
            - W1001
    Type: Custom::WaitTGWAttachment
    Condition: IsSharedNAT
    Properties:
      ServiceToken: !GetAtt WaitTGWAttachmentCustomResourceLambda.Arn
      TGWID: !Ref TGWID
      VPCID: !Ref VPC
      SubnetIds:
        - !Ref PrivateSubnet1A
        - !Ref PrivateSubnet2A
        - !If [IsSharedNAT&3AZ, !Ref PrivateSubnet3A, !Ref 'AWS::NoValue']
        - !If [IsSharedNAT&4AZ, !Ref PrivateSubnet4A, !Ref 'AWS::NoValue']
If you manage to find a better solution, please let me know, let’s make people’s lives easier together.
There is no way to specify CIDRs or prefix lists for individual spoke VPCs to put in their subnet route tables. You can only configure one platform-wide CIDR list, or a list of prefix lists, for the entire system. If you need additional routes in individual VPCs, you will have to rely on routing configuration outside STNO. See here
STNO doesn’t support parallel operations
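For the per-VPC routes limitation mentioned above, routing configuration outside STNO could look roughly like the sketch below. It builds parameters for the EC2 `CreateRoute` API that point extra CIDRs at the TGW. The helper name `buildExtraTgwRoutes` and all IDs and CIDRs are hypothetical.

```javascript
// Hypothetical helper: builds EC2 CreateRoute parameters that send extra,
// VPC-specific CIDRs to the Transit Gateway from each given subnet route table.
function buildExtraTgwRoutes(routeTableIds, cidrs, transitGatewayId) {
  const routes = [];
  for (const routeTableId of routeTableIds) {
    for (const cidr of cidrs) {
      routes.push({
        RouteTableId: routeTableId,
        DestinationCidrBlock: cidr,
        TransitGatewayId: transitGatewayId,
      });
    }
  }
  return routes;
}

// Placeholder values, for illustration only.
const extraRoutes = buildExtraTgwRoutes(
  ['rtb-0aaaaaaaaaaaaaaaa', 'rtb-0bbbbbbbbbbbbbbbb'],
  ['10.50.0.0/16'],
  'tgw-0123456789abcdef0'
);
console.log(JSON.stringify(extraRoutes, null, 2));
```

With the AWS SDK for JavaScript v2, each element could be passed to `ec2.createRoute(params).promise()`. Such routes presumably only make sense once the VPC attachment exists, which brings back the waiting problem described earlier.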
AWS Transit Gateway is a complicated device, and the way it’s implemented in AWS naturally demands a lot of competency from the engineers who deploy it. What’s worse, day-to-day tasks become a burden on the operator, and over time the complexity of the solution tends to get out of control.
STNO offers a neat way to ease that burden by hiding all this complexity under the bonnet. It’s a well thought-through solution, even though not without a few wrinkles here and there. I’m pretty convinced that it adds a lot of value and is one of the most overlooked solutions from AWS.
If you have any questions or need a hand, please don’t hesitate to contact me on alex.bukharov@innablr.com.au.
Happy hacking, Alex.