The Challenge
In my first year supporting a complex ArcGIS platform hosted on AWS, the system was functional but accumulating risk. Security posture had gaps. Costs were higher than they needed to be. The Terraform codebase had grown in ways that made individual changes unnecessarily risky, because too many components were coupled together. And the caching approach, while working, was creating inefficiencies that were compounding over time.
None of this was a crisis. But left unaddressed, each problem would only get harder to fix.
What I Found
A thorough review of the full architecture - multiple VPCs, auto-scaled compute layers, shared storage, and database components - made the improvement opportunities clear.
On security: there was no web application firewall in place, HTTPS was not consistently enforced across services, and SSH access was still being used where a more controlled, auditable mechanism was available.
On maintainability: the Terraform code was structured monolithically. Changing one component meant touching, and potentially destabilising, others. For a platform this size, that was a real deployment risk.
On cost and performance: resources were not right-sized, and the file-based caching approach was creating I/O overhead that a database-backed model could eliminate.
How I Approached It
I prioritised changes that were high-impact and low-risk - improvements that could be made without requiring escalation or significant change management overhead, because they were clearly aligned with best practice.
Security came first. I introduced WAF protections, enforced HTTPS across all services, and replaced SSH access with Systems Manager, improving both the security posture and the audit trail for access events.
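An added example, not code from the platform described here: a Python check over the JSON produced by `terraform show -json` that flags the three gaps listed above (SSH ingress, HTTP listeners that do not redirect to HTTPS, application load balancers with no WAF association). The sample state, resource addresses and ARNs are constructed, and the check assumes the load balancers and WAF associations are managed in the same state.

```python
# Added example (not the platform's own code): audit Terraform state JSON.
def resources(module):
    """Yield resources from a module and, recursively, its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from resources(child)

def audit(state):
    """Return (address, finding) pairs for SSH ingress, plain HTTP, missing WAF."""
    res = list(resources(state["values"]["root_module"]))
    waf_protected = {r["values"]["resource_arn"] for r in res
                     if r["type"] == "aws_wafv2_web_acl_association"}
    findings = []
    for r in res:
        v = r["values"]
        if r["type"] == "aws_security_group":
            for rule in v.get("ingress", []):
                # protocol "-1" means all traffic, so it covers port 22 too
                if rule["protocol"] == "-1" or (rule["protocol"] == "tcp"
                        and rule["from_port"] <= 22 <= rule["to_port"]):
                    findings.append((r["address"], "SSH ingress allowed"))
        elif r["type"] == "aws_lb_listener" and v.get("protocol") == "HTTP":
            to_https = [a for a in v.get("default_action", [])
                        if a["type"] == "redirect"
                        and a["redirect"][0]["protocol"] == "HTTPS"]
            if not to_https:
                findings.append((r["address"], "HTTP listener without HTTPS redirect"))
        elif r["type"] == "aws_lb" and v.get("load_balancer_type") == "application":
            if v["arn"] not in waf_protected:
                findings.append((r["address"], "no WAF web ACL association"))
    return findings

# Constructed sample state: addresses and ARNs are placeholders, not real values.
SAMPLE_STATE = {"values": {"root_module": {"resources": [
    {"address": "aws_lb.web", "type": "aws_lb",
     "values": {"arn": "arn:example:alb/web", "load_balancer_type": "application"}},
    {"address": "aws_lb_listener.web_http", "type": "aws_lb_listener",
     "values": {"protocol": "HTTP", "default_action": [{"type": "forward", "redirect": []}]}},
    {"address": "aws_lb.portal", "type": "aws_lb",
     "values": {"arn": "arn:example:alb/portal", "load_balancer_type": "application"}},
    {"address": "aws_lb_listener.portal_http", "type": "aws_lb_listener",
     "values": {"protocol": "HTTP", "default_action": [
         {"type": "redirect", "redirect": [{"protocol": "HTTPS", "status_code": "HTTP_301"}]}]}},
    {"address": "aws_wafv2_web_acl_association.portal", "type": "aws_wafv2_web_acl_association",
     "values": {"resource_arn": "arn:example:alb/portal"}},
], "child_modules": [{"address": "module.compute", "resources": [
    {"address": "module.compute.aws_security_group.app", "type": "aws_security_group",
     "values": {"ingress": [{"protocol": "tcp", "from_port": 22, "to_port": 22}]}}]}]}}}
if __name__ == "__main__": print(*audit(SAMPLE_STATE), sep="\n")
```

Because the walk recurses into `child_modules`, the same check keeps working once the codebase is split into modules.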
I then modularised the Terraform codebase so that individual components could be updated and deployed independently. This reduced the blast radius of any single change and made the infrastructure significantly easier to reason about and maintain.
For cost and performance, I right-sized resources based on actual usage patterns, and redesigned the caching layer by moving from a file-based approach to a PostgreSQL-backed model. This improved efficiency and reduced the I/O load on the underlying infrastructure.
What Changed
Annual platform costs dropped from approximately $120,000 to around $65,000. System performance improved by roughly 30%. The platform became materially more secure, easier to maintain, and better positioned for future changes.
These were not theoretical improvements. They translated directly into reduced operational overhead, lower financial exposure, and a platform the team could evolve with confidence rather than caution.
Lessons for Enterprise Cloud Platform Management
The best time to improve a platform is before it is in crisis. Functional but accumulating risk is the most common state for mature cloud platforms, and it is also the hardest state to act on because there is no urgency forcing a decision. The discipline is in recognising that cost, security and maintainability debt are real costs, just deferred ones.
Modularity is a risk management strategy. Tightly coupled infrastructure is not just harder to maintain. It is harder to change safely. Modularising the Terraform codebase did not just improve the developer experience; it reduced the risk profile of every future change. That compound benefit justifies the upfront effort.
Security improvements do not have to be disruptive. WAF, HTTPS enforcement and Systems Manager access are all well-understood patterns. Implementing them does not require a major project. It requires prioritisation and sequencing. If a platform is missing these controls, the question is usually why they have not been done, not how to do them.
Performance and cost are often the same problem. The caching redesign improved both metrics simultaneously. Over-provisioned resources, inefficient I/O patterns and poor architectural choices tend to show up as both performance issues and cost inefficiencies. Treating them as the same problem, rather than separate workstreams, leads to better solutions with less effort.