Unexpected Costs

Scaling starts in the mind

A tale of unexpected costs

Part 1: Disclaimer

To start off: this is a tale entirely of my own incompetence. It has nothing to do with AWS, poor communication, or anything like that. It's a warning for people who don't read the docs and then proceed to deploy a prototype into production. I'll continue to use AWS cloud infrastructure, just a few dollars of learning expenses wiser.

Part 2: Move fast…

This all happened in the context of deploying my DynaPict service. With the core functionality barely working, and with the goal of getting a (to non-technical users) presentable system off the launchpad fast, I set my sights on deploying the mess. It was less painful than expected, although getting Node.js to behave the way I wanted felt a bit like a bodge…

Then it was live! My baby, many hours of work, right there on the internet! I felt amazing. With feature updates rolling out at a steady rate, all was going smoothly. Until…

Part 3: RTFM.

I got an email. “AWS billing alert”. Fuck.

OK, panic sets in. At least it's not a huge cost overrun; the alert threshold was set low enough not to bankrupt me. Crisis averted, for now. Next up: finding where the money went. Instance costs are nominal, and there are no ghost services running that I forgot to kill. The search continues.

Turns out I didn't RTFM, and the thing that cost me was multiple gigabytes of AWS outbound traffic a day. The graph correlated neatly with the point at which the client had started to use the service for nontrivial work. I quickly realized what had happened: the image files served up by my service were too big for the browser cache on the client and were re-downloaded every time. Hundreds if not thousands of uncompressed, high-quality image files. A quick call with the relevant parties later, the service went into hibernation and the dev into tinkering.
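For a sense of scale, here's the kind of back-of-envelope math I should have done up front. The numbers below are purely illustrative assumptions, not my actual figures, and the egress price is just the usual AWS internet-egress ballpark:

```typescript
// Back-of-envelope egress estimate (illustrative numbers, not my real ones).
const images = 500;              // high-quality photos served by the service
const avgSizeMB = 8;             // uncompressed files are easily this big
const reloadsPerDay = 3;         // cache misses mean every view is a re-download
const egressPricePerGB = 0.09;   // rough AWS internet egress price, USD

const dailyGB = (images * avgSizeMB * reloadsPerDay) / 1024;
const monthlyCost = dailyGB * 30 * egressPricePerGB;

console.log(`~${dailyGB.toFixed(1)} GB/day`);       // ~11.7 GB/day
console.log(`~$${monthlyCost.toFixed(2)}/month`);   // ~$31.64/month
```

Not bankruptcy money, but very real money for a hobby deployment that was supposed to cost next to nothing.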

Part 4: One doesn’t simply production an alpha!

Alright, the service is down again; time for an informal one-person RCA. The next morning confirmed that it was indeed caused by DynaPict: there were no new charges after the shutdown. OK, so that narrows it down to asset delivery, but how to fix this mess? I've got gigabytes of encrypted user data on my server; time to get creative.

Scaling starts in the mind

(Sorry for the cheesy section name, but I'm being serious here.) This is obvious to any seasoned dev working on big software, but for me, managing my first-ever deployment, it was a realization that hit too late. Being able to scale your components takes more than just assuming they'll cope with more load. It requires thinking about all data flows, not only those internal to your architecture but also those to and from the environment. If I had done that, I might have realized that serving up ultra-HD pictures 24/7 could get expensive.

Thinking about possible solutions, I realized there was a constraint I had overlooked: the display device is an old tablet with a screen resolution of (not that much) by (a few pixels at best). Time to exploit that.

After minimal testing I realized that I could force all files to be JPEGs with a maximum dimension of 1200px at 60% quality. The pixel mush looks horrible in my dev environment, but the end user doesn't/didn't really notice the difference.
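For the curious, the squishing itself is nearly a one-liner with an image library. Here's a minimal sketch using sharp; my actual pipeline differs in the details, and the paths are made up:

```typescript
import sharp from "sharp";

// Downscale to at most 1200px on the longer edge and re-encode as JPEG at 60% quality.
// "fit: inside" preserves aspect ratio; withoutEnlargement keeps small images untouched.
async function squish(inputPath: string, outputPath: string): Promise<void> {
  await sharp(inputPath)
    .resize(1200, 1200, { fit: "inside", withoutEnlargement: true })
    .jpeg({ quality: 60 })
    .toFile(outputPath);
}

// Example: squish("raw/photo-001.png", "compressed/photo-001.jpg");
```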

That's how I could, at least in theory, squish my storage from a few gigabytes down to a manageable number of megabytes and cut outbound traffic by a factor of yes. With permission from the rights holder, and after confirming on some reference images that it worked, I did a test run: decrypting the storage, squishing the files, and encrypting them again. Now my storage bucket is almost empty again and my networking bill has disappeared. Yay for "fixing it in production".

TL;DR

If you deploy anything anywhere you have to pay for it: check all possible sources of cost before going live. I did not do that, and paid for the learning experience. Don't do the same.