See: http://aws.amazon.com/ebs/
Amazon discloses pretty clearly that EBS volumes aren’t terribly durable. So in fairness, the limitation is documented and nobody is being misled.
An annual failure rate (AFR) of 0.1% – 0.5% per volume is mentioned. That means you should be prepared to lose a volume: it could happen. Backups / snapshots suffice only if you don’t mind losing whatever was written since the last backup, which is unlikely. If you do mind, use database replication to a separate availability zone or region.
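To put the quoted range in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes volume failures are independent and that the failure rate is constant over time, neither of which the EBS page promises; the function name and the fleet size of 100 volumes are made up for illustration.

```python
def p_any_loss(afr, volumes, years=1.0):
    """Chance that at least one of `volumes` is lost within `years`.

    Assumption: failures are independent and the annual failure
    rate `afr` (e.g. 0.001 for 0.1%) stays constant over time.
    """
    p_one_survives = (1.0 - afr) ** years
    return 1.0 - p_one_survives ** volumes


if __name__ == "__main__":
    # Low and high ends of the quoted range, for a hypothetical 100-volume fleet.
    for afr in (0.001, 0.005):
        print(f"AFR {afr:.1%}: {p_any_loss(afr, volumes=100):.1%} chance of losing a volume in a year")
```

Under those assumptions, a fleet of 100 volumes has somewhere between roughly a 10% and a 39% chance of losing at least one volume in a year, which is why "it could happen" deserves to be taken literally.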
An AFR of 0.1% – 0.5% is quite a bit lower than that of a single drive, but is pretty high compared to a high end SAN product’s probability of losing data. Of course, those can be extremely expensive so this is a bit apples and oranges.
Personally, for most problems I’d be quite comfortable using EBS as long as I’m continuously replicating my database(s) to a second region.
Basic notes on Java garbage collection that every system engineer dealing with Java in production should know.
Long long ago, at DoubleClick, we added a feature called “dot mode” to the DART ad server.
Basically it worked like this: on an ad request, if more than a certain number of concurrent threads were already active, the server returned a 1x1 clear gif (and did no computation or logging). That is, if we were backlogging, we didn’t serve an ad.
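The idea is easy to sketch. What follows is a minimal Python illustration of the same load-shedding pattern, not the DART implementation: the class name, the `max_active` parameter, the `serve_ad` callback and the `DOT` placeholder bytes are all assumptions made up for the example.

```python
import threading

# Stand-in for the 1x1 clear gif; a real server would return actual GIF bytes.
DOT = b"<1x1 clear gif>"


class DotModeServer:
    """Sheds load by answering with a dot when too many requests are in flight."""

    def __init__(self, serve_ad, max_active):
        self._serve_ad = serve_ad        # hypothetical callback doing the real work
        self._max_active = max_active    # hypothetical concurrency limit
        self._lock = threading.Lock()
        self._active = 0
        self.ads = 0
        self.dots = 0

    def handle(self, request):
        with self._lock:
            if self._active >= self._max_active:
                # Backlogged: count the dot and return immediately,
                # skipping ad selection and logging entirely.
                self.dots += 1
                return DOT
            self._active += 1
        try:
            response = self._serve_ad(request)
        finally:
            with self._lock:
                self._active -= 1
        with self._lock:
            self.ads += 1
        return response
```

The lock is held only while checking and updating the counters, never while the ad is being computed, so the check itself stays cheap even when the server is saturated. The `ads` and `dots` counters are what make the "watch the ratio" style of monitoring possible.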
One nuance with the above is that any load balancing system in front of the ad servers needs to know that a “dot” is an error.
Adding this little feature turned out to be a great move. It became very hard to kill a server with transient load. Further, we got statistics on how things were working: “This server served 20 million ads and 3 dots.” We could look at the ratio and infer things. The ops mentality shifted toward watching for dots instead of watching for complete failures.
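As a small illustration of the ratio idea, here is a Python sketch using the numbers from the example above. The function name is hypothetical, and what counts as a worrying ratio is something each team would have to decide for itself.

```python
def dot_ratio(ads, dots):
    """Fraction of requests answered with a dot instead of an ad."""
    total = ads + dots
    return dots / total if total else 0.0


if __name__ == "__main__":
    # "This server served 20 million ads and 3 dots."
    print(f"{dot_ratio(20_000_000, 3):.2e}")
```

A ratio that small says the server was almost never backlogged; a ratio that starts climbing is an early warning well before anything actually falls over.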