web ops

Web infrastructure operations and scaling tips and discussion

Mon, May 2nd
On some SSDs, “smartctl -A” reports a Media_Wearout_Indicator attribute, which you can use to track flash wear.
Fri, Apr 29th

EBS Durability

See: http://aws.amazon.com/ebs/

Amazon discloses fairly clearly that EBS volumes aren’t terribly durable, so in fairness, nobody is being misled.

Amazon cites an annual failure rate (AFR) of 0.1% – 0.5% per volume. That means you should be prepared to lose a volume: it could happen. Backups or snapshots suffice only if you can tolerate losing everything written since the last backup, which is unlikely. If you can’t, replicate the database to a separate availability zone or region.

An AFR of 0.1% – 0.5% is quite a bit lower than that of a single drive, but it is pretty high compared to the probability of losing data on a high-end SAN. Of course, those can be extremely expensive, so the comparison is a bit apples to oranges.
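
To make the AFR numbers concrete, here is a rough sketch of the chance of losing at least one volume in a year as the number of volumes grows. It assumes volumes fail independently and at a constant rate, which is a simplification of mine and not something Amazon states; the function name is made up for illustration.

```python
def prob_any_volume_lost(afr: float, volumes: int) -> float:
    """Chance that at least one of `volumes` volumes fails within a year.

    Assumes each volume fails independently with probability `afr` per year.
    """
    if not 0.0 <= afr <= 1.0:
        raise ValueError("afr must be a probability between 0 and 1")
    if volumes < 0:
        raise ValueError("volumes must be non-negative")
    # Probability that every single volume survives the year.
    all_survive = (1.0 - afr) ** volumes
    return 1.0 - all_survive


if __name__ == "__main__":
    # The two ends of the disclosed range: 0.1% and 0.5%.
    for afr in (0.001, 0.005):
        for volumes in (1, 10, 100):
            p = prob_any_volume_lost(afr, volumes)
            print(f"AFR {afr:.1%}, {volumes:3d} volumes: {p:.2%}")
```

Under these assumptions, a fleet of 100 volumes at the 0.5% end has roughly a 39% chance of losing at least one volume in a year, which is why replication matters more as you grow.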

Personally, for most problems I’d be quite comfortable using EBS as long as I’m continuously replicating my database(s) to a second region.

Mon, Apr 12th

Java GC Basics

Basic notes on Java garbage collection that every system engineer dealing with Java in production should know.

  • Always run with GC logging enabled (e.g. -verbose:gc).
  • Watch for promotion failures in the GC log.  If you are seeing any, the heap is fragmented; restart the process.
  • GC time seems to be proportional to the size of the heap.  GCs will be slow with a 10GB heap.
  • The CMS collector runs mostly concurrently with the application, but it is VERY slow.  To use it, your application may have to do a lot of manual object pooling.
  • 10% or 20% of time spent in GC is reasonable.  50% is not.  99% means you are out of memory.
  • Be sure to use -server so that multiple cores are used for full GCs.
  • -Xmx (maximum heap size) is the really important parameter.  -Xms (initial heap size) not so much.
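
The rule of thumb about time spent in GC can be turned into a tiny check. This is only a sketch: the function names are made up, you would have to get the GC seconds and wall-clock seconds from your own logs or monitoring, and the cutoffs between the stated points (20%, 50%, 99%) are my assumption.

```python
def gc_fraction(gc_seconds: float, wall_seconds: float) -> float:
    """Fraction of wall-clock time spent in garbage collection."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return gc_seconds / wall_seconds


def classify_gc_fraction(fraction: float) -> str:
    """Map a GC time fraction onto the rule of thumb above.

    The notes only name 10-20%, 50% and 99%; treating everything
    between 20% and 50% as "borderline" is an assumption.
    """
    if fraction >= 0.99:
        return "out of memory"
    if fraction >= 0.50:
        return "not reasonable"
    if fraction <= 0.20:
        return "reasonable"
    return "borderline"


if __name__ == "__main__":
    # Example: 90 seconds of GC observed over a 10 minute window.
    print(classify_gc_fraction(gc_fraction(90.0, 600.0)))
```
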
Sat, Apr 10th

Graceful Degradation

Long, long ago at DoubleClick, we added a feature called “dot mode” to the DART ad server.

Basically it worked like this: on an ad request, if more than a certain number of concurrent threads were already active, return a 1x1 clear GIF (and do no computation or logging).  That is, if we are backlogging, don’t serve an ad.

One nuance with the above is that any load balancing system in front of the ad servers needs to know that a “dot” is an error.

Adding this little feature turned out to be a great move.  It became very hard to kill a server with transient load.  Further, we got statistics on how things were working: “This server served 20 million ads and 3 dots.”  We could look at the ratio and infer things.  The ops mentality shifted toward watching for dots rather than for complete failures.
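
For illustration, the idea can be sketched in a few lines. This is not the DART code, just a minimal reconstruction under my own assumptions: the class name, the `max_active` limit, and the `DOT` placeholder (standing in for the bytes of a real 1x1 clear GIF) are all hypothetical.

```python
import threading

# Placeholder standing in for the bytes of a 1x1 clear GIF (hypothetical).
DOT = b"DOT"


class DotModeGate:
    """Serve a cheap "dot" instead of an ad when too many requests are in flight."""

    def __init__(self, max_active: int):
        self.max_active = max_active  # limit on requests being served at once
        self._lock = threading.Lock()
        self._active = 0
        self.ads_served = 0
        self.dots_served = 0

    def handle(self, serve_ad):
        """Call `serve_ad()` unless the server is backlogged; then return DOT."""
        with self._lock:
            if self._active >= self.max_active:
                # Backlogged: do no real work, just count the dot.
                self.dots_served += 1
                return DOT
            self._active += 1
        try:
            ad = serve_ad()
        finally:
            # Always release the slot, even if serving the ad raised.
            with self._lock:
                self._active -= 1
        with self._lock:
            self.ads_served += 1
        return ad


if __name__ == "__main__":
    gate = DotModeGate(max_active=100)
    gate.handle(lambda: b"an ad")
    print(f"served {gate.ads_served} ads and {gate.dots_served} dots")
```

The two counters are what make the “20 million ads and 3 dots” style of report possible. In a real deployment the dot response would also need some marker the load balancer can recognize as an error, as noted above.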
