Nelz's Blog

Mah blogginess

Velocity Conference Recap

These are the notes I collected (and found interesting) during my day at the Velocity Conference.

General Themes

  • I had never heard of it before, but almost every single presenter referenced Ganglia as a de facto monitoring system.
  • It was presented in a bunch of different ways, but basically all the big sites that presented today use on/off (or dial-able) configurations for features. This is not just for release time of new features; it also helps them manage their capacity if something is going wrong.
  • Many of these talks are available online.

“Image Weight Loss Clinic” at Ignite Velocity

  • Stop using GIFs. Use PNGs.
  • Use data strippers/filters on JPGs. There is a lot of ‘extra’ data included in JPGs that isn’t necessary.
  • There are bunches of PNG optimizers out there. We should use at least one, if not all of them. (The suggestion was to build a serial pipeline for them.)
  • Using the Velocity page as an example, the presenter was able to reduce the page weight by 30% following these suggestions.
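
One way to read the "serial pipeline" suggestion: run each optimizer in turn and keep a result only when it is actually smaller. This is a minimal Python sketch, assuming each optimizer has been wrapped as a function from bytes to bytes. The two stand-in optimizers are hypothetical placeholders so the sketch runs on its own; they are not real PNG tools.

```python
from typing import Callable, Iterable

Optimizer = Callable[[bytes], bytes]

def run_pipeline(data: bytes, optimizers: Iterable[Optimizer]) -> bytes:
    """Feed the image through each optimizer in order, keeping only wins."""
    best = data
    for optimize in optimizers:
        candidate = optimize(best)
        # An optimizer can make a file bigger; ignore its output when it does.
        if len(candidate) < len(best):
            best = candidate
    return best

# Hypothetical stand-ins so the sketch runs without external tools.
def strip_trailing_padding(data: bytes) -> bytes:
    return data.rstrip(b"\x00")

def make_it_worse(data: bytes) -> bytes:
    return data + b"junk"

original = b"PNGDATA" + b"\x00" * 20
smaller = run_pipeline(original, [strip_trailing_padding, make_it_worse])
print(len(original), "->", len(smaller))
```

In a real build step each function would presumably shell out to an actual optimizer and read the file back; the keep-only-if-smaller check is the part that makes chaining several of them safe.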

The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search

  • A lot of people loved the empirical data showing that slower sites cost you users, even for differences as small as 200ms. Brady Forrest wrote up a great digest of this talk: “Bing and Google Agree: Slow Pages Lose Users”
  • The technique that I pulled outta the whole deal is to use HTTP/1.1 chunked transfer encoding. This enables a site to deliver the easy-to-compute stuff first (static header?), and the harder-to-compute stuff later.
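
To make the chunking idea concrete, here is a small Python sketch of the wire format: each chunk is its size in hex, a CRLF, the data, and another CRLF, and a zero-size chunk ends the body. The page content and the `render_page` name are made up for illustration; a real server would also send a `Transfer-Encoding: chunked` response header and flush each chunk as soon as it is ready.

```python
from typing import Iterable, Iterator

def encode_chunked(parts: Iterable[bytes]) -> Iterator[bytes]:
    """Frame each part as an HTTP/1.1 chunk: hex size, CRLF, data, CRLF."""
    for part in parts:
        if not part:
            continue  # a zero-length chunk would end the body early
        yield b"%x\r\n" % len(part) + part + b"\r\n"
    yield b"0\r\n\r\n"  # terminating chunk, no trailers

def render_page() -> Iterator[bytes]:
    # Hypothetical page: the header is cheap, the results are slow.
    yield b"<html><head><title>Search</title></head><body>"
    yield b"<ul><li>slow result</li></ul></body></html>"

body = b"".join(encode_chunked(render_page()))
```

Because `render_page` is a generator, the first chunk can go out before the slow part has been computed, which is the whole point of the technique.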

Fixing Twitter: Improving the Performance and Scalability of the World’s Most Popular Micro-blogging Site

  • Uses NTT America managed hosting
  • Put Google Analytics on 503 (Fail Whale) and 500 (Robot) pages. Use Google Analytics for failure metrics.
  • Configuration Management: Do it ASAP, early & often, ‘cuz you’re gonna need it eventually
  • Even their Ops stuff is checked into SVN, and they require code reviews on all their stuff, enforced by SVN pre-commit hooks and using Review Board
  • Send emails (or ANYTHING ELSE POSSIBLE) asynchronously
  • They recommend using “mkill”, which monitors for long-running queries and kills them, before the queries kill your site.
  • Instrument EVERYTHING for timing/performance.
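
As a rough idea of what "instrument everything" can look like in code, here is a Python sketch of a timing wrapper. The `timed` helper, the `timings` store, and the `render_timeline` label are all my own illustration, not anything Twitter showed; a real setup would ship the numbers to a metrics system instead of keeping them in a dict.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# label -> list of elapsed seconds (in-memory only, for the sketch)
timings = defaultdict(list)

@contextmanager
def timed(label: str):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label].append(time.perf_counter() - start)

with timed("render_timeline"):
    time.sleep(0.01)  # stand-in for real work

for label, samples in timings.items():
    print(f"{label}: n={len(samples)} max={max(samples):.4f}s")
```

Making it a one-line wrapper matters: if timing a block is cheap to add, it actually gets added everywhere.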

2 Years Later, Loving and Hating the Cloud

  • Presented by an engineer from Picnik. They run a hybrid (part-cloud, part managed) site.
  • Queues scale nicely in AWS. (I.e. if you are falling behind processing your queue, it is nigh trivial to just bring up another box to deal with the queue.)
  • In the cloud, you can plan for your average usage, and scale up/down as needed easily. (You don’t need to keep the 6th box at 1% utilization up, do you?)
  • Buy hardware in batches; it gives you flexibility. No scrambling if you need a new box when there are extras around. It also lets you wait for good deals as hardware prices fluctuate.
  • If you don’t have a good deletion plan on S3, it can end up costing you $$
  • Being in the cloud enables you to ignore the S3 space problem, operationally at least, until it is too expensive (leaving you opportunity to work on other low-hanging fruit)
  • Whereas you can get some nice SLAs when dealing within your own network, latency should be treated as a complete unknown in the cloud.
  • Be prepared for some difficult and juicy debugging when using the cloud.
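
The "just bring up another box" point about queues comes down to simple arithmetic. Here is a back-of-the-envelope Python sketch; the function name and every number in it are hypothetical, not figures from the talk.

```python
import math

def workers_needed(backlog: int, jobs_per_worker_per_min: float,
                   arrivals_per_min: float, drain_minutes: float) -> int:
    """Boxes needed to absorb new arrivals and clear the backlog in time."""
    required_rate = arrivals_per_min + backlog / drain_minutes
    return max(1, math.ceil(required_rate / jobs_per_worker_per_min))

# 3000 jobs behind, 120 new jobs/min, each box handles 50 jobs/min,
# and we want to be caught up in 10 minutes.
print(workers_needed(backlog=3000, jobs_per_worker_per_min=50,
                     arrivals_per_min=120, drain_minutes=10))
```

Once the backlog is gone, the same formula with `backlog=0` says how many boxes can be shut down again, which is the scale-down half of the argument.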

Page Speed

  • Twitter is a fantastic feedback mechanism, more so than Google Groups / wikis / forums (me: lower barrier for commentary?)
  • Page Speed is a browser tool, like Firebug
  • Someone (on Twitter?) made a very apt comment that it’s kinda sad to see Google (Page Speed) and Yahoo! (YSlow) shepherding similar projects, without trying to combine them.

10+ Deploys Per Day: Dev and Ops Cooperation at Flickr

  • Websites pretty much always ship trunk. Versions and point releases are vestigial remnants from old shrink-wrapped product lifecycles.
  • “Dark launches”, where you use the on/off/variable conditionals to exercise the new backend before it becomes mission/feature-critical.
  • Have all deployments notify IRC/IM/Twitter (to internal teams only) so EVERYONE knows what’s going on. Also, keep it around w/timestamps, and make it searchable
  • Give ALL developers (at least read-only) access to the prod machines. It helps them help you (Ops) better.
  • If there is an outage, EVERYONE stops working on new work. Even if they aren’t directly responsible, junior engineers should be working to understand why something is broken. This is a good time for them to learn these diagnostic skills.
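
The on/off/variable conditionals behind dark launches can be sketched in a few lines of Python. This is only my guess at the general shape, not Flickr's implementation: the flag names, the `FLAGS` table, and the hashing scheme are all assumptions.

```python
import hashlib

# Hypothetical flag table: 0 = off, 100 = on for everyone, in between = dial.
FLAGS = {"new_photo_backend": 10, "related_tags": 100, "slideshow": 0}

def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in 0..99 for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag: str, user_id: str) -> bool:
    percent = FLAGS.get(flag, 0)  # unknown flags default to off
    return bucket(user_id, flag) < percent

def handle_request(user_id: str) -> str:
    if is_enabled("new_photo_backend", user_id):
        return "new backend"
    return "old backend"
```

Hashing the user id keeps each user in the same bucket from request to request, so turning the dial from 10 to 20 only adds users to the new path. Setting a value to 0 is the "off" switch when something goes wrong.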

Scaling for the Expected and Unexpected

  • ‘Planned Degradation’ - switch off functionality to lighten the load on the back end
  • If you hit high (un)expected load, it is usually on a single page or a few pages. Route to a static copy of that page, and regenerate it every X minutes.
  • Putting a proxy server (like Squid/Varnish) between your appServer and the outside world helps even if it is not caching: the appServer is just delivering to a network neighbor, which reduces its thread pool contention. This simple fact can have a great positive effect on your server performance.
  • Watch out for cache stampedes.
  • 3rd Party Resources - Load last, place at bottom of page, in an iframe. If sales doesn’t like it, tell them to go to hell. (Me: whoa.)
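
The static-copy and cache-stampede bullets fit together: when the static copy expires, you want exactly one request to regenerate it while everyone else keeps getting the old one. Here is a minimal single-process Python sketch of that idea. The class name, `render_home`, and the 5-minute TTL are my own choices, not something from the talk.

```python
import threading
import time
from typing import Callable, Optional

class StaticCopy:
    """Serve a cached page; let only one caller at a time regenerate it."""

    def __init__(self, render: Callable[[], str], ttl_seconds: float):
        self._render = render
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._page: Optional[str] = None
        self._expires = 0.0

    def get(self) -> str:
        if self._page is not None and time.monotonic() < self._expires:
            return self._page  # fresh copy, no backend work
        # Stale or empty: try to become the single regenerator.
        # Only block when there is no copy at all to fall back on.
        if self._lock.acquire(blocking=self._page is None):
            try:
                if self._page is None or time.monotonic() >= self._expires:
                    self._page = self._render()
                    self._expires = time.monotonic() + self._ttl
            finally:
                self._lock.release()
        # Callers that lost the race serve the stale copy instead of piling on.
        return self._page

calls = []

def render_home() -> str:
    calls.append(1)  # count how often the expensive render runs
    return f"<html>rendered {len(calls)} time(s)</html>"

home = StaticCopy(render_home, ttl_seconds=300)  # X = 5 minutes, arbitrary
home.get()
home.get()
```

With several app servers you would need the lock to live somewhere shared, but the principle is the same: expiry should trigger one regeneration, not one per waiting request.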

Infrastructure in the Cloud Era

  • (Me: There were some great slides here, I hope they post them publicly.)
  • With provisioning becoming so quick (minutes), we need a faster way to get these provisioned machines up and running to realize those benefits.
  • The real benefit of the cloud is not $$, it is TIME (which you can turn into $$).
  • Definition - meatcloud: the humans that run your cloud presence. Noticeably difficult and slow to provision a new resource in your meatcloud.
  • A bit of operational philosophy - once you get your provisioning/setup all automated and quick, if a service is misbehaving have a bias towards killing it and recreating an instance, rather than trying to ‘recover’ the problem box.
  • When you’ve got Command & Control systems in place, they also need an on/off switch, because sometimes you *do* need to do some manual stuff.

Ajax Performance

  • Modify your nodes before you attach them to the DOM. Modifying them after can trigger cascading re-parsing by the browser.
  • While most languages have Optimizers, JavaScript doesn’t. You should remove your own common subexpressions / loop invariants / etc.
  • Prefer “[array, of, strings].join('')” over “array + of + strings” because the “+” operator builds lots of spurious intermediate objects. (Note the empty-string argument: a bare join() puts commas between the pieces.)
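
The last two bullets are about JavaScript, but the shape of both fixes is language-independent. Here is a rough Python analog, not the presenter's code; the function names are made up for illustration.

```python
import math

def scale_slow(values, width, height):
    out = []
    for v in values:
        # math.sqrt(width * height) never changes, yet is recomputed every pass
        out.append(v / math.sqrt(width * height))
    return out

def scale_fast(values, width, height):
    norm = math.sqrt(width * height)  # loop invariant hoisted out by hand
    return [v / norm for v in values]

def build_row(cells):
    # Collect the pieces and join once instead of growing a string with "+"
    return "".join(["<tr>"] + [f"<td>{c}</td>" for c in cells] + ["</tr>"])
```

Both versions of `scale_*` return the same list; the second just does the square root once.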

Building OpenDNS Stats

  • This talk was about a High-Write environment, which isn’t as applicable to Gallery, but is applicable to our Metrics app
  • Don’t use auto-increment in a high-write environment, as it does a table_lock

Load Balancing Roundup

  • This presenter is a committer for Perlbal and was very up-front that this talk would be heavy on the praise for it.
  • Graphs that look at load every 30 seconds or more DON’T give you enough info about load on your server. Presenter suggests you watch “top -d 0.5” for a while to get an idea of your server’s load.
  • Presenter and audience agreed that HAProxy doesn’t work with “keep-alive”
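
To see why a 30-second graph can hide trouble, here is a tiny Python illustration with invented load numbers. It assumes the graph plots the average over each 30-second window, which is my assumption about the graphing tool, not something the presenter specified.

```python
from statistics import mean

# Hypothetical load readings taken every 0.5 s over one 30 s window (60 samples).
samples = [0.4] * 60
samples[20:24] = [9.0, 12.0, 11.0, 8.0]  # a two-second spike

coarse_view = mean(samples)   # what a 30-second averaged graph would plot
fine_peak = max(samples)      # what watching at 0.5 s would reveal

print(f"30 s average: {coarse_view:.2f}, peak: {fine_peak:.1f}")
```

The window average stays close to 1 while the box briefly hit 12, which is the kind of spike you only catch by watching at sub-second resolution.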