These are the notes I collected (and found interesting) during my day at the Velocity Conference.
General Themes
- I had never heard of it before, but almost every single presenter referenced Ganglia as a de-facto monitoring system.
- It got presented a bunch of different ways, but basically all the big sites that presented today use on/off (or dial-able) configurations for features. This is not just for release time of new features; it can also help them manage their capacity if something is going wrong.
- Many of these talks are available online: http://velocityconference.blip.tv/
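
A minimal sketch of the on/off/dial-able idea, in Python. The flag names, the percentages, and the hashing scheme are all assumptions made for illustration, not anything a presenter showed. Each feature gets a rollout percentage, and each user hashes to a stable bucket so they don't flip between on and off from request to request.

```python
import hashlib

# Hypothetical flag table: each feature maps to a rollout percentage (0-100).
FLAGS = {
    "new_search": 100,    # fully on
    "photo_uploads": 25,  # dialed to roughly a quarter of users
    "comments": 0,        # switched off, e.g. to shed load
}


def bucket(feature: str, user_id: str) -> int:
    """Map (feature, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature: str, user_id: str, flags=FLAGS) -> bool:
    """True if the feature is dialed up far enough to include this user."""
    percent = flags.get(feature, 0)  # unknown features default to off
    return bucket(feature, user_id) < percent
```

Turning the dial down to 0 is the "off switch" for capacity trouble; turning it up gradually is the release-time use.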

“Image Weight Loss Clinic” at Ignite Velocity
- Stop using GIFs. Use PNGs.
- Use data strippers/filters on JPGs. There is a lot of ‘extra’ data included in JPGs that isn’t necessary.
- There are bunches of PNG optimizers out there. We should use at least one, if not all of them. (The suggestion was to build a serial pipeline for them.)
- Using the Velocity page as an example, the presenter was able to reduce the page weight by 30% following these suggestions.
The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search
- A lot of people loved the empirical data showing that slower sites cost you users, even for differences as small as 200ms. Brady Forrest wrote up a great digest of this talk: “Bing and Google Agree: Slow Pages Lose Users”.
- The technique that I pulled outta the whole deal is to use HTTP 1.1 chunked transfer encoding. This enables a site to deliver the easy-to-compute stuff first (static header?), and the harder-to-compute stuff later.
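
To make the chunking idea concrete, here is a rough Python sketch of how a chunked response body is framed on the wire: each chunk is its length in hex, a CRLF, the data, and another CRLF, with a zero-length chunk to finish. The `render_page` generator and its HTML are hypothetical stand-ins; a real app would normally let its web server or framework do the framing.

```python
from typing import Iterable, Iterator


def chunked(parts: Iterable[bytes]) -> Iterator[bytes]:
    """Frame each part as an HTTP/1.1 chunk: hex length, CRLF, data, CRLF."""
    for part in parts:
        if not part:
            continue  # a zero-length chunk would end the body early
        yield b"%x\r\n%s\r\n" % (len(part), part)
    yield b"0\r\n\r\n"  # terminating chunk, no trailers


def render_page() -> Iterator[bytes]:
    """Hypothetical page: cheap static header first, expensive part later."""
    yield b"<html><head><title>Search</title></head><body>"
    yield b"<ul><li>slow result</li></ul>"  # stand-in for slow backend work
    yield b"</body></html>"


if __name__ == "__main__":
    for frame in chunked(render_page()):
        print(frame)
```

Because the header goes out as its own chunk, the browser can start fetching CSS and scripts while the server is still computing the rest.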
Fixing Twitter: Improving the Performance and Scalability of the World’s Most Popular Micro-blogging Site
- Uses NTT America managed hosting
- Put Google Analytics on 503 (Fail Whale) and 500 (Robot) pages. Use Google Analytics for failure metrics.
- Configuration Management: Do it ASAP, early & often, ‘cuz you’re gonna need it eventually
- Even their Ops stuff is checked into SVN, and they require code reviews on all their stuff, enforced by SVN pre-commit hooks and using Review Board
- Send emails (or ANYTHING ELSE POSSIBLE) asynchronously
- They recommend using “mkill”, which monitors for long-running queries and kills them, before the queries kill your site.
- Instrument EVERYTHING for timing/performance.
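
As one way to read “instrument everything”, here is a small Python sketch of a timing decorator. This is not Twitter's tooling; the `TIMINGS` store, the `timed` decorator, and `render_timeline` are hypothetical names, and a real site would ship the numbers to a metrics system rather than keep them in memory.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-process store of durations (seconds), keyed by metric name.
TIMINGS = defaultdict(list)


def timed(name):
    """Record how long each call to the wrapped function takes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures get timed too.
                TIMINGS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("render_timeline")
def render_timeline(user_id):
    # Stand-in for real work.
    return f"timeline for {user_id}"
```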
2 Years Later, Loving and Hating the Cloud
- Presented by an engineer from Picnik. They run a hybrid (part-cloud, part managed) site.
- Queues scale nicely in AWS. (I.e. if you are falling behind processing your queue, it is nigh trivial to just bring up another box to deal with the queue.)
- In the cloud, you can plan for your average usage, and scale up/down as needed easily. (You don’t need to keep the 6th box at 1% utilization up, do you?)
- Buy hardware in batches, it gives you flexibility. No scrambling if you need a new box when there are extras around. It also lets you wait for good deals as hardware prices fluctuate.
- If you don’t have a good deletion plan on S3, it can end up costing you $$
- Being in the cloud enables you to ignore the S3 space problem, operationally at least, until it is too expensive (leaving you opportunity to work on other low-hanging fruit)
- Whereas you can get some nice SLAs when dealing within your own network, latency should be treated as a complete unknown in the cloud.
- Be prepared for some difficult and juicy debugging when using the cloud.
Page Speed
- Twitter is a fantastic feedback mechanism, more so than Google Groups / wikis / forums (me: lower barrier for commentary?)
- It is a browser tool, like Firebug
- Someone (on Twitter?) made a very apt comment that it’s kinda sad to see Google (Page Speed) and Yahoo! (YSlow) shepherding similar projects, without trying to combine them.
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
- Websites pretty much always ship trunk. Versions and point releases are vestigial remnants of old shrink-wrapped product lifecycles.
- “Dark launches”, where you use the on/off/variable conditionals to exercise the new backend before it becomes mission/feature-critical.
- Have all deployments notify IRC/IM/Twitter (to internal teams only) so EVERYONE knows what’s going on. Also, keep it around w/timestamps, and make it searchable
- Give ALL developers (at least read-only) access to the prod machines. It helps them help you (Ops) better.
- If there is an outage, EVERYONE stops working on new work. Even if they aren’t directly responsible, junior engineers should be working to understand why something is broken. This is a good time for them to learn these diagnostic skills.
- AUTOMATE your INFRASTRUCTURE!!
Scaling for the Expected and Unexpected
- ‘Planned Degradation’ – switch off functionality, this can lighten the load on the back end
- If you hit high (un)expected load, it is usually on a single page or a few pages. Route to a static copy of that page, and regenerate it every X minutes.
- Simply putting a proxy server between your appServer and the outside world, even if it is not a caching one (like Squid/Varnish), means the appServer is just delivering to a network neighbor, which reduces its thread pool contention. This simple fact can have a great positive effect on your server performance.
- Watch out for cache stampedes.
- 3rd Party Resources – Load last, place at bottom of page, in an iframe. If sales doesn’t like it, tell them to go to hell. (Me: whoa.)
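
A cache stampede is what happens when a hot entry expires and every waiting request tries to regenerate it at once. Below is a minimal Python sketch of one common guard, a per-key lock with a re-check after acquiring it. The `StampedeSafeCache` class and its parameters are hypothetical and not from the talk; it also covers the "static copy, regenerate every X minutes" idea via the TTL.

```python
import threading
import time


class StampedeSafeCache:
    """In-memory cache where only one caller regenerates an expired key."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)
        self._locks = {}    # key -> lock guarding regeneration of that key
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def get(self, key, regenerate):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            # Re-check: another thread may have refilled the key while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = regenerate()
            self._entries[key] = (value, self._clock() + self._ttl)
            return value
```

Usage would look like `cache.get("/hot-page", render_hot_page)`, where `render_hot_page` is whatever expensive function builds the page. This only protects a single process; with several app servers you would need a shared lock or a similar trick in the shared cache.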
Infrastructure in the Cloud Era
- (Me: There were some great slides here, I hope they post them publicly.)
- With provisioning becoming so quick (minutes), we need an equally quick way to get these provisioned machines configured and running to realize those benefits.
- The real benefit of the cloud is not $$, it is TIME (which you can turn into $$).
- Definition – meatcloud: the humans that run your cloud presence. Noticeably difficult and slow to provision a new resource in your meatcloud.
- A bit of operational philosophy – once you get your provisioning/setup all automated and quick, if a service is misbehaving have a bias towards killing it and recreating an instance, rather than trying to ‘recover’ the problem box.
- When you’ve got Command & Control systems in place, they also need an on/off switch, because sometimes you *do* need to do some manual stuff.
Ajax Performance
- Modify your nodes before you attach them to the DOM. Modifying them after can trigger cascading re-parsing by the browser.
- While most languages have Optimizers, JavaScript doesn’t. You should remove your own common subexpressions / loop invariants / etc.
- Prefer “[array, of, strings].join('')” over “array + of + strings” because the “+” operator builds lots of spurious interstitial objects.
Building OpenDNS Stats
- This talk was about a High-Write environment, which isn’t as applicable to Gallery, but is applicable to our Metrics app
- Don’t use auto-increment in a high-write environment, as it does a table_lock
Load Balancing Roundup
- This presenter is a committer for Perlbal and was very up-front that this talk would be heavy on the praise for it.
- Graphs that look at load every 30 seconds or more DON’T give you enough info about load on your server. Presenter suggests you watch “top -d 0.5” for a while to get an idea of your server’s load.
- Presenter and audience agreed that HAProxy doesn’t work with “keep-alive”
Happy to see you got so much from my image clinic talk, I was afraid it was going too fast.
Loved your prankster talk and the photos on the slides were really funny. And you started talking about the cloudy thing on the wrong slide deck was one of the highlights of the evening
Comment by Stoyan — 26 June 2009 @ 13:53