Nelz's Blog

Mah blogginess

Velocity 2014 - Day 3 - Thursday

Thursday Keynote


Scott Hanselman (Microsoft) – JavaScript and the new Virtual Machine

  • Video
  • Very funny talk!
  • Nelz idea/analogy – single OS : cell :: the cloud : organ
    • Basically: “The Cloud” is just a VERY LARGE OS with all the systems geographically displaced

Johan Stiennon (Sauce Labs) – Test Driven Mobile Development with Appium, Just Like Selenium

  • Video
  • Appium – “Selenium for Apps” – OSS
  • Built on Node.js
  • The in-Xcode automation isn’t Jenkins-able
  • Neither is the Android version
  • Can write tests in many languages

Joshua Marantz (Google) – Top 10 Lessons Learned Building PageSpeed and trying to Make The Web Fast

Hala Al-Adwan – (Edgecast) – Web Performance, Why it Really Matters

  • Video
  • 2013 – 66% of all traffic was video
  • (Edgecast likes to misuse memes for their slides… groan)
  • This talk wasn’t a contribution to the art & craft of web ops, it’s all marketing
  • She dropped a little hint of a project “GhostFish”. There is little to no information about such a project, except for a Velocity presentation… Sounds like a “ShadowFilter” that I wrote before.

Lara Swanson (Etsy) – Mobile Web is Not (Just) a Technical Challenge

  • Video / Slides
  • Good talk about using CARROTS to change culture, rather than STICKS

Lightning Demos

  • Radware
    • Video
    • Marketing… $eyeroll
    • Video
    • Pretty cool OSS tool
    • Command-line
    • Outputs result in accepted unit test formats
    • Built on top of many other OSS projects
  • Fiddler
    • Video
    • Web debugging proxy
    • OSS project
    • Eric got hired by Telerik to provide Fiddler-as-a-Service
    • MS-based project, so OS X is shaky… But can run well on Linux
    • Works well as a proxy, but not all devices allow proxy
    • Can import from packet-capture files
    • Can plug-in other web perf tools

James Colgan (Rackspace) – Building Self-Adaptive Autonomous Infrastructure with an Advanced Monitoring Architecture

  • Video
  • Take the people out of the scaling lifecycle
  • Super high level with almost zero specifics

Ernest Mueller (Copperegg) – A 5 Minute Checklist for Application Monitoring

  • Video / PPT
  • Number 1 Priority: Service Performance and Uptime
  • “Lean Monitoring” – pretty cool that he didn’t just plug his company
  • Which order you should be establishing monitoring in

Patrick Lightbody (New Relic) – Software Analytics for Performance Nerds

“Performance and Maintainability with Continuous Experimentations”



Client-side Feature Flags (YAY!!)

“Continuous Experimentation”

Small changes – Validate/Invalidate hypotheses

NB: Built into their experimentation results tool is a measure of whether a result is “statistically significant”

control vs. test, number of hits, % difference
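To make the “statistically significant” idea concrete for myself, here is a minimal sketch of a two-proportion z-test over control vs. test numbers. This is my own illustration, not the tool shown in the talk; the notion of “conversions” per hit and the function name are assumptions.

```python
import math


def two_proportion_z_test(control_hits, control_conversions,
                          test_hits, test_conversions):
    """Compare control vs. test conversion rates (hypothetical helper).

    Returns (% difference relative to control, z-score, two-sided p-value).
    """
    p_control = control_conversions / control_hits
    p_test = test_conversions / test_hits
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (control_conversions + test_conversions) / (control_hits + test_hits)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_hits + 1 / test_hits))
    if std_err == 0 or p_control == 0:
        raise ValueError("not enough variation in the data to run the test")
    z = (p_test - p_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    pct_diff = (p_test - p_control) / p_control * 100
    return pct_diff, z, p_value
```

A result is usually called significant when the p-value falls below a threshold chosen in advance (0.05 is a common convention, not something stated in the talk).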

Not just A/B, but multi-variants (aka A/B/C/…)
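A rough sketch of how multi-variant assignment could work: hash the user and experiment name into a stable bucket, so the same user always sees the same variant. This is an assumption on my part, not how the speaker's system does it; `assign_variant` and its parameters are made-up names.

```python
import hashlib
from typing import Sequence


def assign_variant(user_id: str, experiment: str, variants: Sequence[str]) -> str:
    """Deterministically map a user to one of N variants (hypothetical helper)."""
    if not variants:
        raise ValueError("at least one variant is required")
    # Hashing experiment + user keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Usage would look like `assign_variant("user-42", "new-checkout", ["control", "B", "C"])`; the identifiers here are placeholders.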

Tombstone technique (see previous mention) in JS is available, but not in CSS

“Cognitive overhead” can be considered a performance problem

Design for “Throwaway-ability”

Static analysis is available on JS, too… And “unused selectors” on CSS.

“Testing Your Mobile App for Real-World Network Conditions”

Slides (zip/ppt)

Network aware image loading
  • Do apps, like web, have access to 4G/3G/2G information?
  • Offer/grab lower-res (smaller) images when on the different networks
Network Aware: Latency
  • Get closer to customers
  • CDN
  • AT&T has Public IP Range document
  • Surround your code with measurements, use that info for conditional behavior
  • If things are slow, prefetch earlier
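The “surround your code with measurements” idea could be sketched roughly like this. It is my own illustration, not code from the slides; the class, the function names, and the 0.5 second threshold are all assumptions.

```python
import time
from collections import deque


class LatencyTracker:
    """Keeps a rolling window of recent request durations (hypothetical helper)."""

    def __init__(self, window=20):
        self.samples = deque(maxlen=window)

    def record(self, seconds):
        self.samples.append(seconds)

    def timed(self, fn, *args, **kwargs):
        """Run fn and record how long it took, even if it raises."""
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            self.record(time.monotonic() - start)

    def average(self):
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)


def choose_image_variant(avg_latency, slow_threshold=0.5):
    """Pick a lower-res image when recent requests have been slow."""
    if avg_latency is None:
        return "high"  # no measurements yet: assume a good network
    return "low" if avg_latency > slow_threshold else "high"


def should_prefetch_early(avg_latency, slow_threshold=0.5):
    """On a slow network, start fetching the next content sooner."""
    return avg_latency is not None and avg_latency > slow_threshold
```

The point is only that the measurement and the conditional behavior live next to each other; real apps would tune the window and threshold against their own traffic.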

Roaming – If customer is roaming, offer a ‘lighter’ site



AWS –> FB Servers

Managed the migration, which resulted in good vibes on both the IG and FB sides

Year 1: Everything is on fire!
  • bad queries
  • inadequate caching
  • etc…
Year 2: Now in order
  • Django
  • Ubuntu
  • Postgres
  • Memcached
  • Redis
Fabric for all rollouts, few AWS-specific
  • S3
  • ELB
  • AutoScaling
  • AMIs for provisioning
  • ec2-tags for organization

AWS => rapid prototyping

AWS => Bad I/O patterns

Year 3: Chef & distributed
  • most systems horizontally scaled
  • One AMI, dozens of Chef recipes

First integration: using FB’s anti-spam system, new functionality for IG

How can we get the benefits of years of work from months of integration?

  • Simplification
  • VPC
  • IP Shenanigans were hard
  • Into FB
  • team connections far more important than network connections
  • embed FB prod engineers into IG team
  • substantially different chef environment
  • No major changes despite having punched holes in the wall

“Delivering Optimal Images for Phone and Tablets on the Modern Web”

Balancing UX vs Ease of Development

Game Plan
  • Reduce bandwidth and latency, particularly on small mobile devices
  • Deliver high-res image to modern tablets & laptops
  • Maximize system performance using CDNs and proxy-caches
  • Site owners should not need to be web tech experts
  • Easy: <img src=...>, but is not going to make all of your users happy
Changing Landscape: WebP, JPEG-XR, SVG
  • Fewer bytes is generally better
  • modulo decompression cost
  • modulo memory of decompressed image
  • Goal is to render above-the-fold page in as few bytes as possible
  • preferably within 1 CWND of data (~15k)
Changing Landscape: CSS
  • CSS pixel != device pixel
Changing Landscape: HTML Image
  • <picture> (coming)
  • src-N (considered dead)
  • X-Clients-Hints (???)
Partial Solution
  • <img src="..." srcset="..." />
  • unimplemented browsers default to original image
  • might go wonky with JS enhancements to image

mod_pagespeed –> Apache/nginx plugin

WebP images are so small, you might even consider bringing them over encoded as Base64
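A minimal sketch of what inlining an image as Base64 might look like, using a data URI. The function names are mine; the talk did not show code. Note that Base64 makes the payload about a third larger, so this only pays off for very small images.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/webp") -> str:
    """Encode raw image bytes as a data URI suitable for an <img src>."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def base64_length(raw_length: int) -> int:
    """Length of the padded Base64 text for raw_length input bytes."""
    return 4 * ((raw_length + 2) // 3)
```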

IE does some weird stuff, so there are options in mod_pagespeed to turn optimizations off for IE

EC2 strips User-Agent & Accept headers – must serve distinct URLs for webp
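One way to read “serve distinct URLs”: decide at page-generation time which image URL to emit, so that anything in between never has to look at request headers. This is only my guess at the shape of it, not how mod_pagespeed does it; the helper and the `.webp` naming scheme are assumptions.

```python
def image_url_for(path: str, accept_header: str) -> str:
    """Return a WebP URL when the client advertises WebP support (hypothetical)."""
    # Strip quality parameters such as ";q=0.8" before comparing media types.
    accepted = [part.split(";")[0].strip().lower()
                for part in accept_header.split(",")]
    convertible = path.lower().endswith((".jpg", ".jpeg", ".png"))
    if "image/webp" in accepted and convertible:
        stem = path.rsplit(".", 1)[0]
        return stem + ".webp"
    return path
```

Because the WebP and non-WebP versions live at different URLs, a cache keyed only on the URL cannot hand a WebP image to a browser that cannot decode it.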

“Operational Costs of Technical Debt”


Tech debt accumulates from a series of small choices

Infrastructure becomes technical debt by focusing on shiny new features

Past decisions become debt unless they are updated to reflect new realities

“One in a million” happens multiple times per hour, or even per minute…
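The arithmetic behind that, as a quick sketch. The traffic figures are illustrative numbers I picked, not ones from the talk.

```python
def expected_events_per_hour(requests_per_second, probability=1e-6):
    """Expected count of a rare event per hour at a given request rate."""
    return requests_per_second * 3600 * probability


# Illustrative traffic levels (assumed, not from the talk).
for rps in (1_000, 100_000):
    per_hour = expected_events_per_hour(rps)
    print(f"{rps} req/s: {per_hour:.1f} per hour, {per_hour / 60:.2f} per minute")
```

At 1,000 requests per second a one-in-a-million event shows up about 3.6 times an hour; at 100,000 requests per second it is about 6 times a minute.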

System outages and errors increase

Prove out that not addressing debt == $$

Teams develop workarounds and procedures that are worse than the original problem

Tech Debt devalues ops in favor of new feature dev

“Broken window” theory

Supporting zombies leads to finger-pointing and avoidance

Tech Debt leads to demoralization

How to balance retiring technical debt vs other choices

Measure the right things
  • Time and Effort to Repair
  • (Mean hides long tail info)
  • frequency/severity/reach
  • error rates
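A small sketch of why the mean hides long-tail info for time-to-repair. The incident durations are made-up example data and the helper is my own, not something from the talk.

```python
import math
import statistics


def summarize_repair_times(minutes):
    """Summarize repair durations with mean, median, p95 and max (hypothetical)."""
    if not minutes:
        raise ValueError("need at least one repair time")
    ordered = sorted(minutes)
    # Nearest-rank percentile: smallest value with at least 95% of data at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Made-up data: nine quick fixes and one ten-hour outage.
summary = summarize_repair_times([10] * 9 + [600])
```

With this data the mean is 69 minutes, which describes neither the typical 10-minute fix nor the 600-minute outage; the median and the tail percentiles tell the two stories separately.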

Evaluate all the costs: either to fix or to tolerate

Indirect costs are hard to evaluate

Make active decisions

Moving beyond the debt crisis