Nelz's Blog

Mah blogginess

Velocity 2014 - Day 2 - Wednesday

Wednesday Keynote

Jeff Dean (Google) – Achieving Rapid Response Times in Large Online Services

  • Video
  • Prioritize ‘interactive’ (vs background) processing
  • Break large requests into a sequence of small requests (to avoid head-of-line blocking)
  • Rate-limit/defer (until low load) expensive background processes
  • Tolerate faults: rely on extra resources
  • Tolerate variability: rely on the same extra resources
  • Selective Replication
    • static – more replicas of important docs
    • dynamic – more replicas of Chinese documents as Chinese queries increase
  • Latency-Induced Probation
    • Servers are sometimes slow to respond
    • remove capacity under load to improve latency
    • continue sending shadow stream of requests to server to continue measuring & return when better
  • Poison-pill request: don’t send to ALL servers, send to a canary or two before sending to ALL
  • Backup Requests: Send to the primary replica, but if it is slow to respond, send a backup request to a secondary replica, then cancel the primary request (can be costly, but not too bad in practice)
    • Triple-reads don’t really add more benefit… Double-read gives enough
  • I found this to be a really excellent talk
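
To make the backup-request idea concrete, here is a minimal sketch in Python's asyncio. It is not code from the talk: `backup_request`, `fake_replica`, and the `hedge_after` delay are hypothetical names for illustration, and this version cancels whichever request is still in flight once the first answer arrives.

```python
import asyncio


async def backup_request(primary, secondary, hedge_after):
    """Send to primary; if no reply within hedge_after seconds, also ask secondary.

    primary and secondary are hypothetical zero-argument coroutine functions.
    Returns the first result to arrive and cancels the request still in flight.
    """
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()  # primary answered in time, no backup was sent

    second = asyncio.create_task(secondary())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the slower request
    return done.pop().result()


async def fake_replica(name, delay):
    """Stand-in for a real replica call: sleeps, then returns its own name."""
    await asyncio.sleep(delay)
    return name


if __name__ == "__main__":
    winner = asyncio.run(
        backup_request(
            lambda: fake_replica("primary", 0.5),
            lambda: fake_replica("secondary", 0.01),
            hedge_after=0.05,
        )
    )
    print(winner)
```

The `hedge_after` value is the knob: set it too low and every request is sent twice, set it too high and the backup never helps. A failed request simply raises here; a real client would need to decide what to do in that case.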

Ben Rushlo (Keynote) – Performance In Context – Is “Good” Good Enough?

  • Video
  • “It’s about the customer”
  • Regionality comment: In India, users are accustomed to poor performance, so user behavior differs (fewer bounces for a given latency)
  • Does business consider Performance as a Feature?

Saurabh Bajaj (Neustar) – Exponential Load Testing: Multiply the Power, Multiply the Results

  • Video
  • Multi-cloud – It’s like diversifying stocks
  • IT transitioning to BT (Business Technology)

Pamela Fox (Khan Academy) – Lowering the Barrier to Programming

  • Video / Slides
  • @pamelafox
  • Girl Develop It – SF
  • We can’t rely on luck anymore, we need to lower the barriers
  • Barrier 1 – Access to a computer (DonorsChoose.org / CSUnplugged)
  • Barrier 2 – Local Dev Setup –> We need more online programming environments
    • Online: Some languages available. Hardware? Other languages?
  • Barrier 3 – CS Classes [http://code.org/learn/local]
  • Barrier 4 – Social Encouragement
    1. Parental
    2. Familial
    3. Peer
  • Barrier 6 – Career Misconceptions
  • Barrier N+ – Many vary by environment
  • TODO: Lower the barrier for one kid to learn to code!

Rodney Mullin (Professional Skateboarder) – Building on a Bedrock of Failure

  • Video
  • How you deal with failure makes the difference between good & great
  • Best skaters = best fallers (usually)
  • Finely tuned things can fall out of tune very quickly
  • “Natural range of recoverability”
  • It is unrealistic to expect constant progress
  • The importance of failure, and the bravery of getting up again

Mark Zeman (SpeedCurve) – Responsive Web Performance in the Wild

  • Video
  • http://speedcurve.com – Front End Performance Monitoring
  • Sits on top of WebPageTest
  • Very pretty visuals of history and capabilities
  • 38% of sample sites do no optimization for mobile
  • 55% change strategies for images
  • Very little optimization by number of requests

Cheryl Ainoa (Intuit) – How to Adapt and Innovate for 2018

  • Video
  • Keeping pace with Technology Disruptions
  • Solving the Customer Problem
    1. Have deep customer empathy
    2. Keep one eye on the future

“Understanding Slowness”

@postwait

Slides: https://speakerdeck.com/postwait/understanding-slowness

“Slow is the new down.”

Time to fix outages directly correlates to how broad the engineer’s focus is.

Senior Engineers are more cognizant of the steaming pile that is their architecture

Map #1 – High-level map
  • Architectural Components
  • Connectedness
  • Data flow
Map #2 – Low-level map
  • Component versions
  • Component languages
  • OS/NICs/HBAs
  • Location
  • Switches/Routers/FW
  • Connected Service details
2 types of useful SREs:
  • Spanning several boundaries (deep, not wide)
  • Spanning all boundaries (wide, not deep)
Who’s On First?
  • Establish who is responsible for each component in each context
  • Establish who is responsible when that person fails (upward)
  • Establish who is responsible when that person needs help (upward & downward)
  • GAME DAY EXERCISES mitigate these challenges
Expectations
  • Set expectations for breakages and slowdowns
  • What you build will break; understanding under what stress is your job as an engineer
  • (Choosing which of these that you need to know is part of the challenge in small/all companies)
0 Tech Loyalty
  • Construct a solution from parts
  • Parts are replaceable
  • (Different parts by different providers will have different tolerances, which can be good for your infrastructure… Ex. If one provider is failing at X tolerance, maybe the other will fail at a different point?)
Logistics matter (when things are broken [or slow])
  • Observability
  • Tool parity
  • Safety harnesses (you can change production code in a defined/protected scope)
Latency
  • You must subdue it
  • First, you must understand it
Histograms over Aggregations
  • Averages are for chumps
  • Reducing many observations (S) to N values is the definition of “lossy”
  • AKA “you don’t know shit”
Quantiles
  • Time series histograms are a lot of information to digest
  • Moving quantiles can often provide much more insight
  • Min & Max are the most valuable quantiles if you only have 2 – it bounds reality for you
Granular Data
  • Time consolidation is needed
  • it can be misleading
  • ask good statistical questions
  • (Ex. It is impossible to analyze 88M requests received in a day)
Work backwards
  • At what quantile are you?
  • I.e. “1000 millisecond SLA is at what quantile in my distribution?”
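
A small sketch of the points above about averages, quantiles, and working backwards from an SLA. The latency numbers are made up, and `quantile` (nearest-rank) and `quantile_of` are hypothetical helper names, not anything shown in the talk.

```python
import math


def quantile(samples, q):
    """Nearest-rank quantile: smallest sample with at least a fraction q at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def quantile_of(samples, threshold):
    """Work backwards: the fraction of samples at or below a threshold (e.g. an SLA)."""
    return sum(1 for s in samples if s <= threshold) / len(samples)


# Made-up latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [20] * 90 + [900] * 8 + [4000] * 2

mean = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean)
print("min/max:", min(latencies_ms), max(latencies_ms))
print("median:", quantile(latencies_ms, 0.5))
print("p99:", quantile(latencies_ms, 0.99))
print("1000 ms SLA sits at quantile:", quantile_of(latencies_ms, 1000))
```

With this data the mean is 170 ms, a latency that no request actually experienced. The median is 20 ms, the max is 4000 ms, and a 1000 ms SLA sits at the 0.98 quantile, so 2% of requests miss it.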
Tools
  • Tools do not a master craftsman make
  • Regardless, know your damn tools
Observation
  • Taking measurements
  • Not for making changes
Synthesis
  • Should be last resort
Hitting the wall
  • Latent latency
  • Stuff goes wonky at 3PM… well, we made a change at 10am

“Minimum Viable Bureaucracy”

Slides: https://speakerdeck.com/lauraxt/minimum-viable-bureaucracy-june-2014-edition

Arch scales linearly for a while –> then blows up –> re-architecture and continue –> team/company size is the same

“Phase changes” at 10/50/200/etc…

Chaord / trust / autonomy
  • has chaos, has order
  • traditional: “get your ducks in a row”
  • chaord: “self-organizing ducks”
Basis of any self-organizing system: TRUST
  • “nobody comes to work to do a bad job”
How to build trust
  • time: many small deposits in the trust bank
  • start by trusting others
  • be trustworthy
  • build relationships outside of tense environments
  • (She said “‘Trust but verify’? No, ‘Trust’”)
  • Hire people you believe you can trust (to do a good job), then trust them
Practicalities
  • awesome communication practices –> requires practice
  • over-communicate
  • effective remote teams are good at this
  • write/record more things
  • asynchrony is key
  • “We are all remoties”
  • remote teams enable hiring the best people
  • good communication practices + high levels of trust = remote effectiveness
  • chances are, you will end up with more than one office
Tactics:
  • shared communication spaces
  • all work should have URLs
  • IRC/campfire/whatever
  • etherpads and wikis
  • bug tracking
  • email
  • record meetings (video / audio / shared notes)
  • record decisions

Have conversations, more than meetings

If ACTUAL meeting
  • communally editable agenda (skip if empty)
  • limit # of participants
  • limit length
  • clustered: “maker’s schedule”
  • record (asynchrony)
  • take notes (asynchronous)
Minimum Viable (Project) Documentation
  • how to install
  • how to create and ship
  • roadmap
  • glossary
  • where to get help
Borrow from OSS
  • SMEs emerge
  • make sure they have a second (no SPOF)
  • rotate unwanted responsibilities
  • If everyone hates a thing, can’t we get rid of it?
Remember:
  • self-organizing does NOT equal democracy
  • self-organizing does NOT equal anarchy
Many architectural problems are bikesheds
  • allow the primary owner or a motivated champion time to make a prototype
  • … while not working on other stuff
  • … don’t let it go too long
“Come with code”
  • many of the hardest problems and rewrites are 80% done by one person in a two day marathon
  • … then the other 80%…
A portion of each engineer’s time must be spent on what that engineer thinks is most important…
  • may be 100%
  • or 60%, 40%, 20%
  • but NEVER zero
REMEMBER
  • push responsibilities to the edges
  • open source models
  • freedom to innovate
Collaborative architecture design:
  • non-bikesheds
  • brainstorm the interfaces, split up and go
Arch Goals
  • decouple
  • agree on interfaces, then split up and go
Ops problems
  • evidence > gut feelings
  • problem immersion
  • operational mindset is a key skill
  • “Oh, it must be the DB”… “Oh, have you measured that?”

MVP applies to EVERYTHING

Anti-estimation:
  • prototype
  • timebox prototype
  • timebox specs
  • timebox for shipping
  • do hardest parts first

Need to balance Perfectionists and Pragmatists

Iterate on process like on your product

If something’s not working for people, fix it, not “some day”, NOW

Why managers at all?
  • “founders”, “team leads”, “managers”, emergent leader
  • there is no such thing as structureless organization
  • structure may be emergent, but it is there
  • leaders guide emergent constructive behavior
Servant Leadership Model
  • be humble
  • you are an enabler
  • enabling is more important than doing
  • (but don’t stop coding)
  • introverts make great servant leaders

A manager’s job is to be visible as a representative of your company

Ask questions instead of giving answers

Socratic/Confucian/Rabbinical method

Stumbles
  • NO SURPRISES
  • shorten feedback cycle: instant, explicit
  • every reason and person is different
  • boredom/burnout/depression/crises/cruisers
  • help them fix it, or help them find somewhere they will be happier

3 key things: mastery, autonomy, purpose

“A Look at Looking in the Mirror: Post-Mortems”

@SoberBuildEng / ShipShowPodcast Related Video

Analogy – Dev : Pilots :: Ops : Air Traffic Control

“Five Why’s” model is based on the “Sequence of Events” model… (Falling out of favor now.)

“Epidemiological Model” (aka “Swiss Cheese Model”)
  • acknowledges different subsystems in play
  • different types of “failure” – Latent / Active
  • Accounts for various system actors

Hindsight Bias / Outcome Bias / Correspondence Bias

“Systemic Model”

Tool of note: tmate
  • “Instant terminal sharing”
  • Useful when working on an ops problem: everyone can see the same thing as the user

“The Cultural Implications of Every Major Technology Decision”

@drabinovich

Slides: http://www.slideshare.net/DanielRabinovich/daniel-rabinovich-velocity-2014-santa-clara

Passive/Defensive Culture: Compliance becomes more important than achievements

Other slides that may be of interest – API Design Choices

“How to be Great at Operations”

@adamhjk

Slides (Open-Sourced!): https://github.com/adamhjk/good-at-ops

“Some Simple Math to get Some Signal out of Your Ops Data Noise”

@tboubez

Video from a different conference: http://vimeo.com/95069158

Idea that dawned on me at this talk
  • come up with your candidate math that will trigger alerts,
  • then apply to a known historical dataset
  • and graph the meaningful parts/boundaries of it
  • see if you alert more/less/better
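
A rough sketch of that idea, under loud assumptions: the candidate rule (alert when a point exceeds the mean plus k standard deviations of the preceding window) is just one I picked for illustration, `backtest` is a hypothetical name, and the history is made up with one known incident planted in it.

```python
import statistics


def backtest(series, window, k):
    """Replay a candidate rule over history: alert when a point exceeds
    mean + k * stdev of the preceding `window` points.

    Returns (index, value, upper_bound) for every point that would have alerted.
    """
    alerts = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        upper = statistics.mean(recent) + k * statistics.pstdev(recent)
        if series[i] > upper:
            alerts.append((i, series[i], upper))
    return alerts


# Made-up history with one known incident at index 12.
history = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 10, 11, 40, 11, 10, 12]

for index, value, upper in backtest(history, window=8, k=3):
    print(f"would have alerted at {index}: {value} > {upper:.2f}")
```

The returned upper bounds are the boundaries worth graphing against the raw series. Rerunning with a different `window` or `k` shows whether a candidate rule would have alerted more, less, or better on the same history.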
Kolmogorov-Smirnov
  • Awesome algorithm for evaluating now vs previous data
  • Not implemented in tools
  • Not operationalized for real-time
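
For reference, the two-sample Kolmogorov-Smirnov statistic itself is small enough to sketch in plain Python. This is my own illustration, not code from the talk: `ks_statistic` is a hypothetical name, it returns only the statistic (no p-value), and it makes no attempt at the real-time problem noted above.

```python
import bisect


def ks_statistic(previous, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0.0 = identical, 1.0 = no overlap)."""
    a, b = sorted(previous), sorted(current)
    gap = 0.0
    # The empirical CDFs only change at sample points, so checking those is enough.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Made-up samples: "previous" window vs "now".
print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
print(ks_statistic([1, 2, 3, 4], [10, 11, 12, 13]))
```

Because it compares whole distributions rather than a single aggregate, the statistic moves when the shape of the data changes even if the average does not. Statistics libraries such as SciPy (`scipy.stats.ks_2samp`) provide the same test with a p-value.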
Tools of note: