Windows: a software engineering odyssey

This is a text transcription of the slides from "Windows: a software engineering odyssey", a talk on Microsoft culture given by Mark Lucovsky in 2000. This is hosted here because I wanted to link to the slides, but the only formats available online were PowerPoint and slide-per-page HTML where each page is basically a screenshot of a PowerPoint slide. If you're looking for something on current Microsoft culture, try these links.

Agenda

  • History of NT
  • Design Goals/Culture
  • NT 3.1 vs. Win2k
  • The next 10 years

NT timeline: first 10 years

  • 2/89: Coding begins
  • 7/93: NT 3.1 ships
  • 9/94: NT 3.5 ships
  • 5/95: NT 3.51 ships
  • 7/96: NT 4.0 ships
  • 12/99: NT 5.0 a.k.a. Windows 2000 ships

Unix timeline: first 20 years

  • 69: coding begins
  • 71: first edition -- PDP 11/20
  • 73: fourth edition -- rewritten in C
  • 75: fifth edition -- leaves Bell Labs, basis for BSD 1.x
  • 79: seventh edition -- one of the best
  • 82: System III
  • 84: 4.2 BSD
  • 89: SVR4 unification of Xenix, BSD, System V
    • NT development begins

History of NT

  • Team forms 11/88
  • Six guys from DEC
  • One guy from MS
  • Built from the ground up
    • Advanced PC OS
    • Designed for desktop & server
    • Secure, scalable, SMP design
    • All new code
  • Schedule: 18 months (only missed our date by 3 years)

History of NT, cont.

  • Initial effort targeted at Intel i860 code-named N10, hence the name NT which doubled as N-Ten and New Technology
  • Most dev done on i860 simulator running OS/2 1.2
  • Microsoft built a single board i860 computer code-named Dazzle, including the supporting chipset; ran full kernel, memory management, etc. on the machine
  • Compiler came from Metaware with weekly UUCP updates sent to my Sun-4/200
  • MS wrote a PE/COFF linker and a graphical cross debugger

Design longevity

  • OS code has a long lifetime
  • You have to base your OS on solid design principles
  • You have to set goals; not everything can be at the top of the list
  • You have to design for evolution in hardware, usage patterns, etc.
  • Only way to succeed is to base your design on a solid architectural foundation
  • Development environments never get enough attention

Goal setting

  • First job was to establish high level goals
    • Portability: ability to target more than one processor, avoid assembler, abstract away machine dependencies. Purposely started the i386 port very late to avoid falling into a typical Microsoft x86 centric design
    • Reliability: nothing should be able to crash the OS. Anything that crashes the OS is a bug. Very radical thinking inside MS considering Win16 was co-operative multi-tasking in a single address space, and OS/2 had similar attributes with respect to memory isolation
    • Extensibility: ability to extend OS over time
    • Compatibility: with DOS, OS/2, POSIX, or other popular runtimes; this is the foundation work that allowed us to invent Windows two years into NT OS/2 development
    • Performance: all of the above are more important than raw speed!

NT OS/2 design workbook

  • Design of executive captured in functional specs
  • Written by engineers, for engineers
  • Every functional interface was defined and reviewed
  • Small teams can do this efficiently
    • Making this process scale is an almost impossible challenge
    • Senior developers are inundated with spec reviews and the value of their feedback becomes meaningless
    • You have to spread review duties broadly and everyone must share the culture

Developing a culture

  • To scale a dev team, you need to establish a culture
    • Common way of evaluating designs, making tradeoffs, etc.
    • Common way of developing code and reacting to problems (build breaks, critical bugs, etc.)
    • Common way of establishing ownership of problems
  • Goal setting can be the foundation for the culture
  • Keeping culture alive as a team grows is a huge challenge

The NT culture

  • Portability, reliability, security, and extensibility ingrained as the team's top priorities
    • Every decision was made in the context of these design goals
  • Everyone owns all the code, so whenever something is busted anyone has a right and a duty to fix it
    • Works in small groups (< 150 people) where people cover for each other
    • Fails miserably in large groups
  • Sloppiness is not tolerated
    • Great idea, but very difficult to nurture as group grows
    • Abuse and intimidation get way out of control; can't keep calling people stupid and expect them to listen
  • A successful culture has to accept that mistakes will happen

NT 3.1 vs. Windows 2000

  • Dev teams
  • Source control
  • Process management
  • Serialized development
  • Defects

Development team

  • NT 3.1
    • Starts small (6), slowly grows to 200 people
    • NT culture was commonly understood by all
  • Windows 2000
    • Mass assimilation of other teams into the NT team
    • NT 4.0 had 800 developers, Windows 2000 had 1400
    • Original NT culture practiced by the old timers in the group, but keeping the culture alive was difficult due to growth, physical separation, etc.
    • Diluted culture leads to conflict
      • Accountability: I don't "own" the code that is busted, see Mark!
      • Reliability vs. new features
      • 64-bit portability vs. new features

Source control system (NT 3.1)

  • Internally developed, maintained by a non-NT tools team
    • No branch capability, but not needed for small team
  • 10-12 well isolated source "projects", 6M LOC
  • Informal project separation worked well
    • Minimal obscure source level dependencies
  • Small hard drive could easily hold entire source tree
  • Developer could easily stay in sync with changes made to the system

Source control system (Windows 2000)

  • Windows team takes ownership of source control system, which is on life support
  • Branch capability sorely needed, tree copies used as substitutes, so merging is a nightmare
  • 180 source "projects", 29M LOC
  • No project separation, reaching "up and over" was very common as developers tried to minimize what they had to carry on their machines to get their jobs done
  • Full source base required about 50GB of disk space
  • To keep a machine in sync was a huge chore (1 week to set up, 2 hours per day to sync)

Process management (NT 3.1)

  • Safe sync period in effect for 4 hours each day; all other times, the rule is check-in when ready
  • Build lab syncs during morning safe sync period, which starts a complete build
    • Build breaks are corrected manually during the build process (1-2 breaks were normal)
  • Complete build time is 5 hours on 486/50
  • Build is boot tested with some very minimal testing before release to stress testing
    • Defects corrected with incremental build fixes
  • 4pm, stress testing on ~100 machines begins

Process management (Windows 2000)

  • Developers not allowed to change source tree without explicit, email/written permission
    • Build lab manually approves each check-in using a combination of email, web, and a bug tracking database
  • Build lab approves about 100 changes each day and manually issues the appropriate sync and build commands
    • Build breaks are corrected manually; when they occur, all further build processing is halted
    • A developer that mistypes a build instruction can stop the build lab, which stops over 5000 people
  • Complete build time is 8 hours on 4-way PIII Xeon 550 with 50GB disk and 512k cache
  • Build is boot tested and assuming we get a boot, extensive baseline testing begins
    • Testing is a mostly manual, semi-automated process
    • Defects occurring in the boot or test phase must be corrected before the build is "released" for stress testing
  • 4pm, stress testing on ~1000 machines begins

Team size

Product   Devs   Testers
NT 3.1    200    140
NT 3.5    300    230
NT 3.51   450    325
NT 4.0    800    700
Win2k     1400   1700

Serialized Development

  • The model from NT 3.1 to 2000
  • All developers on team check in to a single main line branch
  • Master build lab syncs to main branch and builds releases from that branch
  • Checked in defect affects everyone waiting for results

Defect rates and serialization

  • Compile time or run time bugs that occur in a dev's office only affect that dev
  • Once a defect is checked in, the number of people affected by the defect increases
  • Best devs are going to check in a runtime or compile time mistake at least twice a year
  • Best devs will be able to cope with a checked in compile time or run time break very quickly (20 minutes end-to-end)
  • As the code base gets larger, and as the team gets larger, these numbers typically double

Defect rates data

  • With serialized development
    • Good, small, teams operate efficiently
    • Even the absolute best large teams are always broken and always serialized

Product   Team size   Defects/dev-yr   Fix time/defect   Defects/day   Total fix time
NT 3.1    200         2                20m               1             20m
NT 3.5    300         2                25m               1.6           41m
NT 3.51   450         2                30m               2.5           1.2h
NT 4.0    800         3                35m               6.6           3.8h
Win2k     1400        4                40m               15.3          10.2h
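
The last two columns of the table can be approximately reproduced from the first three. The slides do not say how the figures were derived, so the sketch below assumes defects are spread evenly over a 365-day year and that fixes do not overlap; the function and constant names are illustrative only.

```python
# Rows from the table: (product, team size, defects per dev-year, fix minutes per defect)
ROWS = [
    ("NT 3.1", 200, 2, 20),
    ("NT 3.5", 300, 2, 25),
    ("NT 3.51", 450, 2, 30),
    ("NT 4.0", 800, 3, 35),
    ("Win2k", 1400, 4, 40),
]

DAYS_PER_YEAR = 365  # assumption: defects are spread evenly over calendar days


def defects_per_day(team_size, defects_per_dev_year):
    """Expected number of checked-in defects per day across the whole team."""
    return team_size * defects_per_dev_year / DAYS_PER_YEAR


def total_fix_minutes(team_size, defects_per_dev_year, fix_minutes):
    """Minutes per day the main line is broken, assuming fixes do not overlap."""
    return defects_per_day(team_size, defects_per_dev_year) * fix_minutes


for product, team, rate, fix in ROWS:
    per_day = defects_per_day(team, rate)
    hours = total_fix_minutes(team, rate, fix) / 60
    print(f"{product:8} {per_day:5.1f} defects/day  {hours:5.1f} h broken/day")
```

Under these assumptions the Win2k row comes out at roughly 15.3 defects per day and a little over 10 hours of fix time per day, which is the sense in which a large serialized team is "always broken".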

Dev environment summary

  • NT 3.1
    • Fast and loose; lots of fun & energy
    • Few barriers to getting work done
    • Defects serialized parts of the process, but didn't stop the whole machine; minimal downtime
  • Windows 2000
    • Source control system bursting at the seams
    • Excessive process management serialized the entire dev process; 1 defect stops 1400 devs, 5000 team members
    • Resources required to build a complete instance of NT were excessive, giving few developers a way to be successful

Focused fixes

  • Source control
  • Source code restructuring
  • Make the large team work like a set of small teams
    • Windows is already organized into reasonable sized dev teams
    • Goal is to allow these teams to work as a team when contributing source code changes rather than as a group of individuals that happen to work for the same VP
    • Parallel development, team level independence
  • Automated builds

Source control system

  • New system identified 3/99 (SourceDepot)
  • Native branch support
  • Scalable high speed client-server architecture
  • New machine setup 3 hours vs. 1 week
  • Normal sync 5 minutes vs. 2 hours
  • Transition to SourceDepot done on live Win2k code base
  • Hand built SLM -> SourceDepot migration system allowed us to keep in sync with the old system while transitioning to SourceDepot without changing the code layout.

Source code restructuring

  • 16 depots, one for each major area of source code
  • Organization is focused on:
    • Minimizing cross project dependencies to reduce defect rate
    • Sizing projects to compile in a reasonable amount of time
    • To build a project, all you need is the code for that project and the public/root project
    • Cross project sharing is explicit

New tree layout

  • The new tree layout features
    • Root project houses public
    • 15 additional projects hang off the root
    • No nested projects
    • All projects build independently
    • Cross project dependencies resolved via public, public/internal using checked in interfaces
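
The rule that a project may reference only itself and the shared public area can be expressed as a simple check. This is a minimal sketch, not the tooling the Windows team used; the project names and the dependency map are hypothetical.

```python
PUBLIC = "public"  # hypothetical name for the shared area housed in the root project

# Hypothetical projects mapped to the projects their sources reference.
DEPENDENCIES = {
    "base": {"base", "public"},
    "shell": {"shell", "public"},
    "net": {"net", "public", "base"},  # reaches "up and over" into another project
}


def cross_project_violations(deps):
    """Return {project: referenced projects other than itself and public}."""
    violations = {}
    for project, referenced in deps.items():
        extra = referenced - {project, PUBLIC}
        if extra:
            violations[project] = sorted(extra)
    return violations


print(cross_project_violations(DEPENDENCIES))  # {'net': ['base']}
```

A project with no violations can be built from its own code plus public, which is what lets all projects build independently.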

Team level independence

  • Each team determines its own check-in policy, enabling rapid, frequent check-ins
  • Teams are isolated from mistakes by other teams
    • When errors occur, only the team causing the error is affected
    • A build, boot, or test break only affects a small subset of the product group
  • Each team has their own view of the source tree, their own mini build lab, and builds an entire installable build
  • Any developer with adequate resources can easily duplicate a mini build lab
    • Build and release a completely installable Windows system
  • Teams integrate their changes into the "main" trunk one at a time, so there is a high degree of accountability when something goes wrong in "main"
  • Build breaks will happen, but they are easily localized to the branch level, not the main product codeline
  • Teams are isolated from mistakes made by other teams
    • When errors occur, they affect smaller teams
    • A build, boot, or test break only affects a small subset of the Windows development team
  • Each team has their own view of the source tree and their own mini build lab
    • Each team's lab is enlisted in all projects and builds all projects
    • Each team needs resources able to build an NT system
  • Each team's build lab builds, tests, and mini-bvt's a complete standalone system
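
The effect of team-level branches on how many developers a single break stalls can be illustrated with a toy comparison. The 1400 developer figure is from the slides; the even split across seven team branches is an assumption for illustration, since the slides give a count of 7 VBLs but no per-team headcounts.

```python
TOTAL_DEVS = 1400  # Win2k developer count from the slides


def devs_blocked_serialized(total_devs):
    """Single main line: a checked-in break affects every developer."""
    return total_devs


def devs_blocked_branched(team_sizes, breaking_team):
    """Per-team branches: only the team that checked in the break is affected."""
    return team_sizes[breaking_team]


# Hypothetical even split across 7 team branches.
team_sizes = {f"team{i}": TOTAL_DEVS // 7 for i in range(1, 8)}

print(devs_blocked_serialized(TOTAL_DEVS))          # 1400
print(devs_blocked_branched(team_sizes, "team3"))   # 200
```

The break still costs the owning team time, but the other branches keep building and testing until that team integrates into main.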

Automated builds

  • Build lab runs 100% hands off
  • 10am and 10pm full sync and full build
    • Build failures are auto detected and mailed to the team
    • Successful builds are automatically released with automatic notification to the team
  • Each VBL can build:
    • 4 platforms (x86 fre/chk, ia64 fre/chk) = 8 builds/day, 56/week
    • No manual steps at all
    • 7 VBLs in Win2k group
    • Majority of builds work, but failures when they occur are isolated to a single team
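
A hands-off build cycle of this shape can be sketched as a loop over targets that reports breaks and releases successes without a person in the loop. This is an illustrative sketch only: the target names, `BuildResult`, `run_build_cycle`, and the `build` and `notify` callables are hypothetical stand-ins, not the actual build lab tooling.

```python
from dataclasses import dataclass

TARGETS = ["x86fre", "x86chk", "ia64fre", "ia64chk"]  # the 4 platforms on the slide
RUNS_PER_DAY = 2  # full sync and build at 10am and 10pm


@dataclass
class BuildResult:
    target: str
    ok: bool
    log: str = ""


def run_build_cycle(build, notify):
    """Build every target; mail failures and announce releases automatically."""
    results = []
    for target in TARGETS:
        result = build(target)
        if result.ok:
            notify(f"released {target}")
        else:
            notify(f"build break in {target}: {result.log}")
        results.append(result)
    return results


def builds_per_week():
    """4 targets x 2 runs per day x 7 days."""
    return len(TARGETS) * RUNS_PER_DAY * 7


# Usage with a fake build step that fails one target.
messages = []
results = run_build_cycle(
    lambda t: BuildResult(t, ok=(t != "ia64chk"), log="link error"),
    messages.append,
)
print(messages)
print(builds_per_week())  # 56
```

Passing the build and notification steps in as callables keeps the loop itself free of manual decisions, which is the property the slide emphasizes.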

Productivity gains

  • Developers can easily switch from working on release N to release N+1
  • Developers in one team will not be impacted by mistakes/changes made by other teams
  • Developers have long, frequent checkin windows (Base team has 24x7 checkin window with manual approval used during Win2k)
  • Source control system is fast and reliable
  • Testing is done on complete builds instead of assorted collections of private binaries
    • What is in the source control system is what is tested