Lessons from the Battlefield: Scaling an Engineering Organization in Silicon Valley


In early March 2007, I decided to leave my cushy, well-paying software engineering job at Google to join my brother on a startup adventure at Yelp. At the time, the company was just getting started across the U.S.: San Francisco had the beginnings of a strong community, and New York was just starting to write its first reviews. (Crazy to think that we’re now over 90 million reviews on the platform!) Back then, the number of local searches for burritos, Italian food and doctors was quite small. For a software engineer transitioning out of Google-scale big data (>1 billion queries per day), this was small data (<200,000 queries per day).

I joined Yelp as the 7th engineer, when Neil Kumar was our VP of Engineering and Russell Simmons was CTO and co-founder. At a large company like Google, you get lulled into believing that resources are infinite because you can’t ever see everyone in a single building. At Yelp, the entire company spanned two floors of a small brick building in San Francisco’s SOMA district. The engineering team was one pod of desks and there was no time-sharing the on-call rotation. In other words, I was always on-call.

2011

I took over lead responsibility for the engineering team in 2011. The team was then 30 folks who covered five core areas: search, consumer experience, business owner tools, operations, and internal apps. Five years later, I now oversee a team of over 300 engineers with even deeper expertise in those focus areas and many more. Along the way, I’ve learned a lot of lessons on the growth battlefield that has stretched these past eight years.

Having technical recruiting report into engineering = sleep better   

One of our early successful org experiments was to have technical recruiting report directly into the engineering organization. This experiment and transition were tough. Hiring technical recruiters is very difficult, as is managing employees outside of engineering, since you need to learn an extra set of management skills beyond your core expertise. Any delay in your recruiting process or in delivering feedback to recruiters can cost your organization a potentially game-changing employee. I lost a lot of sleep knowing we were missing out on strong engineers due to a slow and unoptimized process.

In Silicon Valley, there’s a good (almost guaranteed) chance that when you’re competing for an engineer, they’re simultaneously looking at offers from bigger organizations like Google or Microsoft. You need to be able to clearly demonstrate how your organization and the role they will play are better than what a bigger company would provide. And often, speed is the best way to demonstrate it! If you are more nimble and responsive in the process and in decision making, it can only strengthen your position versus a slower bureaucracy.

Single points of failure are an organization's enemy   

The most stressful part of the past four years was our lack of organizational resilience. If we lost our database engineer, who would fail over if the master database died? If a critical college recruiter left right before the recruiting season, who would represent Yelp? As you scale, these are scary realities. You can’t have single points of failure like these at scale. As a manager, you need to always be looking for backups and planning for issues, so that when the unexpected actually happens, the organization continues to operate smoothly.

Your team grows like a fractal

It’s critical to establish the company’s social norms, positive habits and mindsets early on, as they will later define your larger organization. When building an organization, you should imagine a fractal -- a pattern that repeats at all different scales. If you fail to hire the right people to scale the organization, you’ll be dealing with a problem at an entirely new scale as the fractal grows. And if an organization is broken in a market as competitive as Silicon Valley, you’ll never be able to attract game-changing talent.

From 2007 to 2011, our operations team was running as most small organizations run: a couple of heroes save the day when the site goes down, install all the packages, handle new hardware orders, add bandwidth -- essentially, they do it all. This doesn’t scale well as you get to 100+ engineers. The heroes get overwhelmed with the demands of the bigger organization, and communication breaks down. Firefighting resources and energy need to transition to fire prevention. Over the next year, we worked on recruiting and on setting up the engineering organization to work in a more collaborative DevOps style.

In 2012, we started migrating to Amazon Web Services. We didn’t purchase a new server rack in all of 2015! Currently, ~90% of our data center spending is on the cloud, an accomplishment delivered entirely by our tremendous engineering team. By being almost completely cloud native, we can move faster than traditional do-it-yourself rack-and-stack teams. We are able to scale services up and down in minutes without being locked into hardware purchases, quickly modify resources for large traffic spikes (holidays, partnerships, etc.), and elastically provision machines based on actual demand, saving us tons! Being cloud native is our goal: every engineer should be adding value to the product, and your organization doesn’t have time to replace fans, bad memory, or hard drives. If you want to use our cutting-edge technology, we’ve open sourced our big successes along the way. The main project is PaaSTA, which allows teams to continuously deploy, release and monitor services using a host of OSS tools.

IPO day - March 2, 2012

Looking ahead

As an engineering team, we’re continuously attempting to redraw the lines of how a modern-day, web-scale engineering team operates and executes. I’ve outlined three challenges from our early organizational scaling: 1) building a recruiting organization inside of engineering to optimize our recruiting pipeline, 2) eliminating single points of failure in the organization, and 3) ensuring a strong foundation before scaling to prevent even bigger problems. In future blog posts, I’ll catch you up on modern-day Yelp Engineering and share other history that can serve as a roadmap for growth and smooth out some bumps along the way.

Follow me on Twitter: https://twitter.com/stopman