Cat /proc/cpuinfo or don't trust your cores to rackspace part i

rubyrescue.com

9 points by inaka 16 years ago · 33 comments

jbyers 16 years ago

Don't trust any servers to anyone. When we get a new server we check its stats against reality (have had upside and downside surprises on CPUs), run bonnie++ to make sure IO is as expected (it hasn't been due to exotic RAID problems), and run memtester to see if we have bad RAM (had that too). Takes more time, sure, but no surprises later.
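
A minimal sketch of the "check its stats against reality" step, assuming a Linux box where `/proc/cpuinfo` lists one `processor` entry per logical CPU. The expected values (8 CPUs, 16 GiB) and the 10% RAM tolerance are made-up illustrations, not anything from the article; it does not replace bonnie++ or memtester.

```python
"""Compare what the kernel reports against what was ordered (Linux only)."""
from pathlib import Path


def count_logical_cpus(cpuinfo_text):
    """Count 'processor : N' entries, one per logical CPU the kernel sees."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


def mem_total_kib(meminfo_text):
    """Return the MemTotal value (in KiB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def check_against_order(cpuinfo_text, meminfo_text, expected_cpus, expected_mem_kib):
    """Return a list of human-readable mismatches (empty means all good)."""
    problems = []
    cpus = count_logical_cpus(cpuinfo_text)
    if cpus != expected_cpus:
        problems.append(f"CPUs: expected {expected_cpus}, kernel sees {cpus}")
    mem = mem_total_kib(meminfo_text)
    # MemTotal sits a bit below installed RAM (kernel reservations), so allow
    # some slack; the 10% used here is an arbitrary assumption.
    if mem < expected_mem_kib * 0.9:
        problems.append(f"RAM: expected ~{expected_mem_kib} KiB, kernel sees {mem} KiB")
    return problems


if __name__ == "__main__":
    cpuinfo, meminfo = Path("/proc/cpuinfo"), Path("/proc/meminfo")
    if cpuinfo.exists() and meminfo.exists():
        # Hypothetical order: 8 logical CPUs and 16 GiB of RAM.
        issues = check_against_order(
            cpuinfo.read_text(),
            meminfo.read_text(),
            expected_cpus=8,
            expected_mem_kib=16 * 1024 * 1024,
        )
        print("\n".join(issues) or "matches the order")
```

Running it right after delivery, and again after every hardware change, catches both the upside and the downside surprises.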

  • jacquesm 16 years ago

    That's the right attitude.

    For extra points do serious burn-ins, especially on network hardware, keep a good eye on those error counters as well as mcelog in case you got a faulty ram in there.

    It's all part of commissioning a server, especially if you host at a cheap outlet like rackspace.
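
    One possible way to keep an eye on those error counters during a burn-in, sketched under the assumption of a Linux host exposing the usual 16-column `/proc/net/dev` layout. The function names are hypothetical; mcelog is not covered here.

```python
"""Snapshot per-interface error/drop counters from /proc/net/dev (Linux)."""
from pathlib import Path


def parse_net_dev(text):
    """Return {interface: {counter_name: value}} for error and drop counters."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(x) for x in data.split()]
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        counters[name.strip()] = {
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
        }
    return counters


def increased(before, after):
    """Return only the counters that grew between two snapshots."""
    grown = {}
    for iface, now in after.items():
        prev = before.get(iface, {})
        delta = {k: v - prev.get(k, 0) for k, v in now.items() if v > prev.get(k, 0)}
        if delta:
            grown[iface] = delta
    return grown


if __name__ == "__main__":
    path = Path("/proc/net/dev")
    if path.exists():
        # Counters since boot; anything non-zero deserves a second look.
        print(increased({}, parse_net_dev(path.read_text())) or "no errors recorded")
```

    Taking one snapshot before the load test and one after, then calling `increased(before, after)`, shows whether the test itself provoked errors.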

    • cperciva 16 years ago

      "a cheap outlet like rackspace."

      I've heard rackspace called lots of things, but this is the first time I've heard someone call them "cheap". Have their prices gone down lately?

      • jacquesm 16 years ago

        On a relative scale they're cheap, what they call 'managed hosting' though is not what I'd call managed hosting. I think they call it managed hosting because they will do backups for you or something like that :)

        The Planet/EV1, which was my choice when hosting in the US earlier was quite a bit cheaper, but service there was absolutely terrible.

        It got to the point where I reprogrammed the DRAC cards to lock out their sys admins.

        After The Planet took over we had all kinds of issues, then finally they had an explosion in a transformer in one of their datacenters taking down all of our stuff for days on end. After that we moved out.

        They said it had nothing to do with them and we would be credited for the downtime if we stayed for at least another 6 months, but we had by then already signed up elsewhere and restored from backups. I figure if you're willing to run your operation that close to the red line then we should be taking our business elsewhere. They were lucky nobody got hurt.

        Right now we're hosting in three places: leaseweb, mojohost and virtual access. VXS is by far the best but expensive, leaseweb is somewhere in the middle and for high volume mojohost is absolutely unbeatable.

        btw, you have me curious what else you've heard rackspace called :)

        • Confusion 16 years ago

          I don't have experience with a wide range of hosting companies, but out of five, VXS is the only one about which I have had nothing to complain. They are very good. The primary difference with other companies is that their staff actually know their stuff.

        • cperciva 16 years ago

          "btw, you have me curious what else you've heard rackspace called :)"

          Most of what I've heard about rackspace is that they had a good reputation once, but have been resting on their laurels and now have higher prices and worse service than other hosts -- but this is all 2nd hand and I have no direct experience with them, so I was curious to hear other perspectives.

          • jacquesm 16 years ago

            Ok. I think if they worked a bit harder at justifying that 'managed hosting' bit then they would actually be worth it.

            With both mojohost and leaseweb I basically only bug them when there are network or hardware issues, for the rest the problems are mine. VXS is a different story, that's where all the web servers are, their operators help with warding off all kinds of attacks, proactively scan for security issues and so on. 24x7 cell phone of the manager of the hosting facility.

            It's very addictive, that level of service.

            They charge a pretty penny for it, but imo it's worth it, it is still much cheaper than having a full time sysadmin for our stuff, and they probably do a better job of it.

  • jacquesm 16 years ago

    Apropos upsides, always do an fdisk /dev/sdX for drives that are not supposed to be installed in your machine; sometimes you get more than you paid for.

    This happened to me several times at EV1, same with memory and CPU.

    It really pays off to check if you can boot a machine with an SMP kernel and another CPU shows up too (don't bother doing that on a celeron box though).

jacquesm 16 years ago

If you never run top '1' then you shouldn't be operating servers for customers.
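
For readers who have not seen it: pressing `1` inside top switches to one line per CPU. A rough equivalent, assuming Linux and the standard `/proc/stat` layout, might look like the sketch below. It is not how top itself is implemented, and the function names are made up for illustration.

```python
"""Per-CPU utilization from two /proc/stat samples (Linux only)."""
import time
from pathlib import Path


def parse_proc_stat(text):
    """Map 'cpu0', 'cpu1', ... to (busy_ticks, total_ticks)."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-CPU lines look like 'cpu0 ...'; the bare 'cpu' line is the aggregate.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        ticks = [int(x) for x in parts[1:9]]  # user..steal; guest is already in user
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
        total = sum(ticks)
        result[parts[0]] = (total - idle, total)
    return result


def utilization(before, after):
    """Return {cpu: busy fraction} for the interval between two samples."""
    usage = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        elapsed = total1 - total0
        usage[cpu] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return usage


if __name__ == "__main__":
    stat = Path("/proc/stat")
    if stat.exists():
        first = parse_proc_stat(stat.read_text())
        time.sleep(1)
        second = parse_proc_stat(stat.read_text())
        for cpu, frac in sorted(utilization(first, second).items()):
            print(f"{cpu}: {frac:.0%}")
```

If a box sold as multi-core only ever prints `cpu0`, that is the situation the article describes.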

  • inakaOP 16 years ago

    attempting to ignore snarkiness, but failing - and that says what about the dozens of rackspace engineers configuring and monitoring and supporting the box over the past two years?

    • jacquesm 16 years ago

      There once was a really nice quote here on HN: "you can't outsource responsibility".

      If in two years' time you've never ever had a look at what kernel you are running, especially while tuning a system for performance, you only have yourself to blame.

      Don't tell me you're running a 'stock' kernel and never bothered tuning it for your application, or considered upgrading it. Also, in your resources list you should have the exact machine configuration, there are tools to retrieve that sort of info automatically.

      Then, when you're done, store it in http://inventory.sf.net/ or something like that.
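
      As an illustration of retrieving that sort of info automatically, here is a small sketch using only the Python standard library. It is an assumption-laden stand-in, not the tool linked above: the field names and the `diff_inventory` helper are hypothetical.

```python
"""Record a machine's basic configuration and detect later drift."""
import json
import os
import platform


def machine_inventory():
    """Collect a few facts worth keeping in a resources list."""
    uname = platform.uname()
    return {
        "hostname": uname.node,
        "kernel_release": uname.release,
        "kernel_version": uname.version,
        "machine": uname.machine,
        "logical_cpus": os.cpu_count(),
    }


def diff_inventory(recorded, current):
    """Return {field: (recorded, current)} for every field that changed."""
    return {
        key: (recorded.get(key), current.get(key))
        for key in sorted(set(recorded) | set(current))
        if recorded.get(key) != current.get(key)
    }


if __name__ == "__main__":
    # Store this output somewhere safe, then diff against it after upgrades.
    print(json.dumps(machine_inventory(), indent=2))
```

      Comparing the stored record with a fresh one after a CPU upgrade would show whether `logical_cpus` and the kernel actually changed.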

      It's typical that the people at rackspace would simply drop in the requested hardware, and that you yourself deal with the configuration.

      The smart money is on running some tests after they've done that to make sure it went ok. Asking for a CPU upgrade and not checking if they're operational is just plain stupid.

      I figure you literally asked rackspace to upgrade the CPU, and that's what they did.

      Did you explicitly ask them to install an SMP kernel with a specific version and they didn't do it ? Or did you expect them to do it but you didn't check if they actually did until today ?

      Two full years of trying to tune a box for performance and not noticing this, then publicly blaming rackspace is simply cheap, an attempt at pinning the blame on rackspace, for something that you should have noticed long ago yourself.

      Kudos for writing about it but the title should be "How I messed up". That's taking responsibility and then making sure it never ever happens again.

      • nailer 16 years ago

        'Don't tell me you're running a 'stock' kernel and never bothered tuning it for your application, or considered upgrading it.'

        Not sure why you've got 'stock' in quotes. Vendor kernels are used by hundreds of thousands of servers, each sharing the same bug reports and security updates. There's a massive benefit unless you think you can do those bug reports and security updates better than your OS vendor.

        Most custom compiles are by people who don't understand loadable modules or read something written before they existed.

        • jacquesm 16 years ago

          Ok, point taken, but there are definite advantages to 'rolling your own'.

          • forkqueue 16 years ago

            Such as?

            Using a vendor-supplied kernel means that there are extremely likely to be other people using most of the same stack as you, many of them on the same hardware. If there are problems, it's likely that other people have noticed the issue, even if the bug hasn't been found, so it's much more likely to get fixed.

            If you compile your own kernel (and/or copy of Apache, MySQL etc etc) you're running something unique to you. If you have problems, you're on your own.

            If you're paying for Red Hat Enterprise, use the Red Hat Enterprise packages unless there's a good reason not to. If something goes wrong, you can call Red Hat support and have at least a steer in the right direction. Custom-compiling everything just for the sake of it, just to have new 'shiny' stuff is crazy.

            • jacquesm 16 years ago

              It's not for the 'new shiny' at all, it's got to do with optimizing your kernel to match your hardware and getting rid of loadable module support in favor of a kernel that has on board exactly that which is needed to operate your system.

              A 'stock' kernel has a whole pile of things in it that might be the next remote exploit, by removing such stuff you marginally increase security.

              Other things you might need:

                 - kernel support for booting from raid filesystems without trickery
                 - processor family optimizations
                 - maximum number of cores (stock = 8, we run 16 on quite a few machines) 
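
              Whether a given kernel is limited this way can be read from its build config. A sketch, assuming the distro ships the config as `/boot/config-<release>` (common, but not universal) and relying on the kernel options `CONFIG_SMP` and `CONFIG_NR_CPUS`; the helper names are hypothetical.

```python
"""Read the SMP setting and CPU limit from a kernel build config."""
import platform
from pathlib import Path


def parse_kernel_config(text):
    """Return {'CONFIG_X': 'value'} for set options; unset ones are omitted."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments include '# CONFIG_X is not set'
        key, sep, value = line.partition("=")
        if sep:
            options[key] = value
    return options


def cpu_limit(options):
    """Return (smp_enabled, max_cpus or None) according to the build config."""
    smp = options.get("CONFIG_SMP") == "y"
    nr_cpus = options.get("CONFIG_NR_CPUS")
    return smp, (int(nr_cpus) if nr_cpus is not None else None)


if __name__ == "__main__":
    # Assumed location; some distros expose /proc/config.gz instead.
    config = Path(f"/boot/config-{platform.release()}")
    if config.exists():
        smp, max_cpus = cpu_limit(parse_kernel_config(config.read_text()))
        print(f"SMP enabled: {smp}, CONFIG_NR_CPUS: {max_cpus}")
```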
              
              As for compiling, I do that anyway, it's a small job compared to the number of times that you need to do it. And you're just as much 'on your own' to solve problems, the chances of having them are less though (because the system you are running is considerably leaner).

              Second your redhat enterprise solution, that's not what I'm using though on most of our machines (either centos or debian), but that's a good solution too.

              • nailer 16 years ago

                Not a big user of software RAID, but AFAIK booting from a metadisk / is still out of the box doable as it was a few years ago:

                "- processor family optimizations"

                RHEL / CentOS include a variety of kernels precompiled for different CPU architectures.

                "- maximum number of cores (stock = 8)"

                What distro? RHEL / Centos support far more than that, we've got quite a number of 32 core machines and have a few 64 core boxes in test.

      • inakaOP 16 years ago

        Honestly your tone stings a bit, jacquesm. I guess i shouldn't have titled it 'don't trust rackspace' but my larger point is exactly the opposite of what you wrote - take responsibility for your servers - don't trust anyone, even the most expensive hosting provider, to do it for you.

        • jacquesm 16 years ago

          That's a whole lot better, if that was the message then it somehow got lost to me.

          Again, apologies for the tone, but it really seems to be a trend to make a mistake, 'blame someone', then blog about it.

          What I would suggest you do, and this is meant very seriously, is find a cheaper hosting provider (EV1/The Planet is about half of what you pay right now) and spend the rest on getting a part-time sysadmin that really knows his stuff.

          The difference in $ should be minimal, then look over the guy's shoulder at how it is done, but keep doing what you know is your 'level' anyway. That way you get the best of both worlds, excellent care and you don't break the bank, at the same time you'll learn a huge amount.

          And if your startup grows you just might have found an employee for the future.

          Spend some time looking around, your best bet would be a guy or girl that does sysadmin duties for a larger company using UNIX that wants to make some extra $ in their spare time.

      • sailormoon 16 years ago

        I really do not dig this tone. The guy is obviously not a system admin. He paid top dollar for rackspace managed hosting precisely so he wouldn't have to do the kinds of things you mention.

        "You can't outsource responsibility" is utter nonsense. It is completely impossible to "own" responsibility for everything important in a complex society. Meaningless platitudes should not distract from the fact - Rackspace did not do their job.

        Yes, he messed up. He messed up by making assumptions and not checking Rackspace's work more closely. That's not the same as messing up in your own work. His post is a reminder to be more careful checking on the work of your "upstream". There's no need to pile on with the "if you didn't know 'top 1' you shouldn't be running a startup!" etc.

        • thaumaturgy 16 years ago

          I'm not a big fan of the tone, either, however, jacquesm is spot-on in his assessment.

          For one thing, my understanding of Rackspace's business practices -- and I've only dealt with them peripherally, so I might be a bit wrong here -- is that they "manage" things like their network, and the actual server hardware, and stuff like that. So, if you want a CPU upgrade, sure, they'll do that. If you need your server rebooted, they'll do that too. But, they don't have anyone sitting there monitoring your system's performance metrics and doing your sysadmin duties for you.

          The way I read it, Rackspace did do their job: they upgraded the hardware. It was up to the server admin -- not Rackspace -- to check that the software was then configured correctly.

          And finally, I don't generally agree with statements of the form, "If you don't know X, you shouldn't be doing Y", but ... looking at dmesg and top are both really, really, really standard sysadmin operations. Entry level stuff, really. Sysadmin work doesn't just mean messing around with Apache's configuration; there are many more nuances, and it's likely that their system is vulnerable to problems that they don't even know about.

          • jacquesm 16 years ago

            The tone is probably in large part because the OP does not take any responsibility for his own part in this and instead is pointing his finger at a third party that may have been partially at fault. But that is by no means sure.

            This is typical with what I think is a real problem in society, the 'externalization of blame'.

            Inability to see your own responsibility is a serious issue, and it is really pervasive. If I were in the OP's position I would be headbutting a piece of concrete for 20 minutes to make sure I never ever make a mistake like that again, and I would thank rackspace for finally finding the fault that I could have noticed in 5 minutes two years ago.

            That's why you have post-delivery checklists, burn in tools and inventory management, staples of everybody that has # on machines that do customer work.

            I'll try to keep my 'tone' better under control, apologies for that.

            At least it wasn't in Dutch ;)

          • inakaOP 16 years ago

            probably not a good idea to comment authoritatively on a company you haven't worked with, but no, kernel management is part of rackspace's job. performance and monitoring is part of their job. they have an SLA and this is absolutely part of it...

        • jacquesm 16 years ago

          I didn't say he shouldn't be running a startup, I said he should not be managing the servers their customers stuff runs on.

          As for the tone, you may disagree with that but that does not detract from the fact that if you operate a business, you should know your stuff.

          And if you outsource something you should at least know how to check up on the bits that you've outsourced.

          Outsourcing does not mean that your responsibility disappears, it simply changes from 'doing' to 'monitoring'.

          Maybe rackspace did not do their job, I have no insight in the communications that went on between the party involved and rackspace.

          All we get here is a pointing finger without any responsibility taken, that is not a realistic picture.

          It could be the difference in the wording of the upgrade request ("please install another CPU in our machine" vs "please install and configure another CPU in our machine").

          Even then, rackspace probably should get part of the blame, but really not all of it. The fact that the situation persisted for two years is completely on the OP's account: in two years you have many more opportunities than your hosting provider to find this out. After all, they will leave your machine alone unless it malfunctions, and there is no indication that they were ever asked to look into this; when they were, they actually found the problem.

          I quote from the article "In investigating an unrelated issue, we followed up with Rackspace on a Kernel patch that couldn’t be applied to our server. One of the technicians immediately realized why – we were not running the SMP kernel."

          How come someone is trying to patch a kernel, can't apply the patch and then still doesn't clue in to the situation ?

          Also, we do not know if the SMP kernel was installed or not, it might have been, and then on the final reboot the wrong kernel was brought up. And that's a very easy mistake to make.

          But dmesg would tell you in a heartbeat, as would 'top '1'', which you would be using plenty of times while debugging performance issues to make sure all your cores are doing the right amount of work.
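
          A sketch of the same check done programmatically, assuming a Linux kernel whose version string (what `uname -v` prints) contains the token `SMP` when built with SMP support. The `diagnose` helper and its messages are hypothetical, and the expected CPU count is something you have to supply yourself.

```python
"""Tell apart 'wrong kernel booted' from 'CPU missing' after an upgrade."""
import os
import platform


def kernel_is_smp(version_string):
    """True if the kernel version string carries the SMP marker."""
    return "SMP" in version_string.split()


def diagnose(version_string, online_cpus, expected_cpus):
    """Return a short verdict comparing online CPUs with what was ordered."""
    if online_cpus >= expected_cpus:
        return "ok"
    if not kernel_is_smp(version_string):
        return "non-SMP kernel booted"
    return "SMP kernel, but fewer CPUs online than expected"


if __name__ == "__main__":
    # expected_cpus=2 is a made-up value for illustration.
    print(diagnose(platform.version(), os.cpu_count() or 1, expected_cpus=2))
```

          Because a reboot can bring up a different kernel than the one installed last, a check like this is worth repeating after every reboot, not only after the upgrade itself.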

          • sailormoon 16 years ago

            "I said he should not be managing the servers their customers stuff runs on."

            And what if it's only him? No go then huh?

            You've been saying a lot of this kind of thing lately. That guy before with the App Store payment problem? You came down on him like a ton of bricks. And now this. Just because people haven't dotted every i and crossed every t. It's not exactly the hacker mentality is it?

            • jacquesm 16 years ago

              Then he could hire a part-time sysadmin, there are plenty of those looking for work. I figure for $200 / month he can switch to a similar powered dedicated server with a competitor and pay a guy for 4 hours worth of real hands on sysadmin time every month. That way he pays roughly the same and comes out ahead in every way.

              Managing UNIX systems that have to perform well under load takes quite a bit of knowledge. Sure, everybody can install 'ubuntu', 'redhat', 'gentoo' or whatever flavor is popular this week. But that does not make you a system administrator. I wouldn't trust myself with my customers machines either, simply because to stay up-to-date on all the holes in all the packages that you may have installed and keeping them patched is real work.

              I don't think I came down on the app-store guy 'like a ton of bricks', in fact I gave what I thought was pretty sensible advice and offered (after Sam Odio did) to help him out.

              But it's essentially the same problem as what is happening here, blame company X because of something that you caused yourself.

              The app store guy:

                - quit job before having money in the bank
                - set up overly complicated corporate structure to avoid non-existent liability
              
              This guy:

                - take responsibility for a part of the operation that he's not qualified to do
                - keep on messing for two years without calling in outside help (sure, it will cost)
              
              And both of them point the finger at another party.

              So maybe that's why it seems to you that this is a 'lot of this kind of thing'.

              Whether or not it is the hacker mentality is not my concern; I call it as I see it.

              I've had people here rip me to bits for making a stupid remark (and rightly so), if you can dish it 'Rackspace is at fault because they don't know how to upgrade a cpu' or 'Apple is at fault because they don't pay me' then you should be able to take it.

        • dagw 16 years ago

          It all comes down to what I paid for. Check your contracts carefully.

          Some places basically just give you a computer and make sure that it always has power and network and that the hard drive is backed up, and everything after that is up to you. Other places give you 24 hour sys-admins that continuously monitor everything and basically manage every aspect of your server. The former obviously costs a lot less than the latter.

          It's perfectly OK to outsource responsibility, but you've got to pay top dollar for someone to take on that responsibility. You cannot go for the cheap option and expect all the services offered by the expensive option.

  • aminuit 16 years ago

    If you're bragging about 300ms average response times...

    If you're using "spidey-sense" to determine the underperforming elements in your software stack...

    There are a number of uncomfortable ideas in this article.

    • thaumaturgy 16 years ago

      If you're swapping out your web server software without doing some more careful analysis first...

ericwaller 16 years ago

"After the conversion from Apache [+passenger] to Nginx [+unicorn] overall throughput went up by a factor of 3."

I noticed a similar gain (10-12 req/s to 25-30 req/s) for action cached pages after switching out nginx+passenger for nginx+thin. This is on a 256 MB slice, and seems totally counter to the general opinion that passenger is great for VPSes.

  • jcapote 16 years ago

    Have you tried nginx/passenger? I ask because I wonder how much of that gain is part of switching to nginx rather than switching to unicorn...

    • inakaOP 16 years ago

      on a separate application and server, yes, and it seems faster than apache/passenger, but it still suffers from the 'passenger choke' where touching tmp/restart.txt seems to pause all web traffic for 10-15 seconds, so we never considered it here...
