Google bot delays executing JavaScript for days

itbrokeand.ifixit.com

128 points by dbeardsl 13 years ago · 37 comments

h2s 13 years ago

    > If you're removing code or changing an endpoint,
    > be careful you don't screw the Google bot, which
    > might be "viewing" 3-day-old pages on your
    > altered backend.
An interesting proposition. Personally, unless I was operating in some sector where keeping Googlebot happy was key to staying competitive and there was solid evidence it could hurt my page rank, I don't think I'd be prepared to go to this length. Google is doing quite an atypical thing here compared to regular browsers and I'd like to think Google engineers are smart enough to account for this type of thing in the early stages of planning.

They have a difficult cache invalidation problem here. The only way to find out if the Javascript in use on a site has changed is by checking if the page HTML has changed. And on top of that, the Javascript can change without any noticeable change to the HTML.

  • zaptheimpaler 13 years ago

    obligatory: "There are only two hard problems in Computer Science: cache invalidation, naming things, and off-by-one errors."

ashray 13 years ago

Googlebot also does some other crazy stuff, like looking at URL patterns and then trying out variations... it's almost like they're trying to sniff URLs!

For example if I have a page: www.domain.com/xyz/123

Googlebot (without any links to those pages) will actually try URLs like www.domain.com/xyz/1234, www.domain.com/xyz/122, www.domain.com/xyz/121 and so on...

It's crazy how much 'looking around' they do these days!

  • ceejayoz 13 years ago

    I believe that one's mostly a search for duplicate content - looking for URL parameters that don't make a difference.

eli 13 years ago

I'm not too surprised. I've got Googlebot still requesting old URLs even though there are no incoming links to them (that I know of) and they've been either 404'd or 301-redirected for six months. I even tried using 410 Gone instead of 404, but it made no difference.

  • chrislomax 13 years ago

    To reiterate this further: I am still 301'ing URLs that have been dead for nearly 5 years, and I still get requests for them. I don't want to 404 them for fear of losing that slight bit of traffic, so I just 301 them. I am really surprised they don't remove these URLs from their cache, and I can't for the life of me think why they don't.
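
    For anyone wanting to do the same, here's a minimal sketch of that kind of permanent redirect as an Express route (the framework choice and both paths are just illustrative):

      // Sketch only: permanently redirect a long-dead URL to its
      // replacement. '/old-page' and '/new-page' are hypothetical.
      var express = require('express');
      var app = express();

      app.get('/old-page', function (req, res) {
          res.redirect(301, '/new-page'); // 301 Moved Permanently
      });

      app.listen(3000);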

    • gizmo686 13 years ago

      They might have some obscure incoming url from somewhere else on the net.

      • SquareWheel 13 years ago

        Webmaster Tools should show that. If you 404 the page, it'll appear in the errors pane after some time and show incoming sources.

        • gizmo686 13 years ago

          It sounds like the only thing requesting the 404'ed page was Googlebot, which I do not believe tells you the referrer. If this is true, then it would mean either that Google does not clear their cache (which I doubt), or that the link exists somewhere on the net, but in a place where no human would find it. I've done some work with web crawlers, and you fall into that type of hole a lot more often than I would expect.

          • SquareWheel 13 years ago

            I'm not sure I understand: why wouldn't Webmaster Tools show that one hard-to-find link if Googlebot found it?

        • eli 13 years ago

          I removed a whole section from the site at the same time. Webmaster Tools shows the incoming links for every page in the section as other pages in that section. It's a whole loop of pages linking to each other and generating inbound links even though they all no longer exist and haven't for many months.

          • SquareWheel 13 years ago

            Yeah, Webmaster Tools is really slow to update. Thankfully they offered a way to delete old pages. If they come back, though, then it should show the source of the link.

  • udfalkso 13 years ago

    Same here.

jes5199 13 years ago

Your users may be, too. It's not unusual for me to open my sleeping laptop several days later and expect the open web pages to work without refreshing them.

  • oakwhiz 13 years ago

    What might be a good idea for Javascript-heavy web apps is to make an Ajax call to the server to see if a refresh of the page is required.

    • jacobolus 13 years ago

      Please don’t even call out to the server, unless I actively interact with the page. I sometimes open 50 or 60 browser tabs at once, and when I unsleep my laptop or connect to a new wifi hotspot, many of them try to simultaneously make such ajax calls, which prevents any other web pages I want to open from loading until those calls either make it through or time out. Occasionally I have to SIGSTOP my main browser and open a different one if I need to access something online right away. Even pages of ostensibly static content, like years-old news articles, now are littered with "web 2.0" doodads on them which do this kind of crap.

      • lifeisstillgood 13 years ago

        Please don't take this the wrong way, but why do you have 50-60 browser tabs open? Are you using browser tabs as some sort of todo list? Is it effective?

        • jes5199 13 years ago

          I currently have 42 tabs open in Chrome (in two separate windows). Yes, it's like a todo list. It would work better if I had a mechanism to de-duplicate tabs that were open to the same URL - sometimes I find that I've got three tabs open all monitoring the same CI build. But I also keep several JIRA bugs open in tabs so I don't forget them, two Gmail accounts and twitter, I've got five-ish articles that I was in the middle of reading, some reference documents for a few projects I'm in the middle of, a couple of youtube videos that I haven't found time to watch yet...

        • oakwhiz 13 years ago

          There have been days when I have had more than 100 browser tabs open. Users should be able to have as many tabs as they want.

          • lifeisstillgood 13 years ago

            That was not the question - I am paying to have a small office to reduce my distractions, and allow focusing. I am just interested in how people manage the distractions, and with 100 tabs open I would presume that they are not all open regarding the same single task at hand.

            • AndrewDucker 13 years ago

              They aren't.

              Most of them are open from other tasks. But that doesn't make them a distraction. They're there for when the current task finishes and I can go back to them.

              The alternative is to close them all down and reopen them, which is vastly more time consuming.

    • moreati 13 years ago

      Please don't do that. I left that tab open on purpose; I'm halfway through reading the page. If the page refreshes, I'll likely lose my position.

      • iamjustlooking 13 years ago

        You don't have to refresh the page, you could make it so that your next page click loads the full page instead of using ajax/pjax.

        quick pjax e.g.:

          <html data-lastupdated="1234567890"...

          // On load, ask the server when the site was last updated.
          $.getJSON('/lastupdated.json', function (lastupdated) {
              // If the server is newer than this page, strip the
              // data-pjax attributes so the next click does a full
              // page load instead of a pjax partial load.
              if (lastupdated > $('html').data('lastupdated')) {
                  $('a[data-pjax]').removeAttr('data-pjax');
              }
          });
        • oakwhiz 13 years ago

          This is exactly what I meant by my initial comment - I should have been more clear.

      • ______ 13 years ago

        You don't have to force a reload; you can just suggest that the user reloads, e.g. "Please reload the page to get the latest version".
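
        A minimal sketch of that gentler approach, reusing the /lastupdated.json idea from the sibling comment (the endpoint and class name are just illustrative):

          // Check once on load; if the server has something newer,
          // show a banner instead of forcing a reload.
          $.getJSON('/lastupdated.json', function (lastupdated) {
              if (lastupdated > $('html').data('lastupdated')) {
                  $('<div class="reload-banner">')
                      .text('Please reload the page to get the latest version')
                      .appendTo('body');
              }
          });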

  • xiongchiamiov 13 years ago

    It's a call that was being made (only) when the page loaded.

TazeTSchnitzel 13 years ago

I wonder if it's Google's visual site previews/thumbnails (the ones you get when you click on the arrow at the side of a search result) that are doing this.

Perhaps Google fetches the crawled page from the cache and then renders that for the previews?

  • sj26 13 years ago

    This was my first thought, and seems likely. They do several forms of analysis on their cache. It could even be some engineers running tests or queries that require rendering the page or at least bootstrapping the DOM.

georgemcbay 13 years ago

Is this surprising? I'd expect the possibility of this sort of behavior from any system that was vaguely Map-Reduce-y and operated on the scale of data that Google's indexing does.

ericcholis 13 years ago

I'm wondering if some of the simpler cache-busting tricks would force Google to update their cache. For example, somescript.js?v=201210221559.

  • dbeardsl (OP) 13 years ago

    That's not the issue here; we include the md5 hash of the content in the URL of every JavaScript/CSS asset, so the new pages had all the correct (brand new) URLs. The issue is that Google is executing JavaScript on HTML pages they downloaded days ago. The only solution I can see is to fire off CloudFront cache expiration requests for all old assets, but that negates the simplicity of including the hash of the content in the URL.
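
    To illustrate the scheme, a simplified sketch (not our actual build code; the paths are made up):

      // Name each asset after the md5 of its content so any change
      // yields a brand-new URL and stale copies are never served.
      var crypto = require('crypto');
      var fs = require('fs');

      function hashedAssetUrl(path) {
          var content = fs.readFileSync(path);
          var md5 = crypto.createHash('md5').update(content).digest('hex');
          return '/assets/' + md5 + '-' + path.split('/').pop();
      }

      // e.g. hashedAssetUrl('js/app.js') might give '/assets/1a2b...-app.js'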

    • ChuckMcM 13 years ago

      Is it possible that people are looking at the page from Google's cache? I'm thinking of the 3taps kind of 'web site scraping that doesn't look like web site scraping'.

      • xiongchiamiov 13 years ago

        Hmm, that's interesting. I don't think so, though, because the user-agent on the requests is the googlebot:

            From: googlebot(at)googlebot.com
            User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
        • ChuckMcM 13 years ago

          Well an interesting check would be to look at one of your pages in the cache that fires an AJAX call and see where that call comes from. I agree it would be 'weird' if it came from Googlebot instead of the browser looking at the cache.

          At Blekko we post-process extracted pages of the crawl, which, if a site were putting content behind JS, could result in JS calls offset from the initial access, but 3 days seems like a long time. Mostly, though, the JS is just page animation.

      • martin-adams 13 years ago

        Would it make sense that loading from the cache makes a call to the origin server?

        I just checked one of my sites (which loads available delivery dates via Ajax) through the Google cache, and yep, it caches that too: the dates are from when the cache was taken.

lists 13 years ago

Did anyone else get really bad font rendering running Chrome on Windows 7?
