Re: robots.txt on git.postgresql.org

From: Dave Page <dpage(at)pgadmin(dot)org>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: robots.txt on git.postgresql.org
Date: 2013-07-10 08:35:24
Message-ID: CA+OCxoyOiOLbk8PM_HJCfnNj=uxgmOYz+cA4s40CUm9vWYSOeA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 10, 2013 at 9:25 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 07/09/2013 11:30 PM, Andres Freund wrote:
>> On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
>>> I note that git.postgresql.org's robots.txt refuses permission to crawl
>>> the git repository:
>>>
>>> http://git.postgresql.org/robots.txt
>>>
>>> User-agent: *
>>> Disallow: /
>>>
>>>
>>> I'm curious what motivates this. It's certainly useful to be able to
>>> search for commits.
>>
>> Gitweb is horribly slow. I don't think anybody with a bigger git repo
>> using gitweb can afford to let all the crawlers go through it.
>
> Wouldn't whacking a reverse proxy in front be a pretty reasonable
> option? There's a disk space cost, but using Apache's mod_proxy or
> similar would do quite nicely.

It's already sitting behind Varnish, but the vast majority of pages on
that site would only ever be hit by crawlers anyway, so I doubt that'd
help a great deal as those pages would likely expire from the cache
before it really saved us anything.
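[Editor's note: the cache-expiry concern above could in principle be addressed with a TTL override for immutable gitweb pages, since a page for a fixed commit hash never changes. A minimal sketch in Varnish 3.x-era VCL follows; the `/gitweb/` URL prefix and the 30-day TTL are illustrative assumptions, not the actual postgresql.org configuration:]

```vcl
# Hypothetical VCL fragment (Varnish 3.x syntax, current in 2013).
# Pages addressed by commit/blob/tree hash are immutable, so they can
# safely outlive the default TTL even if crawlers are the only visitors.
sub vcl_fetch {
    if (req.url ~ "^/gitweb/.*a=(commit|commitdiff|blob|tree)") {
        set beresp.ttl = 30d;  # assumed value; tune to available disk/memory
    }
}
```

[A complementary option on the robots.txt side would be a `Crawl-delay` directive instead of a blanket `Disallow: /`, throttling crawlers rather than excluding them; support for it varies by crawler.]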

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
