My *nix world

Check Google page rank of entire website

There are hundreds of websites that provide a tool to check your website's rank. What the heck? "What are my chances of getting the first position?", you will ask. Then I will ask this: for what keyword? Because, you see, if you search for "weather" then your website probably won't get the first rank; a weather-minded website will, right? So the right way to put it is: my page has this rank for this keyword (or these keywords).

Now, assuming that we made clear this point, I will tell you my story about website ranking.

I was interested in finding a web service (hopefully a free one) able to provide the rank of a URL for some specified keywords. Oh, there are plenty of them and they are free 🙂 . But how about a service able to scan my whole website, pick up the (meta) keywords contained in each page, and then provide the page's ranking for those keywords? Well, that would be a problem. There are not many web services like this, or at least I wasn't fortunate enough to find them. Anyway, these circumstances led me to what later became a PHP project that solved this issue.

How to check Google page rank of entire website

[Screenshot: check google page rank]

Basically I wrote from scratch a PHP application that:

  • reads your sitemap.xml file in order to get a list of pages for your entire website
  • scans each page and gets its meta-keywords (from Open Graph or title meta-tags)
  • queries the search engine (like Google, Bing, Yahoo, Ask.com) for those keywords and gets the page ranking for their parent page
  • computes a traffic share estimate for that page (see below)
  • compiles a detailed report that shows for each page:
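The first steps of the list above can be sketched as follows. This is only a minimal illustration in Python, not the project's PHP code. The function names (`sitemap_urls`, `page_keywords`, `rank_of`) are hypothetical, and reading a `<meta name="keywords">` tag before falling back to the Open Graph title or the page title is my assumption about a reasonable order.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace used by standard sitemap.xml files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml):
    """Return the text of every <loc> element in the sitemap namespace."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


class KeywordParser(HTMLParser):
    """Collect keywords from <meta name="keywords">, og:title and <title>."""

    def __init__(self):
        super().__init__()
        self.meta_keywords = []
        self.og_title = ""
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = (attrs.get("content") or "").strip()
            if (attrs.get("name") or "").lower() == "keywords":
                self.meta_keywords = [k.strip() for k in content.split(",") if k.strip()]
            elif attrs.get("property") == "og:title":
                self.og_title = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_keywords(html):
    """Keywords for one page: meta keywords first, then og:title, then <title>."""
    parser = KeywordParser()
    parser.feed(html)
    if parser.meta_keywords:
        return parser.meta_keywords
    fallback = parser.og_title or parser.title.strip()
    return [fallback] if fallback else []


def rank_of(page_url, result_urls):
    """1-based position of page_url in an ordered result list, or None."""
    target = page_url.rstrip("/")
    for position, url in enumerate(result_urls, start=1):
        if url.rstrip("/") == target:
            return position
    return None
```

Here `result_urls` stands for the ordered list of result links already extracted from a search results page; how to extract them depends on the search engine's HTML and is not shown.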

Now, most search engines are not happy with guys like me because they are not willing to let us crawl and use their data programmatically, although they are surely happy when you use their search engine manually (perhaps you will click on an ad so that they get some bucks, right?). Back to the point: because they are not happy with that, they do all they can to block your app from crawling their data. How can they do that? How do they know that "the thing" that queries them is nothing but a computer program and not a human being? Well, that's simple: I assume (I don't know for sure) they analyze how many queries per time unit are sent to them, they check whether the queries come from the same browser and/or from the same host, they check whether the HTTP request carries the headers a browser would send (such as a referrer), etc.

So basically all we have to do is send them queries from different IP addresses and make those queries look as if they come from different browsers. We can do the former simply by using a proxy network (like Tor, preferably with controlling software like Vidalia) or just a list of free (HTTP/SOCKS) proxy servers. Regarding the browser, that's simple too: with each query we claim that it comes from a random browser (the user agent signature) with some random referrer (although they might check this). Some examples of random user agents (just to name a few):

  • Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3
  • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
  • Opera/9.20 (Windows NT 6.0; U; en)
  • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24
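To make the idea concrete, here is a minimal sketch in Python (not the project's PHP code) of how a query can be dressed up with a random user agent and referrer and sent through an HTTP proxy. `SEARCH_URL`, `REFERRERS`, `build_request` and `opener_for` are hypothetical names and values chosen for illustration. Note that Python's `urllib` handles HTTP proxies only; going through Tor's SOCKS port would need an extra library, which is not shown.

```python
import random
import urllib.parse
import urllib.request

# The user agent strings listed above.
USER_AGENTS = [
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)",
    "Opera/9.20 (Windows NT 6.0; U; en)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]

# Hypothetical values, for illustration only.
SEARCH_URL = "https://search.example/search"
REFERRERS = ["https://example.com/", "https://example.org/links"]


def build_request(keywords, rng=random):
    """Build a search request with a randomly chosen user agent and referrer."""
    query = urllib.parse.urlencode({"q": keywords})
    headers = {
        "User-Agent": rng.choice(USER_AGENTS),
        "Referer": rng.choice(REFERRERS),
    }
    return urllib.request.Request(SEARCH_URL + "?" + query, headers=headers)


def opener_for(proxy):
    """Return an opener that routes HTTP(S) traffic through the given proxy URL."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)
```

A request would then be sent with something like `opener_for("http://127.0.0.1:8118").open(build_request("linux"), timeout=10)`, where the proxy address is just a placeholder.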

Now, how do we get a list of proxy servers and make sure that they are reliable/functional? Well, we crawl the Internet for them, right? So I also wrote some tools that crawl some proxy-minded websites, and hopefully we'll get thousands of proxies (host:port, type, etc.). Besides that I wrote another tool that, in the background, verifies each proxy to determine whether it's working, to find out its speed and so on. That is how we get our proxy list.
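The verification step could look roughly like the sketch below, again in Python rather than the project's PHP. It assumes a proxy counts as "working" if a small test page can be fetched through it, and it uses the elapsed time as the speed measure. `TEST_URL` and all function names are hypothetical.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TEST_URL = "http://example.com/"  # hypothetical page used as the probe target


def http_probe(proxy, timeout=5.0):
    """Fetch TEST_URL through the proxy; raises on any failure."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(TEST_URL, timeout=timeout) as response:
        response.read(1)


def check_proxy(proxy, probe=http_probe):
    """Return (proxy, seconds) if the probe succeeds, else (proxy, None)."""
    started = time.monotonic()
    try:
        probe(proxy)
    except Exception:
        return proxy, None
    return proxy, time.monotonic() - started


def working_proxies(proxies, probe=http_probe, workers=20):
    """Check proxies concurrently; return the working ones, fastest first."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: check_proxy(p, probe), proxies))
    alive = [(p, t) for p, t in results if t is not None]
    return sorted(alive, key=lambda item: item[1])
```

The `probe` parameter is there so the check can be swapped (for example for a SOCKS-aware one) without touching the rest.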

This functionality is written so that it can be easily embedded in other projects. So I wrote, just for fun (but also with testing in mind), a web interface that can be used to test the functionality described above (see the screenshot above). In that interface we can specify a URL, the proxy we are going to use and, if we are using the Tor network, the anonymizer we use to get a random IP. Besides that we can specify:

  • the search engine (only Google, Bing and Yahoo are supported for the moment)
  • how many seconds to sleep between requests
  • whether to retry, possibly with another IP address, when we get a message about a banned IP (especially from Google)
  • whether to hide or show the output errors
  • whether to ignore the images that your sitemap.xml might contain (a picture contains no keywords, so there is no point in checking its ranking by keywords, right?)
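The sleep and retry options can be pictured with the following Python sketch (the project itself is PHP, and this is not its code). `BannedError` and `fetch_with_retry` are hypothetical names, and `fetch` stands for whatever function actually sends the query through a given proxy and raises `BannedError` when the search engine answers with a ban message.

```python
import time


class BannedError(Exception):
    """Raised by a fetch function when the search engine reports a banned IP."""


def fetch_with_retry(query, proxies, fetch, delay=2.0, sleep=time.sleep):
    """Try each proxy in turn, pausing `delay` seconds before every retry."""
    for attempt, proxy in enumerate(proxies):
        if attempt:
            # Wait before retrying so the requests are spread out in time.
            sleep(delay)
        try:
            return fetch(query, proxy)
        except BannedError:
            # This IP is banned; move on to the next proxy.
            continue
    raise BannedError("every proxy was rejected for query %r" % query)
```

Passing `sleep` as a parameter keeps the function easy to try out without actually waiting.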

If we want to use this functionality programmatically we have only to call the following function:

where the parameters speak for themselves, except $handler, which can be used to pass a callback function that alters the searched keyword before it is used (read the project Wiki for more info).

So the whole project is just a bunch of about 1300 lines of PHP code that crawl the Internet for proxies and for page ranks. I don't know if the term "crawl" best describes what these apps do, but you get the main idea. The project has a Git repository hosted on bitbucket.org at: https://bitbucket.org/eugenmihailescu/pagerankchecker.

Note that this is the alpha version of the project, something like two or three days of work. I am going to work more on this project and also write more about parsing HTML source code with DOM XPath (mostly in PHP, although I will do some work with Java too).

If you want to get updated with the latest news about the evolution of this project then please subscribe to my email list.

For the impatient, I've installed the demo application on a ZendServer at http://eugenmihailescu.my.phpcloud.com/pagerankchecker. Please note that the link points to a virtual machine (that is probably hibernating right now) on a PHP cloud, so you might need to wait a minute or so for it to wake up. You should also be aware that, unfortunately, due to some settings of the server in the cloud, the crawler does not give you continuous feedback/output, so you have to wait until the script ends. Of course, if you have only 10 pages that's about 30 seconds, but if you have 400 pages... then it will take a while.

Now, if you think that this article was interesting, don't forget to rate it. It shows me that you care, and thus I will continue writing about these things.

 

Eugen Mihailescu

Founder/programmer/one-man-show at Cubique Software
Always looking to learn more about *nix world, about the fundamental concepts of math, physics, electronics. I am also passionate about programming, database and systems administration. 16+ yrs experience in software development, designing enterprise systems, IT support and troubleshooting.