Search Engine's journey of Crawling and Indexing
The journey of Search Engines
Google, as we all know, has been the king, or you could say the judge, of the SEO game. Its complicated, ever-changing algorithms make your website dance up and down the SERPs and dictate the changes you make to it. The results it displays are backed by a well-indexed database of over 100,000,000 gigabytes, built over around a million computing hours.
In fact, the journey of a search starts long before you type your query, with the crawling and indexing of trillions of documents.
Crawling: Finding information.
Google uses software bots known as “web crawlers” to discover the publicly available pages on the web. These crawlers look at webpages, follow the links from page to page and bring the data back to Google’s servers. This is the process of crawling. It starts from a list of web addresses collected from past crawls and from sitemaps provided by webmasters. When the crawlers visit a website, they look for links to other pages to visit, paying special attention to new sites, changes to existing sites and dead links. The computer programs that run these crawlers determine which sites to crawl, how often, and so on.
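To make the idea concrete, here is a minimal sketch of the follow-the-links loop described above. It is not how Google's crawler works internally; the `crawl` function, the `fetch` callback and the `FAKE_WEB` dictionary are all made up for illustration, and the tiny in-memory "web" stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML, or None for a dead link."""
    frontier = deque(seed_urls)   # addresses waiting to be visited
    seen = set(seed_urls)         # addresses already queued or visited
    pages, dead_links = {}, []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            dead_links.append(url)
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages, dead_links


# Hypothetical in-memory "web" used instead of real network requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/about">About</a> <a href="/gone">Old</a>',
    "http://example.com/about": '<a href="/">Home</a>',
}

pages, dead_links = crawl(["http://example.com/"], FAKE_WEB.get)
print(sorted(pages))   # the two pages that exist
print(dead_links)      # ['http://example.com/gone']
```

The seed list plays the role of the addresses from past crawls and sitemaps, and the `dead_links` list shows how a crawler can notice links that no longer lead anywhere.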
Are you wondering why your site hasn’t been crawled yet?
Start investigating by entering “site:site.com” (with your own domain in place of site.com) in the Google search bar. Now, compare the number of pages you actually have with the number of results shown in the SERPs. If the difference is large, it’s time to dig into finding the errors and correcting them.
So, start by analyzing your Google Webmaster Tools (now Google Search Console) dashboard. If Google has any issues crawling your site, it will list those errors and you can correct them accordingly.
The reasons why Google may not have crawled your site are as follows:
- .htaccess: This is simply a hidden configuration file which resides in your site’s web root directory on Apache servers. A badly or incorrectly configured .htaccess can lead to infinite redirect loops, which can drastically increase your site’s load time or stop pages from loading at all.
- Meta tags and robots.txt: A robots meta tag with noindex keeps a page out of the index, nofollow tells crawlers not to follow the links on it, and a Disallow rule in robots.txt stops the page from being crawled. So make sure you are not doing that for the pages you genuinely want to get indexed by Google.
- Sitemaps: Your sitemap may not be updating itself for some reason, and you may be resubmitting the same sitemap without addressing the issues and errors pointed out in the Google Webmaster Tools dashboard.
- Your PageRank is really low: This might sound surprising, but Google crawls pages roughly in proportion to their PageRank.
- DNS issue: There might be a DNS or server problem that stops Google’s bots from reaching you, or maintenance on the network may be creating problems.
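The robots.txt and meta tag checks from the list above can be sketched with Python's standard library. This is only an illustration: `crawl_blockers`, `RobotsMetaParser` and the sample robots.txt and HTML are made-up names and data, and the check covers only these two causes, not .htaccess, sitemap or DNS problems.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def crawl_blockers(url, page_html, robots_txt, user_agent="Googlebot"):
    """Return a list of reasons why `url` may be skipped by a crawler."""
    problems = []
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())  # parse rules from text, no network
    if not rules.can_fetch(user_agent, url):
        problems.append("robots.txt disallows crawling")
    meta = RobotsMetaParser()
    meta.feed(page_html)
    if "noindex" in meta.directives:
        problems.append("meta robots noindex")
    if "nofollow" in meta.directives:
        problems.append("meta robots nofollow")
    return problems


# Hypothetical sample data.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
BLOG_HTML = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'

print(crawl_blockers("http://example.com/private/page.html", "<html></html>", ROBOTS_TXT))
print(crawl_blockers("http://example.com/blog/post.html", BLOG_HTML, ROBOTS_TXT))
```

An empty list means neither of these two blockers was found for that page.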
Indexing: Organizing information
The web can be understood as a public library with trillions of books. The pages crawled and gathered by Google in the earlier stage are indexed in this stage. Much like the index at the back of a book, Google’s index records the words on each page and their locations. Whenever a query is entered, its algorithms look up the searched keywords in the index and come back with the relevant results.
The process becomes much more complicated at this stage. For example, if you search for the term ‘mousse cake’, you would not want a page which simply has ‘mousse cake’ written all over it. You would want pictures, tutorials on how to make it, nearby restaurants that sell it, and so on. Therefore, Google extracts the intention from the entered keywords, scans through its indexed database and comes up with relevant results.
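The "words and their locations" idea can be illustrated with a tiny inverted index. This is a simplified sketch, not Google's actual index: `build_index`, `search` and the sample pages are made up for illustration, and the lookup only matches keywords without any of the intent analysis described above.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to {url: [positions]} across all pages."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(url, []).append(position)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], {}))
    for word in words[1:]:
        matches &= set(index.get(word, {}))
    return sorted(matches)


# Hypothetical sample pages.
PAGES = {
    "/cake": "Chocolate mousse cake recipe",
    "/mouse": "How to clean a computer mouse",
    "/shop": "Order mousse cake online",
}

index = build_index(PAGES)
print(index["mousse"])               # {'/cake': [1], '/shop': [1]}
print(search(index, "mousse cake"))  # ['/cake', '/shop']
```

Because the lookup goes from word to pages, answering a query does not require rereading every page, which is the point of indexing ahead of time.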
Why is Google not indexing your site?
If you are not indexed by Google, you will not get any organic traffic from it. So, in a way, you are losing everything if your website is not indexed. These are some of the top reasons for a site not getting indexed by Google:
- The www and Non-www Domain issue
Well, www is a subdomain. Therefore http://www.mysite.com is different from http://mysite.com. Ensure that both these versions of your site are added to your Google Webmaster Tools account.
- You have blocked your sites/pages with robots.txt
The developer or the editor might have blocked the site using robots.txt (maybe by mistake). This is nothing to worry about, though, because it can be easily fixed. Simply remove the entry from robots.txt and your site will appear in the index again once Google recrawls it.
- Google might not have found you
If you are a new site, then wait for some time for Google to index you. But if it has been a long time and Google still hasn’t indexed your website, make sure that your sitemap is uploaded and working fine. You can also ask Google to crawl and fetch your site after signing into your Google Webmaster Tools account.
- Plenty of Duplicate Content
Too much duplicate content on a website can make the search engines give up on you. If multiple URLs on your website return the exact same content, the search engines count it as duplicate content. Keep your preferred page and 301-redirect the rest to it.
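One rough way to spot the duplicate content described above is to hash the text of each URL and group the URLs that share a hash. The sketch below makes that assumption explicit: `find_duplicates` and the sample pages are made up for illustration, and exact-match hashing only catches identical text, not the near-duplicates a search engine may also detect.

```python
import hashlib
from collections import defaultdict


def find_duplicates(pages):
    """Group URLs whose text is identical after whitespace and case cleanup."""
    groups = defaultdict(list)
    for url, body in pages.items():
        normalized = " ".join(body.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Keep only the groups where more than one URL serves the same text.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical sample pages: the www and non-www home pages serve the same text.
PAGES = {
    "http://www.mysite.com/": "Welcome to my site",
    "http://mysite.com/": "Welcome  to my site\n",
    "http://mysite.com/about": "About us",
}

duplicates = find_duplicates(PAGES)
print(duplicates)  # [['http://mysite.com/', 'http://www.mysite.com/']]
```

Each group it reports is a candidate for keeping one preferred URL and redirecting the others.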
How to get your site indexed by Google
Are you worried, or should I say frustrated, that your website, some of your internal pages or some of your blog posts are not getting indexed by Google? Worry not, the following list is here to help:
- You would first want to see whether your site has been crawled or not. Set up a Google Webmaster Tools account if you do not have one already. Once your account is set up, make sure you have submitted your website to Search Console. After submitting your site, you can check its crawl status there.
- Create an XML Sitemap. A Sitemap is basically a document on your website’s server which lists each page (URL) of your website. It is a way to inform the search engines when new pages have been added and how often to check for changes on particular pages. For example, you would want the search engines to check your homepage regularly for new products, news and other related content. You can use RankWatch’s Sitemap generator tool to build your Sitemap and then submit it to Google in your Webmaster Tools account.
- Install Google Analytics. This is mainly for tracking purposes, but it may also have the added advantage of signaling to Google that a new website is on its way.
- Create and update your social media profiles from time to time. Surviving on the web without building a social media pillar for yourself is mere stupidity. As explained above, Google’s bots follow links to reach your site, and being active on social media can get you quick links. So make sure that you have a presence on all the important social media sites like Facebook, Twitter, LinkedIn, Quora etc.
- Make sure that your site loads fast. Optimize your pages for quick loading. A faster site has a plus point in getting indexed faster.
- Add fresh and unique content regularly to your website. Google gets very impressed if the quality of your content is good. In other words, write SEO-friendly content for your website.
- Make your navigation easier and simpler. Good navigation is not only important for the users, but holds equal importance for getting your site indexed faster. The clearer the navigation of your site, the more convenient it is for the search engines to crawl through every part of your website.
- Get good quality inbound links. As explained earlier, search engine bots or crawlers will find and index your site faster if sites that are themselves crawled and indexed often link to it. So, build your linking strategy to get links from quality, authoritative websites.
- Good internal link structure. Ensure good, strong internal linking between the pages of your website. Try to connect the internal pages with your homepage, but make sure that the number of links on any one page does not exceed 200.
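For the XML Sitemap item in the list above, here is a minimal sketch of what such a file contains and how it could be generated by hand. The `build_sitemap` function, the URLs and the dates are made up for illustration; the element names (`urlset`, `url`, `loc`, `lastmod`, `changefreq`) follow the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (loc, lastmod, changefreq) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc                # page address
        ET.SubElement(url, "lastmod").text = lastmod        # last change date
        ET.SubElement(url, "changefreq").text = changefreq  # hint on how often to recheck
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages: the homepage changes daily, the about page rarely.
sitemap_xml = build_sitemap([
    ("http://www.mysite.com/", "2024-01-15", "daily"),
    ("http://www.mysite.com/about", "2023-11-02", "yearly"),
])
print(sitemap_xml)
```

The resulting text would typically be saved as sitemap.xml on the server and then submitted in the Webmaster Tools account.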