Web Crawler C# Tutorial: Example Capturing Big Data

Information is dumb. It has no brains. It has no advice. It is a massive, static digital paperweight. Whether data sits on the web, on your USB drive, or in your pocket, it is inanimate and non-living. Big data and small data will not make decisions.

Information is also astoundingly and dangerously beautiful. The web content indexed by Google is estimated at around 100,000,000 gigabytes of data. If each byte of data were the width of a single hair, it would make a line roughly from Earth to Saturn! Most people would agree with me when I say that just about everything humankind has ever learned is available online. Search engines respond with billions of pages for a single keyword. I would not be surprised if there were information available on the internet about every American born since 1970.

Times are tough. If a person needs to find information about anything, they have to dig into their pocket, unlock their smartphone, use two thumbs, and type the desired series of characters. Within about 15 seconds they have so many responses that they will probably have a definitive answer in another 30. Well, what if you could capture entire web domains in the same time scope?
That’s exactly what web crawlers manage to do.

The aim of this tutorial is to set some achievable goals, challenge ourselves, and test the code as we complete each challenge.

Web crawlers come in different types geared toward different objectives. The main difference with this crawler is that we will not be clicking through links; it only pulls data from the pages we point it at. In this project/article we will be creating a custom web crawler based on particular specifications.

The custom crawler Challenges are:
Challenge 1:
1. Be written in C#
2. Connect to the web as a desktop console application
3. Render dynamic content (javascript, php, etc.) that might contain text
Challenge 2:
1. Output to an html or txt file
Challenge 3:
1. Query Google for a list of URLs related to our input keyword (using the Google Custom Search API)
2. Crawl every URL page listed by Google
3. Scrape only the text content
Challenge 4:
1. Use Custom Search Engine in the Crawler
Challenge 5:
1. Scrape only the text content
2. Avoid crawler traps
3. Avoid blacklisted and unsafe web domains

Big Data Dangers

With the hundreds of pages the spider will be crawling, it is important that we avoid blacklisted web domains. We also don’t want to scrape NSFW content, so we can add those filters to our Google API request. Moreover, the copious amount of data we’ll be crawling through will require time and processing power, so simple and efficient code will be our objective.

Unrelated text and code that the crawler picks up could become cumbersome later in the data mining process. It would be advantageous to use elimination algorithms, and that kind of proactive processing would also reduce what we have to store. But the aim of this tutorial is simplicity for developing an understanding, so we will process the content after we scrape it, as sketched below; I will come back to this later in the tutorial.
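As a taste of what that later processing could look like, here is a minimal sketch (not part of the tutorial’s code) of a post-scrape clean-up pass that drops very short lines from a scraped text file; the class name, method name, and 30-character threshold are arbitrary assumptions for illustration only.

using System.IO;
using System.Linq;

public static class PostProcessor
{
    // Hypothetical elimination step: keep only lines long enough to look like real sentences.
    public static void FilterShortLines(string inputPath, string outputPath)
    {
        var keptLines = File.ReadLines(inputPath)
                            .Where(line => line.Trim().Length >= 30);
        File.WriteAllLines(outputPath, keptLines);
    }
}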

Preliminary Setup

Most people will not prefer the console method I am using to compile C#, although it is efficient. To be clear, everything I am doing is absolutely transferable to the Visual Studio development environment; the C# compiler used in this tutorial is the same .NET Framework compiler (csc.exe) that Visual Studio uses. Later in the tutorial we will switch to Visual Studio, but for now stick with me.

Quick Setup:
Debug Mode:
Create a txt file by right clicking in a folder > choosing new > clicking text document

C#-Tutorial: Create-New-Text-File

Rename the file to debug.bat

Open debug.bat in your favorite text editor

Enter the following text into the document:

cmd /K C:\Windows\Microsoft.NET\Framework\v4.0.30319\csc.exe /debug:full *.cs
exit 0

Save your document

With debug.bat in a folder containing your C# source files, you can double-click it and a debug cmd window will open, alerting you to any C# errors. It must be in the same folder as the C# files.

Result:

C#-Tutorial: Debug-Bat-Response

Compile Mode:
Create a txt file by right clicking in a folder > choosing new > clicking text document like before

Rename the file to compile.bat

Open compile.bat in your favorite text editor

Enter the following text into the document:

cmd /K C:\Windows\Microsoft.NET\Framework\v4.0.30319\csc.exe /out:crawlertest1.exe *.cs
exit 0

Save your document

With compile.bat in a folder containing your C# source files, you can double-click it and a compile cmd window will open, producing crawlertest1.exe in that folder. It must be in the same folder as the C# files.

That’s all simple enough. You are now ready to start the C# tutorial.

Challenge 1: Connect to the internet

My initial aim is to connect to engineerverse.com from my console application and write the rendered HTML to the console as verification of online access. The code below will scrape all content from the engineerverse.com homepage and print it in the console window.

using System.IO;
using System.Net;
using System;
using System.Text;

public class Crawler
{
    static void Main()
    {
        //Create web request from URL array
        WebRequest request = WebRequest.Create("http://engineerverse.com/");
        // If required by the server, set the credentials.
        request.Credentials = CredentialCache.DefaultCredentials;
        // Get the response.
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        //Display Status
        Console.WriteLine(response.StatusDescription);
        // Get the stream containing content returned by the server.
        Stream dataStream = response.GetResponseStream();
        // Open the stream using a StreamReader for easy access.
        StreamReader reader = new StreamReader(dataStream);
        // Read the content.
        string responseFromServer = reader.ReadToEnd();
        // Display the content.
        Console.WriteLine(responseFromServer);
        // Cleanup the streams and the response.
        reader.Close();
        dataStream.Close();
        response.Close();
        Console.ReadKey();
    }
}

Response:

C# Tutorial: Crawler App Response

Goals Completed:
Be written in C#
Connect to the web as a desktop console application
Render dynamic content (javascript, php, etc.) that might contain text

Challenge 2: Scrape Data and Store it Locally

We can always create a .txt document to house all the scraped content, so we will use the C# System.IO.File.WriteAllText method.

The code has been adjusted as seen below:

using System.IO;
using System.Net;
using System;
using System.Text;

public class Crawler
{
    static void Main()
    {
        //query top X number of urls for keyword
        //Create URL array
        //Create web request from URL array
        WebRequest request = WebRequest.Create("http://engineerverse.com/");
        // If required by the server, set the credentials.
        request.Credentials = CredentialCache.DefaultCredentials;
        // Get the response.
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        //Display Status
        Console.WriteLine(response.StatusDescription);
        // Get the stream containing content returned by the server.
        Stream dataStream = response.GetResponseStream();
        // Open the stream using a StreamReader for easy access.
        StreamReader reader = new StreamReader(dataStream);
        // Read the content.
        string responseFromServer = reader.ReadToEnd();
        // Display the content.
        //Console.WriteLine (responseFromServer);
        //write response to textfile
        System.IO.File.WriteAllText(@"C:\YourPath\Crawler Module\Test2\WriteText.txt", responseFromServer);
        // Cleanup the streams and the response.
        reader.Close();
        dataStream.Close();
        response.Close();
        Console.WriteLine("ready");
        Console.ReadKey();
    }
}

All the data is now stored locally, however, it’s barely readable, so I adjusted the code to create an html file as seen below:

Change

System.IO.File.WriteAllText(@"C:\YourPath\Crawler Module\Test2\WriteText.txt", responseFromServer);

To

System.IO.File.WriteAllText(@"C:\YourPath\Crawler Module\Test2\WriteText.html", responseFromServer);

As long as you are online, you should be able to double-click the html file and view the page in your browser.

Goals Completed:

Output to an html or txt file

Challenge 3: Creating the Google Custom Search Engine API

The criteria I set above require that I use the Google Custom Search API to GET the top 100 URLs based on a keyword input. Follow the next few steps to get your CSE working. This will be the first time in this tutorial we use Visual Studio. Downloading the NuGet packages manually is too tedious for this simple piece of code, so with Visual Studio the process is quick and easy via the NuGet package manager.
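If you prefer installing from the NuGet Package Manager Console, the package that provided the Google.Apis.Customsearch.v1 namespaces used below was, at the time of writing, named after the namespace itself; check NuGet for the current package name first, since Google occasionally renames these client libraries.

Install-Package Google.Apis.Customsearch.v1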

Open Visual Studio > create a new C# console project

C#-Tutorial-Create-New-C#-Console-App-Visual-Studios

Add New Item to Project

C#-Tutorial-Add-New-C#-Class-Item

C#-Tutorial-Add-New-C#-Class-Select-Class

Overwrite all the class text with the following code:

using Google.Apis.Customsearch.v1;
using Google.Apis.Customsearch.v1.Data;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
public class CrawlerAPIs
{
    public static string[] CSE(string keyword)
    {
        const string apiKey = "YourKey";
        const string searchEngineId = "YourID";
        var query = keyword;
        string[] newUrl = new string[100];
        var count = 0;
        //const string query = "engineerverse";
        CustomsearchService customSearchService = new CustomsearchService(new Google.Apis.Services.BaseClientService.Initializer() { ApiKey = apiKey });
        Google.Apis.Customsearch.v1.CseResource.ListRequest listRequest = customSearchService.Cse.List(query);
        listRequest.Cx = searchEngineId;
        //listRequest.Start = 1;
        for (int i = 1; i <= 100; i += 10)
        {
            listRequest.Start = i;
            Search search = listRequest.Execute();
            foreach (var item in search.Items)
            {
                string holding = item.Link;
                newUrl[count] = holding;
                count += 1;
                Console.WriteLine("Title : " + item.Title + Environment.NewLine + "Link : " + item.Link + Environment.NewLine + Environment.NewLine);
            }
        }
        Console.ReadLine();
        return newUrl;
    }
}
Reference
Some of the code used in this section of the tutorial is adapted from Tri Nguyen’s article on setting up the Custom Search API, “C# – How to use Google Custom Search API?” – http://hintdesk.com/c-how-to-use-google-custom-search-api/

Library management at this point gets quite complicated. To ease the pain and manage all the libraries automatically, we are going to continue using Visual Studio. Some readers may feel disappointed by this decision and unsatisfied, but trust me, managing the libraries and assemblies for debugging and compiling with just a .bat file becomes tedious.

The free custom search engine that Google lets us use only gives us 10 results per query, with a maximum of 100 queries per day, so we are limited to 1,000 results total. We only want 100 for this test, so we can cap the for loop at 100.

Notice the for loop that was added, as seen in the following fragment:

//listRequest.Start = 1;
for (int i = 1; i <= 100; i += 10)
{
    listRequest.Start = i;

The for loop iterates in steps of 10 and uses the listRequest.Start parameter to change the starting result index for each query (Start = 1, 11, 21, …, 91). This allows us to send multiple queries, each returning the next page of 10 results, for 100 results in total.

If you want to test at this point you will need to edit the code, so we will delay the next test until Challenge 4.

Goals Completed:

Query Google for a list of URLs related to our input keyword (using the Google Custom Search API)

Crawl every URL listed by Google, up to the maximum

Scrape only the text content

Render dynamic content (javascript, php, etc.) that might contain text

Challenge 4: Use Custom Search Engine in the Crawler

Once you have the API set up, the CrawlerAPIs class we just created will be utilized by our crawler to tell it which URLs to crawl: the Custom Search Engine (CSE) method returns the array of URLs that the crawler will visit.

Now let’s incorporate our Crawler class into a newly created main method from which we can call all our classes.

Add a new C# Class Item just as we have previously in this tutorial

Feel free to name it Main.cs

Next add the following code to the Main.cs file:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class MainProgram
    {
        static void Main()
        {
            // CSE() takes the search keyword as its argument; hard-code one for this first test.
            CrawlerAPIs.CSE("engineerverse");
        }
    }
}

Notice the only thing we did was call the CSE method. Now we need to add the crawler class from Challenge 1 to the project. Add the following code to the new C# item you create; notice that there are two methods with the same name but different parameters. This is called method overloading, and it lets the same call work on either a single URL or an array of URLs.

using System.IO;
using System.Net;
using System;
using System.Text;
public class Crawler
{
public static void Crawl(string site)
{
//Create web request from URL array
WebRequest request = WebRequest.Create(site);
// If required by the server, set the credentials.
request.Credentials = CredentialCache.DefaultCredentials;
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
//Display Status
Console.WriteLine(response.StatusDescription);
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Display the content.
//Console.WriteLine (responseFromServer);
//write response to textfile
System.IO.File.WriteAllText(@"C:\YourPath\Crawler Module\Test2\WriteText.txt", responseFromServer);

// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
Console.WriteLine("ready");
Console.ReadKey();
}

public static void Crawl(string[] site)
{
for (int i = 0; i < site.Length; i++)
{
//Create web request from URL array
WebRequest request = WebRequest.Create(site[i]);
// If required by the server, set the credentials.
request.Credentials = CredentialCache.DefaultCredentials;
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
//Display Status
Console.WriteLine(response.StatusDescription);
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Display the content.
//Console.WriteLine (responseFromServer);
//write response to textfile
Console.WriteLine("saving file for site ... " + site[i]);
System.IO.File.WriteAllText(@"C:\Users\Devin\Dropbox\Devins\Websites\EngineerVerse\Crawler\Outputs\WriteText(" + i + ").txt", responseFromServer);

// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
Console.WriteLine("ready");
//Console.ReadKey();
}
}

}
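To see how the two overloads get picked, here is a quick hypothetical call site (the URLs are placeholders, not part of the project files):

//Single page: the string overload is chosen and one file is written.
Crawler.Crawl("http://engineerverse.com/");

//List of pages: the string[] overload is chosen and the loop crawls every entry.
string[] urls = { "http://engineerverse.com/", "http://example.com/" };
Crawler.Crawl(urls);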

Next change the Main.cs file to the following:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
class MainProgram
{
static void Main()
{
var keyword = Console.ReadLine();
Crawler.Crawl(CrawlerAPIs.CSE(keyword));
}
}
}

Now you are able to compile and run; however, you will most likely run into errors on some of the URLs.

Change the Crawler.cs file to the following, which skips a URL if an issue is detected:

using System.IO;
using System.Net;
using System;
using System.Text;
public class Crawler
{
public static void Crawl(string site)
{
try
{
//Create web request from URL array
WebRequest request = WebRequest.Create(site);
// If required by the server, set the credentials.
request.Credentials = CredentialCache.DefaultCredentials;
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
//Display Status
Console.WriteLine(response.StatusDescription);
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Display the content.
//Console.WriteLine (responseFromServer);
//write response to textfile
System.IO.File.WriteAllText(@"C:\Users\Devin\Dropbox\Devins\Websites\EngineerVerse\Crawler\Outputs\WriteText.txt", responseFromServer);

// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
Console.WriteLine("ready");
Console.ReadKey();
}
catch (WebException webExcp)
{
// If you reach this point, an exception has been caught.
Console.WriteLine("A WebException has been caught.");
// Write out the WebException message.
Console.WriteLine(webExcp.ToString());
// Get the WebException status code.
WebExceptionStatus status = webExcp.Status;
// If status is WebExceptionStatus.ProtocolError,
// there has been a protocol error and a WebResponse
// should exist. Display the protocol error.
if (status == WebExceptionStatus.ProtocolError)
{
Console.Write("The server returned protocol error ");
// Get HttpWebResponse so that you can check the HTTP status code.
HttpWebResponse httpResponse = (HttpWebResponse)webExcp.Response;
Console.WriteLine((int)httpResponse.StatusCode + " - "
+ httpResponse.StatusCode);
}

}
catch (Exception e)
{
// Code to catch other exceptions goes here.
Console.WriteLine(e);

}
}

public static void Crawl(string[] site)
{
for (int i = 0; i < site.Length; i++)
{
try
{
//Create web request from URL array
WebRequest request = WebRequest.Create(site[i]);
// If required by the server, set the credentials.
request.Credentials = CredentialCache.DefaultCredentials;
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

//Display Status
Console.WriteLine(response.StatusDescription);
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Display the content.
//Console.WriteLine (responseFromServer);
//write response to textfile
Console.WriteLine("saving file for site ... " + site[i]);
System.IO.File.WriteAllText(@"C:\Users\Devin\Dropbox\Devins\Websites\EngineerVerse\Crawler\Outputs\WriteText(" + i + ").txt", responseFromServer);

// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
Console.WriteLine("ready");
//Console.ReadKey();
}
catch (WebException webExcp)
{
// If you reach this point, an exception has been caught.
Console.WriteLine("A WebException has been caught.");
// Write out the WebException message.
Console.WriteLine(webExcp.ToString());
// Get the WebException status code.
WebExceptionStatus status = webExcp.Status;
// If status is WebExceptionStatus.ProtocolError,
// there has been a protocol error and a WebResponse
// should exist. Display the protocol error.
if (status == WebExceptionStatus.ProtocolError)
{
Console.Write("The server returned protocol error ");
// Get HttpWebResponse so that you can check the HTTP status code.
HttpWebResponse httpResponse = (HttpWebResponse)webExcp.Response;
Console.WriteLine((int)httpResponse.StatusCode + " - "
+ httpResponse.StatusCode);
}
continue;
}
catch (Exception e)
{
// Code to catch other exceptions goes here.
Console.WriteLine(e);
continue;
}
}
}

}

We now have a completely functional crawler that accepts the URLs returned by the Google search engine. All the scraped pages will be saved into the same directory as .txt files.

Goals Completed:

Use Custom Search Engine in the Crawler

Challenge 5

Scrape only the text content
Avoid crawler traps
Avoid blacklisted and unsafe web domains

Technically, the first goal was already handled, because what the crawler saves is just that: text. To make it more of a true accomplishment, however, we will simply remove all HTML tags from the data stream before we save it to a .txt file. The second goal has been mostly completed via the try…catch handling of the WebRequest exceptions that we built in Challenge 4.

To strip the html tags from the text add the following code to a new class in the Visual Studio project.

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
/// <summary>
/// Remove HTML from string with Regex.
/// </summary>
public static string StripTagsRegex(string source)
{
return Regex.Replace(source, "<.*?>", string.Empty);
}

/// <summary>
/// Compiled regular expression for performance.
/// </summary>
static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

/// <summary>
/// Remove HTML from string with compiled Regex.
/// </summary>
public static string StripTagsRegexCompiled(string source)
{
return _htmlRegex.Replace(source, string.Empty);
}

/// <summary>
/// Remove HTML tags from string using char array.
/// </summary>
public static string StripTagsCharArray(string source)
{
char[] array = new char[source.Length];
int arrayIndex = 0;
bool inside = false;

for (int i = 0; i < source.Length; i++)
{
char let = source[i];
if (let == '<')
{
inside = true;
continue;
}
if (let == '>')
{
inside = false;
continue;
}
if (!inside)
{
array[arrayIndex] = let;
arrayIndex++;
}
}
return new string(array, 0, arrayIndex);
}
/// <summary>
/// Remove script blocks (including the code between the tags) from a string.
/// </summary>
public static string RemoveJS(string source)
{
// Strip whole <script>...</script> blocks so inline JavaScript does not
// survive into the plain-text output.
return Regex.Replace(source, @"<script[^>]*>.*?</script>", string.Empty,
RegexOptions.Singleline | RegexOptions.IgnoreCase);
}
}
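As a quick sanity check of the stripping behavior, you can drop something like the following into a Main method (the sample HTML string is made up for illustration):

string html = "<p>Hello <b>world</b></p><script>var x = 1;</script>";
Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
// prints: Hello worldvar x = 1;   (tags are removed, but the script code survives)
Console.WriteLine(HtmlRemoval.StripTagsCharArray(HtmlRemoval.RemoveJS(html)));
// prints: Hello world   (the script block is removed first, then the remaining tags)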

Change the Crawler.cs file to the following:

using System.IO;
using System.Net;
using System;
using System.Text;
public class Crawler
{
public static void Crawl(string site)
{
try
{
//Create web request from URL array
WebRequest request = WebRequest.Create(site);
// If required by the server, set the credentials.
request.Credentials = CredentialCache.DefaultCredentials;
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
//Display Status
Console.WriteLine(response.StatusDescription);
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Display the content.
//Console.WriteLine (responseFromServer);
//write response to textfile
System.IO.File.WriteAllText(@"C:\Users\Devin\Dropbox\Devins\Websites\EngineerVerse\Crawler\Outputs\WriteText.txt", responseFromServer);

// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
Console.WriteLine("ready");
Console.ReadKey();
}
catch (WebException webExcp)
{
// If you reach this point, an exception has been caught.
Console.WriteLine("A WebException has been caught.");
// Write out the WebException message.
Console.WriteLine(webExcp.ToString());
// Get the WebException status code.
WebExceptionStatus status = webExcp.Status;
// If status is WebExceptionStatus.ProtocolError,
// there has been a protocol error and a WebResponse
// should exist. Display the protocol error.
if (status == WebExceptionStatus.ProtocolError)
{
Console.Write("The server returned protocol error ");
// Get HttpWebResponse so that you can check the HTTP status code.
HttpWebResponse httpResponse = (HttpWebResponse)webExcp.Response;
Console.WriteLine((int)httpResponse.StatusCode + " - "
+ httpResponse.StatusCode);
}

}
catch (Exception e)
{
// Code to catch other exceptions goes here.
Console.WriteLine(e);

}
}

public static void Crawl(string[] site)
{
for (int i = 0; i < site.Length; i++)
{
try
{
//Create web request from URL array
WebRequest request = WebRequest.Create(site[i]);
// If required by the server, set the credentials.
request.Credentials = CredentialCache.DefaultCredentials;
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

//Display Status
Console.WriteLine(response.StatusDescription);
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
string unscriptedresponse = HtmlRemoval.StripTagsCharArray(responseFromServer);
// Display the content.
//Console.WriteLine (responseFromServer);
//write response to textfile
Console.WriteLine("saving file for site ... " + site[i]);
System.IO.File.WriteAllText(@"C:\Users\Devin\Dropbox\Devins\Websites\EngineerVerse\Crawler\Outputs\WriteText(" + i + ").txt", unscriptedresponse);

// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
Console.WriteLine("ready");
//Console.ReadKey();
}
catch (WebException webExcp)
{
// If you reach this point, an exception has been caught.
Console.WriteLine("A WebException has been caught.");
// Write out the WebException message.
Console.WriteLine(webExcp.ToString());
// Get the WebException status code.
WebExceptionStatus status = webExcp.Status;
// If status is WebExceptionStatus.ProtocolError,
// there has been a protocol error and a WebResponse
// should exist. Display the protocol error.
if (status == WebExceptionStatus.ProtocolError)
{
Console.Write("The server returned protocol error ");
// Get HttpWebResponse so that you can check the HTTP status code.
HttpWebResponse httpResponse = (HttpWebResponse)webExcp.Response;
Console.WriteLine((int)httpResponse.StatusCode + " - "
+ httpResponse.StatusCode);
}
continue;
}
catch (Exception e)
{
// Code to catch other exceptions goes here.
Console.WriteLine(e);
continue;
}
}
}

}

Now debug, compile and test the project. You will find that the files saved in the output directory no longer contain HTML tags.

Our final quest is to make sure we are only requesting safe, non-adult content.

All we need to do is implement the google query parameter for safe search.

Add the following to the CSE() function after “listRequest.Start = i;”:
listRequest.Safe = CseResource.ListRequest.SafeEnum.Medium;
Similar to the image below:

C# Tutorial-safe-search-parameter
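For reference, the inside of the for loop in CSE() should now read something like:

for (int i = 1; i <= 100; i += 10)
{
    listRequest.Start = i;
    listRequest.Safe = CseResource.ListRequest.SafeEnum.Medium;
    Search search = listRequest.Execute();
    //foreach over search.Items as before
}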

Lastly, debug, compile, and run.

Goals Completed:

Scrape only the text content
Avoid crawler traps
Avoid blacklisted and unsafe web domains

Conclusion

You have built the basic components of a crawler from scratch. We created a basic crawler that downloads the page at any URL. In a supporting role, we created a Google Custom Search Engine using the Google API. We fed the top 100 results from a keyword search into our crawler to capture the most relevant content for the given keyword. All results were saved without their HTML tags, in separate files, for later use. Finally, we kept the process safe and reusable by separating it into classes.

The fact still remains that data, big or small, is not intelligent by itself, but when used properly it can be a rare and valuable asset.

My aim with this article was to gain my own experience with the topic, while offering valuable insight to the readers. If you feel you have benefited from this, or have used bits, pieces, or its entirety in a project of your own, please feel free to share it with us below.

About the Author
Devin Bates

Electrical Engineer, enjoys cookies, invents things in his mind, and thinks the Iguanas are to blame for the weekday.