Screen Scraping with HtmlAgilityPack in Asp.Net C# and Vb.Net

A few days back I was on Quora, the question and answer website, submitting an answer to a question on .Net. The site has an amazing feature. When you type (or paste) the URL of a website in the answer box, it automatically fetches the page and extracts information (such as the title) from it. This process is often known as Screen Scraping. In this article I am going to show how to parse a web page conveniently using the HtmlAgilityPack library in Asp.Net.

HtmlAgilityPack Example

What is Screen Scraping?

Screen Scraping (similar to web scraping) is not a new concept. It’s the process of extracting information from a web page, such as the title, Meta descriptions and other details like a current stock price. Facebook has this feature too, and it delights its users by extracting not just the title and Meta descriptions, but images too.

What is HtmlAgilityPack?

HtmlAgilityPack is a library (.dll) for .Net that provides the methods and properties a developer needs to conveniently extract information from a web page. One thing I found very useful, and worth sharing, is that it can extract data even if the page has bad markup. In HTML, an element normally has an opening tag and a closing tag. If the closing tag is missing, the library will still extract the data of that element.
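
To see this tolerance for bad markup in action, here is a small C# sketch. It assumes the HtmlAgilityPack package is referenced in the project. The class name BadMarkupDemo, the helper ReadText and the sample HTML string are made up for illustration. In the sample, the <b> and <i> tags are never closed, yet the document still loads and can be queried.

```csharp
using HtmlAgilityPack;

public static class BadMarkupDemo
{
    // Hypothetical sample: the <b> and <i> tags are never closed.
    public const string BrokenHtml =
        "<html><head><title>Broken page</title></head>" +
        "<body><div><b>Bold text <i>and italic</div></body></html>";

    // Returns the text of the first node matching the XPath,
    // or null if nothing matches.
    public static string ReadText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses a string, no network call involved
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }
}
```

Calling BadMarkupDemo.ReadText(BadMarkupDemo.BrokenHtml, "//title") returns the page title even though the body markup is broken, and passing "//div" lets you inspect what the parser recovered from the unclosed tags.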

How Do I Install HtmlAgilityPack?

First, you need the HtmlAgilityPack.dll library file on your machine. As I said, the library has the methods and properties for data extraction. If you are using .Net 4 or later, you should have access to NuGet packages from within Visual Studio.

Create a new website using Visual Studio. Open Solution Explorer, right click the solution and choose the Manage NuGet Packages… option.

Manage Nuget Package

In the NuGet packages window, type HtmlAgilityPack in the search box and click the Install button.

Search HtmlAgility with Nuget

In case you could not install the library using NuGet, you can download it straight from the library's website. You will get a zip file. Extract it and copy the library (version HtmlAgilityPack.1.4.6 – Net20) into the bin folder of your project. If you don’t find bin, create the folder in the root directory of your project.

Well, that’s it. You have the library. Now, let’s code.

The Markup

In the markup section, I have added a few basic controls. I have a textbox control with AutoPostBack set to true, because I wish to extract data as soon as I enter the URL in the box. Therefore, I have added the ontextchanged event that will call a code behind procedure parseWeb.

To display the extracted data from the URL’s page, I have added a DIV element. I’ll extract the “title” of the page along with its “Meta” descriptions. Since I don’t know how many Meta tags the page has, I have not set a fixed height for the DIV.

<!DOCTYPE html>
<html>
<head>
    <title>Parse a Web Page with HtmlAgilityPack in Asp.Net</title>
</head>
<body>
    <form runat="server">
    <div class="page">
        <div class="main">

            <div style="line-height:18px; clear:both;">
                <div>
                    <asp:TextBox ID="tbEditor" 
                        placeholder="Enter the URL"
                        AutoPostBack="true" 
                        ontextchanged="parseWeb" 
                        Width="400px" Height="40px"
                        TextMode="MultiLine" 
                        runat="server">
                    </asp:TextBox>
                </div>
                <div id="divPageDescription" 
                    style="width:400px; padding:10px 0;" 
                    runat="server">
                </div>
            </div>
        </div>
    </div>
    </form>
</body>
</html>

Screen Scraping in C# using HtmlAgilityPack

Reference the HtmlAgilityPack library in your code by adding a using statement.

using HtmlAgilityPack;

Here’s the complete code.

using System;
using HtmlAgilityPack;

public partial class SiteMaster : System.Web.UI.MasterPage
{
    protected void parseWeb(object sender, EventArgs e)
    {
        // The textbox is multi-line, so trim stray whitespace and line breaks.
        string url = tbEditor.Text.Trim();
        if (string.IsNullOrEmpty(url)) return;

        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load(url);

        // First get the title of the web page.
        // SelectSingleNode returns null if the page has no <title>.
        HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
        divPageDescription.InnerHtml = "<b> Page title </b>: " +
            (titleNode != null ? titleNode.InnerText : "(no title)") + "<br />";

        // Now, parse <META> tag details. SelectNodes returns null if none exist.
        var metaTags = doc.DocumentNode.SelectNodes("//meta");

        if (metaTags != null)
        {
            foreach (var tag in metaTags)
            {
                // Only show tags that carry both a name and a content attribute.
                if (tag.Attributes["name"] != null && tag.Attributes["content"] != null)
                {
                    divPageDescription.InnerHtml += "<br /> " +
                        "<b> Page " + tag.Attributes["name"].Value + " </b>: " +
                        tag.Attributes["content"].Value + "<br />";
                }
            }
        }
    }
}

Vb.Net

Use the Imports statement to get access to the HtmlAgilityPack methods and properties.

Imports HtmlAgilityPack

Option Explicit On

Imports HtmlAgilityPack

Partial Class Site
    Inherits System.Web.UI.MasterPage

    Protected Sub parseWeb(sender As Object, e As EventArgs)

        ' The textbox is multi-line, so trim stray whitespace and line breaks.
        Dim url As String = tbEditor.Text.Trim()
        If String.IsNullOrEmpty(url) Then Return

        Dim web As New HtmlWeb()
        Dim doc As HtmlDocument = web.Load(url)

        ' First get the title of the web page.
        ' SelectSingleNode returns Nothing if the page has no <title>.
        Dim titleNode As HtmlNode = doc.DocumentNode.SelectSingleNode("//title")
        Dim title As String = If(titleNode Is Nothing, "(no title)", titleNode.InnerText)
        divPageDescription.InnerHtml = "<b> Page title </b>: " & title & "<br />"

        ' Now, parse <META> tag details. SelectNodes returns Nothing if none exist.
        Dim metaTags As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//meta")

        If metaTags IsNot Nothing Then
            For Each tag As HtmlNode In metaTags
                ' AndAlso short-circuits, so the second check runs only when needed.
                If tag.Attributes("name") IsNot Nothing AndAlso tag.Attributes("content") IsNot Nothing Then
                    divPageDescription.InnerHtml &= "<br /> " & _
                        "<b> Page " & tag.Attributes("name").Value & " </b>: " & tag.Attributes("content").Value & "<br />"
                End If
            Next
        End If

    End Sub
End Class
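
Listing every Meta tag is a good start, but link previews like the ones on Quora and Facebook usually need only a few specific values. The following C# sketch shows one way to pick them out with targeted XPath queries. The class name PagePreview, its method names and the choice of the description and og:image tags are my assumptions for illustration, not part of the example above. It also encodes the scraped text before it is written into a page, since the content comes from a site you do not control.

```csharp
using System.Net;
using HtmlAgilityPack;

public static class PagePreview
{
    // Returns the content attribute of the first <meta> tag whose given
    // attribute (for example name or property) equals key, or null if absent.
    public static string GetMeta(HtmlDocument doc, string attribute, string key)
    {
        string xpath = "//meta[@" + attribute + "='" + key + "']";
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.GetAttributeValue("content", null);
    }

    // Decodes entities in the scraped text, then encodes it for safe display.
    public static string Safe(string scraped)
    {
        return scraped == null ? "" : WebUtility.HtmlEncode(HtmlEntity.DeEntitize(scraped));
    }

    // Builds a small HTML summary (title, description, image URL) from an HTML string.
    public static string Build(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        string description = GetMeta(doc, "name", "description");
        string image = GetMeta(doc, "property", "og:image"); // Open Graph image, if any

        return "<b>" + Safe(title == null ? null : title.InnerText) + "</b><br />" +
               Safe(description) + "<br />" + Safe(image);
    }
}
```

To use it with a live page, pass the HtmlDocument returned by HtmlWeb.Load to GetMeta, and run each value through Safe before assigning it to the InnerHtml of the DIV.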

Hope you find this article and example useful for your project. You now have a library with which you can conveniently parse a web page and extract information for analysis and other purposes.

Thanks for reading.
