How to use HtmlAgilityPack in Asp.Net

You have probably come across the term web scraping. Web scraping, also known as screen scraping, is the process of extracting information from a web page, such as its title, meta descriptions and images. In this tutorial I'll show you how to scrape, or automatically extract information from, a web page using the HtmlAgilityPack library in Asp.Net, with examples in C# and VB. I'll also explain, with examples, what HtmlAgilityPack is used for.

What is HtmlAgilityPack?

HtmlAgilityPack (or Html Agility Pack) is a library (.dll) for .Net that provides the methods and properties developers (C# and VB developers) need to conveniently parse, extract data from, and manipulate HTML documents.

One thing I find very interesting about HtmlAgilityPack is that it can extract data even if the page has bad markup. What is bad markup? In HTML, most elements have an opening tag and a closing tag. If the closing tag is missing, HtmlAgilityPack will still extract the data of that particular tag or element.
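Here is a minimal sketch of that behaviour, written as a small console program rather than an Asp.Net page so it can run on its own. The class name BadMarkupDemo, the method FirstBoldText and the sample markup are made up for illustration. It parses a string with LoadHtml (no web request is made) in which the <b> tag is never closed, and then reads the text of that element.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical demo class, not part of the tutorial's project.
public static class BadMarkupDemo
{
    // Returns the text of the first <b> element, or null when there is none.
    public static string FirstBoldText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // Parse markup from a string; nothing is downloaded.

        HtmlNode bold = doc.DocumentNode.SelectSingleNode("//b");
        return bold == null ? null : bold.InnerText;
    }

    public static void Main()
    {
        // Bad markup: the <b> tag below has no closing </b>.
        string html = "<div><b>Hello</div>";

        Console.WriteLine(FirstBoldText(html));
    }
}
```

SelectSingleNode returns null when nothing matches, which is why the method checks for null before reading InnerText.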

HtmlAgilityPack is very easy to use. I'll show you how.

How to install HtmlAgilityPack?

First, let me show you how to install HtmlAgilityPack. It's a library that you need to add to your project.

There are two ways you can install the library.

1) Install HtmlAgilityPack using NuGet.

If you are using .Net 4 or later, you should have access to the NuGet package manager in Visual Studio.
Follow these steps.

a) Create a new website using Visual Studio

b) Open "Solution Explorer", right click the solution and click the Manage NuGet Packages… option.

c) In the NuGet packages window, type HtmlAgilityPack in the search box and click the Install button.

2) In case you don't have access to NuGet, or you could not install the library using the NuGet package, you can download the library directly from their website.

It will download a zip file. Extract the file and copy the library (the .dll) into the bin folder of your project. If you don't find a bin folder, create it in the root directory of your project.

Now, let's see some examples. All the examples have both C# and VB code.

Example 1: Get Metadata of a web page

In this example, we'll extract the metadata that is available in a web page. All you need is the URL, or address, of the web page.

Remember: The metadata in a web page provides information about the page, or HTML document. It is assigned using the <meta> tag.

C# Code
using System;
using System.Web;
using HtmlAgilityPack;

public partial class SiteMaster : System.Web.UI.MasterPage
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string url = "https://www.encodedna.com/google-chart/make-charts-using-json-data-dynamically.htm";

        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load(url); // Download and parse the web page.

        // Select every <meta> element. SelectNodes returns null when nothing matches.
        var metaTags = doc.DocumentNode.SelectNodes("//meta");

        if (metaTags != null)
        {
            foreach (var tag in metaTags)
            {
                // Skip <meta> tags that lack a name or a content attribute (charset, for example).
                if (tag.Attributes["name"] != null && tag.Attributes["content"] != null)
                {
                    // "div" is assumed to be a server-side element (runat="server") in the master page.
                    // Encode the scraped values before writing them out as HTML.
                    div.InnerHtml += "<br /> <b> Page " + HttpUtility.HtmlEncode(tag.Attributes["name"].Value) +
                        " </b>: " + HttpUtility.HtmlEncode(tag.Attributes["content"].Value) + "<br />";
                }
            }
        }
    }
}

Output:

[Image: metadata extracted using HtmlAgilityPack in Asp.Net]
VB Code (get metadata)
Option Explicit On
Imports System.Web
Imports HtmlAgilityPack

Partial Class Site
    Inherits System.Web.UI.MasterPage

    Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load
        Dim url As String = "https://www.encodedna.com/google-chart/make-charts-using-json-data-dynamically.htm"

        Try
            Dim web As New HtmlWeb()
            Dim doc As HtmlDocument = web.Load(url)     ' Download and parse the web page.

            ' Select every <meta> element. SelectNodes returns Nothing when nothing matches.
            Dim metaTags As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//meta")

            If metaTags IsNot Nothing Then
                For Each tag As HtmlNode In metaTags
                    ' Skip <meta> tags that lack a name or a content attribute.
                    If tag.Attributes("name") IsNot Nothing AndAlso tag.Attributes("content") IsNot Nothing Then
                        ' Encode the scraped values before writing them out as HTML.
                        divPageDescription.InnerHtml &= "<br /> <b> Page " & HttpUtility.HtmlEncode(tag.Attributes("name").Value) & _
                            " </b>: " & HttpUtility.HtmlEncode(tag.Attributes("content").Value) & "<br />"
                    End If
                Next
            End If
        Catch ex As Exception
            ' Show the failure instead of silently swallowing it.
            divPageDescription.InnerHtml = "Could not read the page: " & HttpUtility.HtmlEncode(ex.Message)
        End Try
    End Sub
End Class


Example 2: Get all Images with details on a web page

Web pages may or may not have images. The following example shows how to extract (parse) details about images on a web page.

C# Code
using System;
using System.Web;
using HtmlAgilityPack;

public partial class SiteMaster : System.Web.UI.MasterPage
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string url = "https://www.encodedna.com/google-chart/make-charts-using-json-data-dynamically.htm";

        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load(url); // Download and parse the web page.

        // Select every <img> element. SelectNodes returns null when nothing matches.
        var imgTags = doc.DocumentNode.SelectNodes("//img");

        if (imgTags != null)
        {
            foreach (var tag in imgTags)
            {
                // GetAttributeValue returns the fallback ("") when the attribute is missing,
                // so an <img> without a src or alt attribute does not throw an exception.
                string src = tag.GetAttributeValue("src", "");
                string alt = tag.GetAttributeValue("alt", "");

                if (src != "")
                {
                    // Encode the scraped values before writing them out as HTML.
                    div.InnerHtml += "<br /> <b>Image</b>: " + HttpUtility.HtmlEncode(src) +
                        " <br/> <b>Alt text</b>: " + HttpUtility.HtmlEncode(alt) + "<br />";
                }
            }
        }
    }
}
VB Code
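Option Explicit On
Imports System.Web
Imports HtmlAgilityPack

Partial Class Site
    Inherits System.Web.UI.MasterPage

    Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load
        Dim url As String = "https://www.encodedna.com/google-chart/make-charts-using-json-data-dynamically.htm"

        Try
            Dim web As New HtmlWeb()
            Dim doc As HtmlDocument = web.Load(url)     ' Download and parse the web page.

            ' Select every <img> element. SelectNodes returns Nothing when nothing matches.
            Dim imgTags As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//img")

            If imgTags IsNot Nothing Then
                For Each tag As HtmlNode In imgTags
                    ' GetAttributeValue returns the fallback ("") when the attribute is missing,
                    ' so an <img> without an alt attribute does not throw an exception.
                    Dim src As String = tag.GetAttributeValue("src", "")
                    Dim alt As String = tag.GetAttributeValue("alt", "")

                    If src <> "" Then
                        ' Encode the scraped values before writing them out as HTML.
                        div.InnerHtml &= "<br /> <b>Image</b>: " & HttpUtility.HtmlEncode(src) & _
                            "<br /> <b>Alt text</b>: " & HttpUtility.HtmlEncode(alt) & "<br />"
                    End If
                Next
            End If
        Catch ex As Exception
            ' Show the failure instead of silently swallowing it.
            div.InnerHtml = "Could not read the page: " & HttpUtility.HtmlEncode(ex.Message)
        End Try
    End Sub
End Class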

That's it.

I have shared two examples here in this tutorial, showing how to extract "metadata" and "image details" of a web page. You can however extract (or get) more information from a web page that you may find essential using HtmlAgilityPack library in Asp.Net.
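As one illustration of what else you could extract, here is a rough sketch that reads the page title and the links of a document. It is a standalone console program, not an Asp.Net page, and the class name PageInfoDemo, its two methods and the sample markup are all made up for this example. It uses LoadHtml on an inline string so that it runs without a web request; in a real page you would load the document with HtmlWeb.Load(url) as in the examples above.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical demo class, not part of the tutorial's project.
public static class PageInfoDemo
{
    // Returns the text of the <title> element, or "" when the page has none.
    public static string GetTitle(HtmlDocument doc)
    {
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        return title == null ? "" : title.InnerText.Trim();
    }

    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> GetLinks(HtmlDocument doc)
    {
        var links = new List<string>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");

        if (anchors == null) return links; // No matching <a> elements.

        foreach (HtmlNode a in anchors)
        {
            links.Add(a.GetAttributeValue("href", ""));
        }
        return links;
    }

    public static void Main()
    {
        // Sample markup (made up). The second <a> has no href and is skipped by the XPath.
        string html = "<html><head><title>Sample page</title></head><body>" +
                      "<a href=\"/one.htm\">One</a><a>No href</a><a href=\"/two.htm\">Two</a>" +
                      "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        Console.WriteLine(GetTitle(doc));
        foreach (string href in GetLinks(doc))
        {
            Console.WriteLine(href);
        }
    }
}
```

The XPath //a[@href] keeps only the anchors that actually carry an href attribute, so there is no need for a separate attribute check inside the loop.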

Hope you find this information useful.

Happy coding.
