Channel: Refactored i.T

Content Summaries with HtmlAgilityPack

Recently I had a client ask me to modify their Blog page to list only the first paragraph and the first image found in each of their posts instead of the full post on the default Blog Post List.

The site (http://deancambray.com.au) was built with Umbraco and utilises the Blog4Umbraco package along with our Extensions package.  The solution that we came up with was to use the already available HtmlAgilityPack that's included in the Umbraco distribution and write our own Razor Script to list the Blog Posts.

While the entire script also incorporates other features such as a custom numeric pager, I wanted to focus this time on just extracting certain elements out of each post and displaying them in a customised format.

The helper: RenderSummary

First things first: make sure you have a reference to the HtmlAgilityPack library near the top of the script:

@using HtmlAgilityPack;

Our helper looks like this:

@helper RenderSummary(dynamic node) {
    // Parse the post body into a DOM we can query with XPath.
    var doc = new HtmlDocument();
    doc.LoadHtml(node.BodyText.ToString());

    // First image that has a src attribute, if any.
    var imgNode = doc.DocumentNode.SelectSingleNode("//img[@src]");
    if (imgNode != null) {
        var url = imgNode.Attributes["src"].Value;
        string alt = string.Empty;
        string title = string.Empty;
        if (imgNode.Attributes["alt"] != null) { alt = imgNode.Attributes["alt"].Value; }
        if (imgNode.Attributes["title"] != null) { title = imgNode.Attributes["title"].Value; }
        <a href="@node.Url" title="Permalink to @node.Name"><img src="@url" alt="@alt" title="@title" /></a>
    }

    // First paragraph with visible text; SelectNodes returns null when nothing matches.
    var para = doc.DocumentNode.SelectNodes("//p");
    if (para != null) {
        foreach (var p in para) {
            // InnerText leaves entities encoded, so strip the &nbsp; that TinyMCE puts in empty paragraphs.
            if (string.IsNullOrWhiteSpace(p.InnerText.Replace("&nbsp;", ""))) { continue; }
            <p>@Html.Raw(p.InnerText)</p>
            break;
        }
    }
}

The helper renders the summary of a single post. It instantiates a new HtmlAgilityPack.HtmlDocument for each article and loads the article content with LoadHtml. Once that's done, we can use standard XPath queries to select the content we want: in this case, the first image in the article (if there is one) and the first non-empty paragraph.

SelectSingleNode and SelectNodes both return null when nothing matches, so checking the return value tells us whether an image or paragraph exists. That makes it very easy to conditionally display the image, or a placeholder if desired.
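As a minimal sketch of the placeholder idea, the null check can be wrapped in a small C# method that falls back to a default image. The class name SummaryImage, the method FirstImageUrl and the placeholder path are hypothetical and not part of the original script; only SelectSingleNode and GetAttributeValue come from HtmlAgilityPack.

```csharp
using HtmlAgilityPack;

public static class SummaryImage
{
    // Hypothetical placeholder path; point this at an image that exists on your site.
    public const string PlaceholderUrl = "/images/placeholder.png";

    // Returns the src of the first <img> in the HTML, or the placeholder if there is none.
    public static string FirstImageUrl(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        // SelectSingleNode returns null when the XPath query matches nothing.
        var img = doc.DocumentNode.SelectSingleNode("//img[@src]");
        if (img == null) { return PlaceholderUrl; }

        return img.GetAttributeValue("src", PlaceholderUrl);
    }
}
```

Assuming the class is compiled somewhere the script can see it (App_Code, for instance), the helper could then call SummaryImage.FirstImageUrl(node.BodyText.ToString()) and always have a URL to put in the src attribute.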

Once we have our image, it's a trivial matter to extract the source URL and other attributes using the Attributes collection on the returned HtmlNode and build our custom <img> tag.

Because it is very easy to insert paragraphs through TinyMCE that are empty, we want to find the first paragraph that actually has visible content in it. Otherwise our summary will look very empty indeed.  Once we have found the right paragraph, we can use the InnerText property to extract just the textual elements and ignore things like embedded images, lists and line breaks.  This results in a cleaner display and guarantees that the image (which may be found within the first paragraph) is not shown twice.

Note that you could also use the InnerHtml property instead if you wanted to include the extra formatting elements and other bits and pieces.
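To make the difference between the two properties concrete, here is a small sketch that pulls the paragraph search out into a plain C# method. ParagraphPicker and FirstNonEmptyParagraph are hypothetical names used only for illustration; the emptiness check assumes that empty TinyMCE paragraphs contain nothing but whitespace or the &nbsp; entity.

```csharp
using HtmlAgilityPack;

public static class ParagraphPicker
{
    // Returns the first <p> that has visible text, or null if there is none.
    public static HtmlNode FirstNonEmptyParagraph(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null) { return null; }

        foreach (var p in paragraphs)
        {
            // Skip paragraphs that hold only whitespace or &nbsp; entities.
            if (!string.IsNullOrWhiteSpace(p.InnerText.Replace("&nbsp;", ""))) { return p; }
        }
        return null;
    }
}
```

For the input <p>&nbsp;</p><p>Hello <strong>world</strong></p>, the method skips the first paragraph. InnerText on the result yields Hello world, while InnerHtml keeps the <strong> markup.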

Tying it together

Our BlogListPosts script is intended to replace the XSLT counterpart provided with Blog4Umbraco, so I've taken the basic structure of that script and tidied it up somewhat for clarity.  I've removed the part that filters and pages the list items based on category and/or archive folder.  I wanted to focus on just the Summary rendering, so here's a condensed version of the body of the script featuring the use of the RenderSummary helper defined above:

@{
    var list = Current.DescendantsOrSelf("BlogPost").Items.OrderByDescending(n => n.GetPropertyValue("PostDate"));

    foreach (dynamic post in list)
    {
        <div class="post">
            <h2 class="entry-title"><a href="@post.Url" title="Permalink to @post.Name">@post.Name</a></h2>
            <div class="entry-date"><small class="published">@post.PostDate.ToString("dddd, MMM dd, yyyy")</small></div>
            <div class="entry-content summary">
                @RenderSummary(post)
            </div>
            <div class="footer"><small class="more"><a href="@post.Url" title="Permalink to @post.Name">Read More...</a></small></div>
        </div>
    }
}

Find this post helpful?  Why don't you drop us a line in the comments below...

