How to build an RSS Parser in Python

RSS is a way to syndicate the content on your website, along with its metadata, to platforms such as Apple News, iTunes, Flipboard, and scores of similar services. Python has emerged as a popular programming language thanks to its simplicity and its support for big data and natural language processing. This tutorial shows you how to build an RSS parser in Python.

Installing FeedParser

Before writing any code, make sure you have Python 3.x installed on your machine and FeedParser installed in your Python environment. If you aren’t sure, run the following from your command prompt or Terminal.

pip install --upgrade feedparser

The above command installs FeedParser if it is missing, upgrades it if a newer version is available, or simply confirms that the latest version is already installed on your machine.
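If you would rather confirm the installation from Python itself, a small check along these lines can help. This is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_version is made up for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("feedparser")
if version is None:
    print("feedparser is not installed; run: pip install --upgrade feedparser")
else:
    print("feedparser version", version)
```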

FeedParser is one of the most popular libraries for parsing RSS XML Files.

Creating the RSS Aggregator class

Let’s now create a reusable class and call it WhizRssAggregator.


class WhizRssAggregator():
    feedurl = ""

    def __init__(self, paramrssurl):
        print(paramrssurl)
        self.feedurl = paramrssurl

In the above lines, we have created a class and a constructor so that any other piece of code can instantiate an object of this class with an RSS URL. We have also defined an attribute called feedurl that stores the actual link to the RSS file and makes it available to the other methods of the WhizRssAggregator class.
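One detail worth seeing in isolation: feedurl = "" at class level is only a default, and the assignment in the constructor gives each object its own value. The sketch below uses a stripped-down, hypothetical class called FeedHolder and made-up URLs purely to illustrate that behaviour.

```python
class FeedHolder:
    feedurl = ""  # class-level default, visible on the class itself

    def __init__(self, url):
        # Assigning through self creates an instance attribute
        # that shadows the class-level default for this object only.
        self.feedurl = url


first = FeedHolder("https://example.com/a.xml")
second = FeedHolder("https://example.com/b.xml")

print(first.feedurl)    # each object keeps its own URL
print(second.feedurl)
print(repr(FeedHolder.feedurl))  # the class default is untouched
```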

Save the WhizRssAggregator class in a file named whizrssaggregator.py.

Using FeedParser

It is now time to bring FeedParser into your class. At the top of your .py file, add the line import feedparser and save it. The file should now look like this:

import feedparser
class WhizRssAggregator():
    feedurl = ""

    def __init__(self, paramrssurl):
        print(paramrssurl)
        self.feedurl = paramrssurl

Parsing the RSS Feed

In the same .py file, we are now going to add a new method called parse. This method will fetch the feed and print the information contained in the RSS file. So let’s get started.

import feedparser

class WhizRssAggregator():
    feedurl = ""

    def __init__(self, paramrssurl):
        print(paramrssurl)
        self.feedurl = paramrssurl
        self.parse()

    def parse(self):
        thefeed = feedparser.parse(self.feedurl)

        print("Getting Feed Data")
        print(thefeed.feed.get("title", ""))
        print(thefeed.feed.get("link", ""))
        print(thefeed.feed.get("description", ""))
        print(thefeed.feed.get("published", ""))
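        # Use a plain default: thefeed.feed.published_parsed as the default
        # would raise AttributeError whenever the feed has no published date.
        print(thefeed.feed.get("published_parsed", ""))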

The new parse method is where the work happens. The cool thing about FeedParser is that all you have to do is call feedparser.parse() and supply the RSS URL stored in feedurl. FeedParser does the difficult job of loading the RSS XML, validating it, parsing it, and simplifying it to a few calls. Note that the constructor now calls self.parse(), so the feed is fetched and printed as soon as an object is created.
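To get a feel for the work FeedParser saves you, here is roughly what reading the same fields by hand looks like with Python's standard xml.etree.ElementTree module. This is only an illustrative sketch and not how FeedParser works internally; the sample feed and its URLs are made up, and real feeds vary far more than this one does.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document used only for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com/</link>
    <description>A made-up feed for illustration</description>
    <item>
      <guid>https://example.com/posts/1</guid>
      <title>First post</title>
      <link>https://example.com/posts/1</link>
      <description>Hello</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)
channel = root.find("channel")

# Header fields sit directly under <channel>; a missing tag yields the default.
feed_title = channel.findtext("title", default="")
feed_published = channel.findtext("pubDate", default="")

# Each <item> under <channel> is one entry.
items = [
    {
        "guid": item.findtext("guid", default=""),
        "title": item.findtext("title", default=""),
        "link": item.findtext("link", default=""),
    }
    for item in channel.findall("item")
]

print(feed_title)
print(items)
```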

If feedparser.parse() succeeds, you can start accessing the RSS information by calling thefeed.feed.get() with the name of an RSS XML tag. When the computer executes the line print(thefeed.feed.get("title", "")), it checks whether the title tag is defined in the RSS XML file and prints the information contained in that tag. If the tag is absent, it prints an empty line and moves on to link, description, and the others in sequence. As RSS has become more widely used, feeds have also become more varied in the tags they include, so I strongly recommend using thefeed.feed.get() as a way to keep your parser flexible.
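The feed object behaves like a dictionary, so the behaviour of get() can be tried out without any network access. The sketch below uses a plain dict as a stand-in for thefeed.feed; the values are made up.

```python
# A plain dict standing in for thefeed.feed (hypothetical sample values).
feed_header = {
    "title": "Example Feed",
    "link": "https://example.com/",
}

title = feed_header.get("title", "")
description = feed_header.get("description", "")  # key is missing, so "" is returned

print(repr(title))
print(repr(description))

# Direct indexing, by contrast, fails on a missing key.
try:
    feed_header["description"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed)
```

Keep in mind that the default passed to get() is evaluated before the call is made, so it should be a plain value such as "" or None rather than another lookup that might itself fail.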

Now, save the file whizrssaggregator.py.

Parsing the RSS Entries

Now that we have parsed the header information in the RSS XML, it is time to get the entries. Let’s continue to build our class.


import feedparser

class WhizRssAggregator():
    feedurl = ""

    def __init__(self, paramrssurl):
        print(paramrssurl)
        self.feedurl = paramrssurl
        self.parse()

    def parse(self):
        thefeed = feedparser.parse(self.feedurl)

        print("Getting Feed Data")
        print(thefeed.feed.get("title", ""))
        print(thefeed.feed.get("link", ""))
        print(thefeed.feed.get("description", ""))
        print(thefeed.feed.get("published", ""))
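        # Use a plain default: thefeed.feed.published_parsed as the default
        # would raise AttributeError whenever the feed has no published date.
        print(thefeed.feed.get("published_parsed", ""))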

        for thefeedentry in thefeed.entries:
            print("__________")
            print(thefeedentry.get("guid", ""))
            print(thefeedentry.get("title", ""))
            print(thefeedentry.get("link", ""))
            print(thefeedentry.get("description", ""))
            print("__________")
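Printing is fine for a first look, but you will usually want the entries as data. The sketch below turns one entry into a plain dictionary and converts published_parsed, which FeedParser provides as a time.struct_time, into a datetime. It assumes the parsed time is in UTC, as FeedParser's documentation describes; the function name entry_to_row and the sample entry are made up for illustration, with a plain dict standing in for a real entry.

```python
import calendar
import time
from datetime import datetime, timezone


def entry_to_row(entry):
    """Build a plain dict from a dict-like feed entry, tolerating missing fields."""
    parsed = entry.get("published_parsed")  # time.struct_time, or None if absent
    published = None
    if parsed is not None:
        # Assumption: the struct_time is in UTC, so calendar.timegm is the inverse.
        published = datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)
    return {
        "guid": entry.get("guid", ""),
        "title": entry.get("title", ""),
        "link": entry.get("link", ""),
        "published": published,
    }


# Hypothetical entry standing in for one item of thefeed.entries.
sample_entry = {
    "guid": "https://example.com/posts/1",
    "title": "First post",
    "link": "https://example.com/posts/1",
    "published_parsed": time.struct_time((2024, 1, 15, 9, 30, 0, 0, 15, 0)),
}

row = entry_to_row(sample_entry)
print(row["title"], row["published"].isoformat())
```

Inside the class, the same helper could be applied to every entry with a list comprehension over thefeed.entries instead of the print calls.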

Testing your code

To test the code, create a separate Python file called testrss.py and paste in these two lines, which import WhizRssAggregator and create an instance of it with an RSS URL to parse.

from whizrssaggregator import WhizRssAggregator
rssobject = WhizRssAggregator("http://rss.cnn.com/rss/cnn_topstories.rss")

To see the output of your code, type the following in the command prompt window or Terminal. Make sure the current folder of your terminal or command prompt window is the one that contains testrss.py.

python testrss.py

Conclusion

Using FeedParser and Python, you can parse an RSS file very simply. The information parsed from the RSS file can then be stored in a database or manipulated further to meet your needs. That is not all: FeedParser also lets you read and manage custom tags that may be introduced via namespaces. We will cover that in a subsequent tutorial.
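As one way to picture the database option mentioned above, the sketch below stores a couple of entries in SQLite using Python's built-in sqlite3 module. The table name, column names, and sample rows are all made up for illustration; in practice the rows would come from the entries FeedParser returns.

```python
import sqlite3

# Hypothetical sample rows standing in for parsed feed entries.
entries = [
    {"guid": "https://example.com/posts/1", "title": "First post",
     "link": "https://example.com/posts/1"},
    {"guid": "https://example.com/posts/2", "title": "Second post",
     "link": "https://example.com/posts/2"},
]

conn = sqlite3.connect(":memory:")  # in-memory database; pass a file path to persist
conn.execute(
    "CREATE TABLE IF NOT EXISTS entries (guid TEXT PRIMARY KEY, title TEXT, link TEXT)"
)

insert_sql = (
    "INSERT OR IGNORE INTO entries (guid, title, link) "
    "VALUES (:guid, :title, :link)"
)

# INSERT OR IGNORE skips rows whose guid is already stored, so inserting
# the same entries twice (as on a second run of the parser) adds nothing.
conn.executemany(insert_sql, entries)
conn.executemany(insert_sql, entries)
conn.commit()

stored = conn.execute("SELECT title FROM entries ORDER BY guid").fetchall()
print(stored)
conn.close()
```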

Until then, have fun coding in Python.

P.S. The files for this tutorial are available for download on GitHub.

Lastly, don’t forget to check out the next post in this series, on how to parse namespaces with the RSS parser.
