How to Parse Namespaces Using the Python RSS Parser

In the last tutorial, we learned how to build a Python-based RSS parser. Building on that tutorial, let’s now look at parsing namespaces and namespace-specific elements.

Getting Ready

For the purpose of this tutorial, we will use the WhizRssAggregator.py file that we created in the previous tutorial.

Parsing Namespaces

Let’s extend the RSS Aggregator file below.

import feedparser

class WhizRssAggregator():
    feedurl = ""

    def __init__(self, paramrssurl):
        print(paramrssurl)
        self.feedurl = paramrssurl
        self.parse()

    def parse(self):
        thefeed = feedparser.parse(self.feedurl)

        print("Getting Feed Data")
        print(thefeed.feed.get("title", ""))
        print(thefeed.feed.get("link", ""))
        print(thefeed.feed.get("description", ""))
        print(thefeed.feed.get("published", ""))
        # Fall back to an empty string; reading the attribute directly
        # raises AttributeError when the feed has no published date
        print(thefeed.feed.get("published_parsed", ""))

        for thefeedentry in thefeed.entries:
            print("__________")
            print(thefeedentry.get("guid", ""))
            print(thefeedentry.get("title", ""))
            print(thefeedentry.get("link", ""))
            print(thefeedentry.get("description", ""))
            print("__________")

            # Parsing Namespaces
            for thefeednamespace in thefeed.namespaces:
                if thefeednamespace == "media":
                    # Parse the Yahoo Media RSS elements
                    print("Media")
                    allmediacontent = thefeedentry.get("media_content", [])
                    for themediacontent in allmediacontent:
                        # Attributes are optional, so use get() to avoid a KeyError
                        print(themediacontent.get("url", ""))
                        print(themediacontent.get("height", ""))
                        print(themediacontent.get("width", ""))

In the code that follows the Parsing Namespaces comment, you use another powerful capability of feedparser. By referencing thefeed.namespaces, you can retrieve a dictionary of the namespaces declared in the RSS XML document, keyed by prefix with the namespace URI as the value. Iterating over it yields each prefix in turn. In the example above, we check whether the “media” namespace is declared in the RSS XML document.

The media namespace uses a series of tags to define its content. Using feedparser, you can access a tag defined within a namespace by referencing it as namespace_tagname, so the content tag of the media namespace becomes media_content.

In this example, since we are referencing the media:content tags defined within the namespace, you can use the get() function with the “media_content” key. This returns a list with one item for each media:content tag in the entry. You can then iterate over the list and print each attribute. Here, the url value printed for each item is the link to the media content, which is an attribute of the content tag.
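The same lookup can be pulled into a small helper, sketched below. The media_items function and the sample_entry dictionary are hypothetical; the helper assumes its argument is a feedparser entry, or any mapping shaped like one, whose media_content value is a list of attribute dictionaries.

```python
def media_items(entry):
    """Return (url, width, height) tuples, one per media:content tag.

    `entry` is assumed to be a feedparser entry or a mapping shaped like one.
    """
    items = []
    for content in entry.get("media_content", []):
        # Every attribute is optional, so fall back to an empty string
        items.append((
            content.get("url", ""),
            content.get("width", ""),
            content.get("height", ""),
        ))
    return items


if __name__ == "__main__":
    # Hypothetical stand-in for one element of thefeed.entries
    sample_entry = {
        "title": "First item",
        "media_content": [
            {"url": "http://example.com/first.jpg", "width": "640", "height": "480"},
        ],
    }
    for url, width, height in media_items(sample_entry):
        print(url, width, height)
```

Inside the aggregator, this would be called as media_items(thefeedentry) for each entry in the loop. An entry without any media:content tags simply yields an empty list.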

Conclusion 

Many RSS documents use more than one namespace. By checking thefeed.namespaces and reading the matching namespace_tagname keys on each entry, you can support the popular namespaces with very little extra code.

I hope this was helpful. Have fun coding in Python.

P.S. Click here to download the files via GitHub.
