Screen Scraping Your Way Into RSS

Dennis Pallett

Introduction

RSS is one of the hottest technologies at the moment, and even big web publishers such as the New York Times are adopting it. However, many websites still do not have RSS feeds.

If you still want to follow those websites in your favourite aggregator, you need to create your own RSS feed for them. This can be done automatically with PHP, using a method called screen scraping. Screen scraping is usually frowned upon, as it's mostly used to steal content from other websites.

I personally believe that in this case, automatically generating an RSS feed, screen scraping is not a bad thing. Now, on to the code!

Getting the content

For this article, we'll use PHPit as an example, despite the fact that PHPit already has RSS feeds (http://www.phpit.net/syndication/).

We'll want to generate an RSS feed from the content listed on the front page (http://www.phpit.net). The first step in screen scraping is getting the complete page. In PHP this can be done very easily by using implode("", file("[the url here]")), if your web host allows it. If you can't use file(), you'll have to use a different method of getting the page, e.g. the cURL library (http://www.php.net/curl).
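The fallback described above can be sketched roughly as follows. This is only an illustration: fetch_page() is a hypothetical helper name that does not appear in the demo script, and the 15-second timeout is an arbitrary choice. It tries PHP's URL-aware file functions first and falls back to cURL when they are unavailable or fail.

```php
<?php
// Hypothetical helper (not part of the original script): fetch a page as one string.
// Returns the page body, or false when every method fails.
function fetch_page($url)
{
    // First choice: PHP's URL wrappers, if the web host allows them.
    if (ini_get('allow_url_fopen')) {
        $data = @file_get_contents($url);
        if ($data !== false) {
            return $data;
        }
    }

    // Fallback: the cURL extension, if it is installed.
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // assumed limit: give up after 15 seconds
        return curl_exec($ch);                          // string on success, false on failure
    }

    return false;
}
```

Calling $data = fetch_page("http://www.phpit.net/"); would then take the place of the implode()/file() call, with a check for false before parsing.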

Now that we have the page available, we can parse it for the content using some regular expressions. The key to screen scraping is looking for patterns that match the content, e.g. are all the content items wrapped in <div>s or something else? If you can successfully discover a pattern, then you can use preg_match_all() to get all the content items.

For PHPit, the pattern that matches the content is <div class="contentitem">[Content Here]</div>. You can verify this yourself by going to the main page of PHPit and viewing the source.
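To see how such a pattern behaves, here is a small sketch that runs preg_match_all() against a made-up string instead of the live page. The sample markup is an assumption for illustration only. The character class [^`] matches any character except a backtick (including newlines), and the lazy quantifier *? stops at the first closing </div>.

```php
<?php
// Made-up sample markup standing in for a downloaded page.
$data = '<div class="contentitem"><h3><a href="/one/">First</a></h3><p>Text one</p></div>'
      . '<div class="contentitem"><h3><a href="/two/">Second</a></h3><p>Text two</p></div>';

// $matches[0] holds the full matches, $matches[1] the captured inner HTML of each item.
preg_match_all('/<div class="contentitem">([^`]*?)<\/div>/', $data, $matches);

echo count($matches[0]), " items found\n";
print_r($matches[1]);
```

Note that a pattern like this stops at the first </div> it meets, so it would cut an item short if the item itself contained nested <div>s.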

Now that we have a match, we can get all the content items. The next step is to retrieve the individual pieces of information, i.e. url, title, author, text. This can be done by using some more regular expressions and str_replace() on each content item.
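As a rough illustration of that step, the sketch below pulls the fields out of a single made-up content item. first_group() is a hypothetical helper, not something from the demo script; it returns an empty string when a pattern does not match, which avoids notices about a missing $temp[1] if the site's markup changes.

```php
<?php
// Hypothetical helper: return the first captured group, trimmed, or '' when there is no match.
function first_group($pattern, $subject)
{
    if (preg_match($pattern, $subject, $temp)) {
        return trim($temp[1]);
    }
    return '';
}

// Made-up content item, shaped to fit the kind of patterns this article uses.
$item = '<h3><a href="/demo/">Demo title</a></h3>'
      . '<p>Some text <span class="byline">By Jane Doe</span></p>';

$url    = first_group('/<a href="([^`]*?)">/', $item);
$title  = first_group('/">([^`]*?)<\/a><\/h3>/', $item);
$author = first_group('/<span class="byline">By ([^`]*?)<\/span>/', $item);

echo "$title | $url | $author\n";
```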

By now we have the following code:

<?php

// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items: capture everything between the opening tag and the first </div>
preg_match_all("/<div class=\"contentitem\">([^`]*?)<\/div>/", $data, $matches);

Like I said, the next step is to retrieve the individual information, but first let's make a start on our feed by setting the appropriate header (text/xml) and printing the channel information, etc.

 
// Begin feed
header("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>
<rss version="2.0"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:admin="http://webns.net/mvcb/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel>
		<title>PHPit Latest Content</title>
		<description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
		<link>http://www.phpit.net</link>
		<language>en-us</language>


Now it's time to loop through the items and print their RSS XML. We loop through each item and get all the information we need, using more regular expressions and preg_match(). After that, the RSS for the item is printed.

<?php
// Loop through each content item
foreach ($matches[0] as $match) {
	// First, get title (the link text inside the <h3>)
	preg_match("/\">([^`]*?)<\/a><\/h3>/", $match, $temp);
	$title = $temp[1];
	$title = strip_tags($title);
	$title = trim($title);

	// Second, get url
	preg_match("/<a href=\"([^`]*?)\">/", $match, $temp);
	$url = $temp[1];
	$url = trim($url);

	// Third, get text (everything from the first <p> up to the byline)
	preg_match("/<p>([^`]*?)<span class=\"byline\">/", $match, $temp);
	$text = $temp[1];
	$text = trim($text);

	// Fourth, and finally, get author
	preg_match("/<span class=\"byline\">By ([^`]*?)<\/span>/", $match, $temp);
	$author = $temp[1];
	$author = trim($author);

	// Echo RSS XML
	echo "<item>\n";
		echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
		echo "\t\t\t<link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
		echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
		echo "\t\t\t<content:encoded><![CDATA[ \n";
		echo $text . "\n";
		echo " ]]></content:encoded>\n";
		echo "\t\t\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
	echo "\t\t</item>\n";
}
?>

And finally, the RSS file is closed off.

</channel>
</rss>
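One caveat about the item loop: strip_tags() removes markup, but scraped text can still contain a bare & or an HTML-only entity, which would make the feed invalid XML. A possible way to guard against that is sketched below. rss_item() is a hypothetical helper that is not part of the demo script, and the ISO-8859-1 charset is assumed only because it matches the encoding declared in the feed header.

```php
<?php
// Hypothetical helper (not in the original script): build one <item> with XML-safe text.
function rss_item($title, $link, $text, $author)
{
    // Strip tags, decode HTML entities so nothing is escaped twice, then escape for XML.
    $esc = function ($s) {
        $plain = html_entity_decode(strip_tags($s), ENT_QUOTES, 'ISO-8859-1');
        return htmlspecialchars($plain, ENT_QUOTES | ENT_XML1, 'ISO-8859-1');
    };

    return "\t\t<item>\n"
         . "\t\t\t<title>" . $esc($title) . "</title>\n"
         . "\t\t\t<link>" . $esc($link) . "</link>\n"
         . "\t\t\t<description>" . $esc($text) . "</description>\n"
         . "\t\t\t<dc:creator>" . $esc($author) . "</dc:creator>\n"
         . "\t\t</item>\n";
}

// Made-up values for illustration.
echo rss_item('Tips &amp; <b>Tricks</b>', 'http://www.phpit.net/example/', 'Short <i>text</i>', 'Jane Doe');
```

Inside the loop, the echo statements for the title, link, description and author could then be replaced by a single call to this helper, leaving the content:encoded block as it is.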

That's all. If you put all the code together, like in the demo script, then you'll have a working RSS feed.

Conclusion

In this tutorial I have shown you how to create an RSS feed for a website that does not yet have one of its own. Though the regular expressions are different for each website, the principle is exactly the same.

One thing I should mention is that you shouldn't immediately screen scrape a website's content. E-mail the owners first and ask about an RSS feed. Who knows, they might set one up themselves, and that would be even better.

Download the sample script at http://www.phpit.net/viewsource.php?url=/demo/screenscrape%20rss/example.php

About The Author

Dennis Pallett is a young tech writer with much experience in ASP, PHP and other web technologies. He enjoys writing, and has written several articles and tutorials. To find more of his work, look at his websites at http://www.phpit.net, http://www.aspit.net and http://www.ezfaqs.com.
