Friday, 26 October 2012

Doing RSS right - retrieving content

Feeds, usually RSS but sometimes Atom or other formats, are a convenient way of including syndicated content in web pages - indeed the last 'S' of 'RSS' stands for 'syndication' in one of the two possible ways of expanding the acronym.

The obvious way to include the content of a feed in a dynamically-generated web page (such as the 'News' box on the University's current home page) is to include in the code that generates the page something that retrieves the page's feed data, parses it, and then marks it up and includes it in the page.

But this obvious approach comes with some drawbacks. Firstly the process of retrieving and parsing the feed may be slow and may be resource intensive. Doing this on every page load may slow down page rendering and will increase the load on the web server doing the work - it's easy to forget that multiple page renderings can easily run in parallel if several people look at the same page at about the same time.

Secondly, fetching the feed on every page load could also throw an excessive load on the server providing the feed - this is at least impolite and could trigger some sort of throttling or blacklisting behaviour.

And thirdly there's the problem of what happens if the source of the feed becomes unreachable. Unless it's very carefully written, the retrieval code will probably hang waiting for the feed to arrive, probably preventing the entire page from rendering and giving the impression that your site is down, or at least very slow. And even if the fetching code can quickly detect that the feed really isn't going to be available (and doing that is harder than it sounds), what do you then display in your news box (or equivalent)?

A better solution is to separate out the fetching part of the process from the page rendering part. Get a background process (a cron job, say, or a long-running background thread) to periodically fetch the feed and cache it somewhere local, say in a file, in a database, or in memory for real speed. While it's doing this it might as well check the feed for validity and only replace the cached copy if it passes. This process can use standard HTTP mechanisms (conditional requests such as If-Modified-Since) to check for changes in the feed and so only transfer it when actually needed - it's likely to need to remember the feed's last-modification timestamp from every fetch to make this work.
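As a rough illustration, one pass of such a background job might look like the Python sketch below, which uses only the standard library. The function names, the `state` dictionary and the ten-second timeout are invented for this example, and the validity check is deliberately crude: it only accepts well-formed XML with an `rss` or Atom `feed` root element.

```python
import os
import tempfile
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET


def fetch_if_changed(url, etag=None, last_modified=None, timeout=10):
    """Conditional GET. Returns (body, etag, last_modified); body is None on 304."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    try:
        # Note: the timeout applies to each socket operation, not the whole fetch.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:  # not modified: the cached copy is still current
            return None, etag, last_modified
        raise


def looks_valid(body):
    """Crude check: well-formed XML whose root is an RSS or Atom element."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    return root.tag in ("rss", "{http://www.w3.org/2005/Atom}feed")


def replace_cache(path, body):
    """Write to a temporary file and rename it, so readers never see half a feed."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates a file readable only by its owner; adjust the
    # permissions if the pages are rendered by a different user.
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(body)
    os.replace(tmp, path)


def refresh(url, path, state, fetch=fetch_if_changed):
    """One pass of the job. `state` carries the validators from fetch to fetch."""
    body, etag, modified = fetch(url, state.get("etag"), state.get("modified"))
    if body is None or not looks_valid(body):
        return False  # unchanged or invalid: leave the cached copy alone
    replace_cache(path, body)
    state.update(etag=etag, modified=modified)
    return True
```

A cron job would call `refresh()` for each feed and would need to save `state` somewhere between runs; a long-running thread could simply keep it in memory. Network errors other than a 304 are left to propagate so the caller can decide how to count them.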

That way, once you've retrieved it once you'll always have something to display even if the feed becomes unavailable or the content you retrieve is corrupt. It would be a good idea to alert someone if this situation persists, otherwise the failure might go unnoticed, but don't do so immediately or on every failure since it seems common for some feeds to be at least temporarily unavailable. Since the fetching job is parsing the feed anyway, it could store the parsed result in some easily digestible format to further reduce the cost of rendering the content into the relevant pages.
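Both ideas are easy to sketch in Python. In the example below, `FailureMonitor` raises an alert only once an outage has lasted a while, and `digest_rss` boils an RSS 2.0 feed down to a small JSON list. The class and function names, the six-hour threshold and the choice of JSON are all assumptions made for the example, not recommendations.

```python
import json
import time
import xml.etree.ElementTree as ET


class FailureMonitor:
    """Decides when a run of failed fetches has gone on long enough to report."""

    def __init__(self, alert_after_seconds=6 * 3600, clock=time.time):
        self.alert_after_seconds = alert_after_seconds
        self.clock = clock
        self.failing_since = None
        self.alerted = False

    def record(self, success):
        """Record one fetch attempt. Returns True at most once per outage."""
        if success:
            self.failing_since = None
            self.alerted = False
            return False
        now = self.clock()
        if self.failing_since is None:
            self.failing_since = now
        overdue = now - self.failing_since >= self.alert_after_seconds
        if overdue and not self.alerted:
            self.alerted = True
            return True
        return False


def digest_rss(body, limit=5):
    """Reduce an RSS 2.0 document to a JSON list of item titles and links."""
    root = ET.fromstring(body)
    items = [
        {"title": item.findtext("title", default="").strip(),
         "link": item.findtext("link", default="").strip()}
        for item in root.iter("item")
    ]
    return json.dumps(items[:limit])
```

The fetching job would call `record()` after every attempt and send its email (or whatever) when it returns True. The page-rendering code then only has to read a few lines of JSON rather than parse XML.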

Of course this, like most caching strategies, has the drawback that there will now be a delay between the feed updating and the change appearing on your pages - in some circumstances the originators of feeds seem very keen that any changes are visible immediately. In practice, as long as they know what's going on they seem happy to accept a short delay. There's also the danger that you will be fetching (or at least checking) a feed that is no longer used or is very rarely viewed. Automatically keeping statistics on how often a particular feed is actually included in a page would allow you to tune the fetching process (automatically or manually) to do the right thing.
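A minimal sketch of that sort of bookkeeping, again in Python, might look like this. Everything here is hypothetical: the `FeedUsage` class, the thresholds (100 views, one week of idleness) and the two refresh intervals are placeholders to be tuned, and a real version would need to keep its counts somewhere that both the page-rendering code and the fetching job can reach.

```python
import time
from collections import defaultdict


class FeedUsage:
    """Counts how often each feed is included in a page."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.views = defaultdict(int)
        self.last_viewed = {}

    def record_view(self, feed_url):
        """Call from the page-rendering code whenever a feed is displayed."""
        self.views[feed_url] += 1
        self.last_viewed[feed_url] = self.clock()

    def refresh_interval(self, feed_url, busy_views=100,
                         fast=300, slow=3600, idle_after=7 * 86400):
        """Seconds between fetches, or None for a feed that looks unused."""
        last = self.last_viewed.get(feed_url)
        if last is None or self.clock() - last > idle_after:
            return None  # never viewed, or not viewed for a week
        # Popular feeds are checked often, rarely viewed ones only hourly.
        return fast if self.views[feed_url] >= busy_views else slow
```

The fetching job can then ask `refresh_interval()` for each feed before deciding whether it is due, and drop (or at least report) any feed for which the answer is None.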

If you can't do this, perhaps because you are stuck with a content management system that insists on doing things its way, then one option might be to arrange to fetch all feeds via a local caching proxy. That way the network connections being made for each page view will be local and should succeed. Suitable configuration of the cache should let you avoid hitting the origin server too often, and you may even be able to get it to continue to serve stale content if the origin server becomes unavailable for a period of time.

See also Doing RSS right (2) - including content and Doing RSS right (3) - character encodings.
