Saturday, 27 April 2013
Why I think php is a bad idea
Update: a friend reminds me of http://me.veekun.com/blog/2012/04/09/php-a-fractal-of-bad-design/ which covers the same topic from a different angle.
Thursday, 31 January 2013
Getting your fonts from the cloud
The University of Cambridge's latest web style, due for deployment RSN, uses Adobe Myriad Pro for some of its headings. This is loaded as a web font from Adobe's TypeKit service. As I understand it this is the only legal way to use Adobe Myriad Pro since Adobe don't allow self-hosting.
Typekit is apparently implemented on a high-availability Content Delivery Network (though even that isn't perfect - see for example here), but the question remains of what the effect will be if it can't be reached. Obviously the font won't be available, but we have web-safe fall-backs available. The real question is what sort of delay we might see under these circumstances. Ironically, one group who are particularly exposed to this risk are University users since at the moment we only have one connection to JANET, and so to the Internet and all the TypeKit servers.
TypeKit fonts are loaded by loading a JavaScript library in the head of each document and then calling an initialisation function:
<script type="text/javascript" src="//use.typekit.com/<licence token>.js"></script>
<script type="text/javascript">try{Typekit.load();}catch(e){}</script>
Web browsers block while loading JavaScript like this, so if use.typekit.com can't be reached then page loading will be delayed until the attempt times out. How long will this be?
Some experiments suggest that it varies widely between operating systems, browsers, and types of network connection. At best, loss of access to TypeKit results in an additional 3 or 4 second delay in page loading (this is actually too small, see correction below). At worst this delay can be a minute or two. iOS devices, for example, seem to consistently see an additional 75 second delay. These delays apply to every page load since browsers don't seem to cache the failure.
Users are going to interpret this as somewhere between the web site hosting the pages going slowly and the web site being down. It does mean that for many local users, loss of access to TypeKit will cause them to lose usable access to any local pages in the new style.
Of course similar considerations apply to any 'externally' hosted JavaScript. One common example is the code to implement Google Analytics. However in this case it's typically loaded at the bottom of each page and so shouldn't delay page rendering. This isn't an option for a font unless you can cope with the page initially rendering in the wrong font and then re-rendering subsequently.
I also have a minor concern about loading third-party JavaScript. Such JavaScript can in effect do whatever it wants with your page. In particular it can monitor form entry and steal authentication tokens such as cookies. I'm not for one moment suggesting that Adobe would deliberately do such things, but we don't know much about how this JavaScript is managed and delivered to us so it's hard to evaluate the risk we might be exposed to. In view of this it's likely that at least the login pages for our central authentication system (Raven) may not be able to use Myriad Pro.
Update: colleagues have noticed a problem with my testing methodology which means that some of my tests will have been overly-optimistic about the delays imposed. It now appears that at best, loss of access to TypeKit results in an additional 20-30 seconds delay in page loading. That's a long time waiting for a page.
Further update: another colleague has pointed out that TypeKit's suggested solution to this problem is to load the JavaScript asynchronously. This has the advantage of allowing you to control the time-out process and decide when to give up and use fall-back fonts, but has the drawback that it requires custom CSS to hide the flash of unstyled text that can occur while fonts are loading.
Sunday, 27 January 2013
Restricting web access based on physical location
Occasionally people want to restrict access to a web-based resource based not on who is accessing it but on where they are located when they do so. This is normally to comply with some sort of copyright licence. In UK education this is, more often than not, something to do with the educational recording licences offered by ERA.
Unfortunately this is difficult to do, and close to impossible to do reliably. This often puzzles people, given that the ERA licences expect it and that things like BBC iPlayer are well known to be already doing it. It's a long story...
Because of the way the Internet works it's currently impossible to know, reliably, where the person making a request is physically located. It is however possible to guess, but you need to understand the limitations of this guessing process before relying on it. Whether this guessing process is good enough for any particular purpose is something only people using it can decide.
A common approach is based on Internet Protocol (IP) addresses. When someone requests something from a web server, one of the bits of information that the server sees is the IP address of the computer from which the request came (much as your telephone can tell you the number of the person calling you). In many cases this will be the address assigned to the computer the person making the request is sitting at. IP addresses are generally assigned on a geographic basis and lists exist of what addresses are used where, so it is in principle possible to ask the question 'Did my server receive this request from a machine in the UK', or even '...in my institution'.
But there are catches:
- It's possible to route requests through multiple computers, in which case the server only sees the address of the last one. This often happens without the user knowing about it (for example most home broadband set-ups route all connections through the house's broadband router, mobile networks route requests through central proxies, etc.), but it can also be done deliberately. Like many organisations, the University provides a Virtual Private Network service explicitly so that requests made from anywhere in the world can appear to be coming from a computer inside the University.
- The lists saying which addresses are used where are inevitably inaccurate. For example a multi-national company might have a block of addresses allocated to its US headquarters but, unknown to anyone outside the company, actually use some of them for its UK offices. Connections from people in the UK office would then appear to be from the US.
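For what it's worth, the '...in my institution' version of the address check might look something like the following in Python. This is only a sketch: the network ranges are reserved documentation addresses that I've made up for illustration, and the function name is mine. All it can tell you is where the last hop appears to be, so both of the catches above still apply.

```python
import ipaddress

# Hypothetical ranges for illustration only (reserved documentation blocks);
# a real deployment would list the institution's own allocations.
INSTITUTION_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def appears_local(remote_addr: str) -> bool:
    """Guess whether a request came from inside the listed networks.

    remote_addr is the address the server saw, which may belong to a
    proxy, a broadband router or a VPN rather than the user's machine.
    """
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # not a parseable address: treat as 'not local'
    return any(addr in network for network in INSTITUTION_NETWORKS)
```

A 'True' here means no more than 'the request arrived from one of our addresses', which is exactly what a VPN is designed to arrange.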
Another tempting approach is that modern web browsers, especially those on devices with GPSs such as mobile phones, can be asked to supply the user's location. This is used, for example, to put 'you are here' markers on maps. You might think that this information could be used to implement geographic restrictions. However the fundamental problem with this is that it's under the user's control, so in the end they can simply make their browser lie. Further it's often inaccurate or may not be available (for example in a desktop browser) so all in all this probably isn't a usable solution.
If you can set up authentication such that you can identify all your users then it seems to me that one approach would simply be to impose terms and conditions that prohibit them from accessing content when not physically in the UK, or wherever. You could back this up by warning them if IP address recognition or geo-location suggests that they are outside the relevant area. It seems to me (but IANAL) that this might be sufficient to meet contractual obligations (or at least to provide a defence after failing), but obviously I can't advise on any particular case.
Monday, 5 November 2012
Doing RSS right (3) - character encoding
OK, I promise I'll shut up about RSS after this posting (and my previous two).
This posting is about one final problem in including text from RSS feeds, or Atom feeds, or almost anything else, into web pages. The problem is that text is made up of characters and that 'characters' are an abstraction that computers don't understand. What computers ship around (across the Internet, in files on disk, etc.) while we are thinking about characters are really numbers. To convert between a sequence of numbers and a sequence of characters you need some sort of encoding, and the problem is that there are lots of these and they are all different. In theory if you don't know the encoding you can't do anything with number-encoded text. However most of the common encodings use the numbers from the ASCII encoding for common letters and other symbols. So in practice a lot of English and European text will come out right-ish even if it's being decoded based on the wrong encoding.
But once you move away from the characters in ASCII (A-Z, a-z, 0-9, and a selection of other common ones) to the slightly more 'esoteric' ones -- pound sign, curly open and close quotation marks, long typographic dashes, almost any common character with an accent, and any character from a non-European alphabet -- then all bets are off. We've all seen web pages with strange question mark characters (like this �) or boxes where quotation marks should be, or with funny sequences of characters (often starting Â) all over them. These are all classic symptoms of character encoding confusion. It turns out there's a word to describe this effect: 'Mojibake'.
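Both symptoms are easy to reproduce. The following Python sketch (the function name is mine, and UTF-8 and Latin-1 are just two convenient encodings to pick on) writes some text in one encoding and reads it back in another:

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text one way, then decode the resulting numbers another way."""
    return text.encode(written_as).decode(read_as, errors="replace")


# UTF-8 bytes read as Latin-1: the two-byte pound sign becomes two characters.
stray_a = misread("£5", "utf-8", "latin-1")      # 'Â£5'

# Latin-1 bytes read as UTF-8: the lone byte is invalid, so it is replaced.
question_mark = misread("£5", "latin-1", "utf-8")  # '�5'

# Pure ASCII survives either way, which is why the problem hides for so long.
unharmed = misread("Price: 5", "utf-8", "latin-1")  # 'Price: 5'
```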
Now I'm not going to go into detail here about what the various encodings look like, how you work with them, how you can convert from one to another, etc. That's a huge topic, and in any case the details will vary depending on which platform you are using. There's what I think is a good description of some of this at the start of chapter 4 of 'Dive into Python3' (and this applies even if you are not using Python). But if you don't like this there are lots of other similar resources out there. What I do want to get across is that if you take a sequence of numbers representing characters from one document and insert those numbers unchanged into another document then that's only going to work reliably if the encodings of the two documents are identical. There's a good chance that doing this wrong may appear to work as long as you restrict yourself to the ASCII characters, but sooner or later you will hit something that doesn't work.
What you need to do to get this right is to convert the numbers from the source document into characters according to the encoding of your source document, and then convert those characters back into numbers based on the encoding of your target. Actually doing this is left as an exercise for the reader.
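That said, in a language with a separate type for decoded text the two steps are short enough to show. A minimal Python sketch, assuming you already know both encodings (working out the source encoding is the hard part, and isn't addressed here):

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert encoded text from one encoding to another via characters."""
    characters = data.decode(source_encoding)    # numbers -> characters
    return characters.encode(target_encoding)    # characters -> numbers


latin1_bytes = b"caf\xe9"                                  # 'café' in Latin-1
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")   # b'caf\xc3\xa9'
```

Note that the second step raises UnicodeEncodeError if the target encoding has no number for one of the characters (a euro sign going into Latin-1, for example), which is a good deal better than silently producing nonsense.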
If your target document is an HTML one then there's an alternative approach. In HTML (and XML come to that) you can represent almost any character using a numeric character reference based on the Universal Character Set from Unicode. If you always represent anything not in ASCII this way then the representation of your document will only contain ASCII characters, and these come out the same in most common encodings. So if someone ends up interpreting your text using the wrong encoding (and that someone could be you if, for example, you edit your document with an editor that gets character encoding wrong) there's a good chance it won't get corrupted. You should still clearly label such documents with a suitable character encoding. This is partly because (as explained above) it is, at least in theory, impossible to decode a text document without this information, but also because doing so helps to defend against some other problems that I might describe in a future posting.
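In Python this ASCII-only representation comes almost for free, because the standard encoder has an error handler that does exactly this. A sketch (the function name is mine); note that it does nothing about characters such as '<' and '&', which need escaping separately:

```python
import html


def to_ascii_html(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


encoded = to_ascii_html("£5 café")      # '&#163;5 caf&#233;'
round_trip = html.unescape(encoded)     # back to '£5 café'
```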
Labels:
Atom,
data feeds,
RSS,
web technologies
Friday, 26 October 2012
Doing RSS right (2) - including content
In addition to the issues I described in 'Doing RSS right', there's another problem with RSS feeds, though at least this one doesn't apply to Atom.
The problem is that there's nothing in RSS to say if the various blocks of text are allowed to contain markup, and if so which. Apparently (see here):
"Userland's RSS reader—generally considered as the reference implementation—did not originally filter out HTML markup from feeds. As a result, publishers began placing HTML markup into the titles and descriptions of items in their RSS feeds. This behavior has become expected of readers, to the point of becoming a de facto standard"
This isn't just difficult, it's unresolvable. If you find
<strong>Boo!</strong>
in feed data you simply can't know if the author intended it as an example of HTML markup, in which case you should escape the brackets before including them in your page, or as an emphasised 'Boo!', in which case you are probably expected to include the data as it stands.
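The first of those two readings is at least easy to implement. In Python, for example, the standard library will do the escaping:

```python
import html

feed_text = "<strong>Boo!</strong>"

# Treat the feed data as plain text: the reader sees the tags themselves.
as_text = html.escape(feed_text)   # '&lt;strong&gt;Boo!&lt;/strong&gt;'
```

The second reading is where the trouble starts.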
And if you are expected to include the data as it stands you have the added problem that including HTML authored by third parties in your pages is dangerous. If they get their HTML wrong they could wreck the layout of your page (think missing close tag) and, worse, they could inject JavaScript into your pages or open you up to cross site scripting attacks by others. As I wrote here and here, if you let other people add any content to your pages then you are essentially giving them editing rights to the entire page, and perhaps the entire site.
However, given how things are and unless you know from agreements or documentation that a feed will only ever contain text then you are going to have to assume that the content includes HTML. Stripping out all the tags would be fairly easy, but probably isn't going to be useful because it will turn the text into nonsense - think of a post that includes a list.
The only safe way to deal with this is to parse the content and then only allow that subset of HTML tags and/or attributes that you believe to be safe. Don't fall for the trap of trying to filter out only what you consider to be dangerous because that's almost impossible to get right, and don't let all attributes through because they can be dangerous too - consider <a href="javascript:...">.
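To make the 'allow only what you believe to be safe' idea concrete, here is a rough Python sketch built on the standard library's HTML parser. The allow-lists and all the names are my own assumptions, relative links are deliberately dropped along with everything else that isn't plain http or https, and I wouldn't put this in front of real feeds; it's only meant to show the shape of the approach.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical allow-lists for illustration; choose your own with care.
ALLOWED_TAGS = {"a", "b", "strong", "em", "p", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = {"http", "https"}


def safe_href(value: str) -> bool:
    """Accept only links whose scheme is explicitly allowed."""
    try:
        return urlsplit(value.strip()).scheme.lower() in ALLOWED_SCHEMES
    except ValueError:
        return False


class AllowListFilter(HTMLParser):
    """Re-emit allowed tags and attributes; silently drop the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # non-zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            if name == "href" and not safe_href(value):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(fragment: str) -> str:
    """Return fragment with everything outside the allow-lists removed."""
    parser = AllowListFilter()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

Everything not explicitly listed disappears, including attributes like onclick and links like javascript:..., because nothing ever has to recognise them as dangerous. This sketch makes no attempt to balance unclosed tags, so it doesn't address the layout problem.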
What should you let through? Well, that's hard to say. Most of the in-line elements, like <b>, <strong>, <a> (carefully), etc. will probably be needed. Also at least some block level stuff - <p>, <div>, <ul>, <ol>, etc. And note that you will have to think carefully about the character encoding both of the RSS feed and the page you are substituting it into, otherwise you might not realise that +ADw-script+AD4- could be dangerous (hint: take a look at UTF-7).
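If you'd rather not look it up, Python's UTF-7 codec gives the game away. The payload contains no angle brackets at all until something decides to decode it as UTF-7:

```python
payload = b"+ADw-script+AD4-"

# Looks harmless to any filter that searches the raw bytes for '<'.
has_bracket_before = b"<" in payload       # False

# Decoded as UTF-7 it turns out to be an opening script tag.
decoded = payload.decode("utf-7")          # '<script>'
```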
If at all possible I'd try to avoid doing this yourself and use a reputable library for the purpose. Selecting such a library is left as an exercise for the reader.
See also Doing RSS right (3) - character encoding.
Doing RSS right - retrieving content
Feeds, usually RSS but sometimes Atom or other formats, are a convenient way of including syndicated content into web pages - indeed the last 'S' of 'RSS' stands for 'syndication' in one of the two possible ways of expanding the acronym.
The obvious way to include the content of a feed in a dynamically-generated web page (such as the 'News' box on the University's current home page) is to include in the code that generates the page something that retrieves the page's feed data, parses it, and then marks it up and includes it in the page.
But this obvious approach comes with some drawbacks. Firstly the process of retrieving and parsing the feed may be slow and may be resource intensive. Doing this on every page load may slow down page rendering and will increase the load on the web server doing the work - it's easy to forget that multiple page renderings can easily run in parallel if several people look at the same page at about the same time.
Secondly, fetching the feed on every page load could also throw an excessive load on the server providing the feed - this is at least impolite and could trigger some sort of throttling or blacklisting behaviour.
And thirdly there's the problem of what happens if the source of the feed becomes unreachable. Unless it's very carefully written the retrieval code will probably hang, waiting for the feed to arrive, probably preventing the entire page from rendering and giving the impression that your site is down, or at least very slow. And even if the fetching code can quickly detect that the feed really isn't going to be available (and doing that is harder than it sounds), what do you then display in your news box (or equivalent)?
A better solution is to separate out the fetching part of the process from the page rendering part. Get a background process (a cron job, say, or a long-running background thread) to periodically fetch the feed and cache it somewhere local, say in a file, in a database, or in memory for real speed. While it's doing this it might as well check the feed for validity and only replace the cached copy if it passes. This process can use standard HTTP mechanisms to check for changes in the feed and so only transfer it when actually needed - it's likely to need to remember the feed's last modification timestamp from every fetch to make this work.
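As an illustration, the fetching job might be sketched like this in Python. Everything here is my own naming, the cache is just an object in memory, and the 'validity' check is a crude stand-in (well-formed XML with an RSS 2.0 or Atom root element) for whatever checking you actually want. Real code should also think about the safety of parsing XML from untrusted sources.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedFeed:
    body: Optional[bytes] = None         # last good copy of the feed
    last_modified: Optional[str] = None  # Last-Modified header from that fetch
    etag: Optional[str] = None           # ETag header from that fetch


def http_fetch(url, cached, timeout=10):
    """Conditional GET; returns (body, last_modified, etag), or None on 304."""
    request = urllib.request.Request(url)
    if cached.last_modified:
        request.add_header("If-Modified-Since", cached.last_modified)
    if cached.etag:
        request.add_header("If-None-Match", cached.etag)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (response.read(),
                    response.headers.get("Last-Modified"),
                    response.headers.get("ETag"))
    except urllib.error.HTTPError as err:
        if err.code == 304:  # not modified since the cached copy
            return None
        raise


def is_valid_feed(body: bytes) -> bool:
    """Crude check: well-formed XML whose root is RSS 2.0 or Atom."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    return root.tag in ("rss", "{http://www.w3.org/2005/Atom}feed")


def refresh(url, cached, fetch=http_fetch):
    """Run from cron or a background thread, never while rendering a page."""
    try:
        result = fetch(url, cached)
    except OSError:      # URLError, HTTPError and socket timeouts
        return cached    # unreachable: keep serving the old copy
    if result is None:
        return cached    # unchanged
    body, last_modified, etag = result
    if not is_valid_feed(body):
        return cached    # corrupt: keep serving the old copy
    return CachedFeed(body, last_modified, etag)
```

The page rendering code then only ever reads the cached copy, so it can't be held up by the feed's server.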
That way, once you've retrieved it once you'll always have something to display even if the feed becomes unavailable or the content you retrieve is corrupt. It would be a good idea to alert someone if this situation persists, otherwise the failure might go unnoticed, but don't do so immediately or on every failure since it seems common for some feeds to be at least temporarily unavailable. Since the fetching job is parsing the feed it could store the parsed result in some easily digestible format to further reduce the cost of rendering the content into the relevant pages.
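The 'alert, but not too eagerly' part needs nothing more than a counter. A sketch, in which the class name and the threshold of six consecutive failures are arbitrary assumptions of mine:

```python
from dataclasses import dataclass


@dataclass
class FailureTracker:
    threshold: int = 6      # hypothetical: failed runs in a row before alerting
    consecutive: int = 0
    alerted: bool = False

    def record(self, success: bool) -> bool:
        """Record one fetch attempt; return True when an alert is due."""
        if success:
            self.consecutive = 0
            self.alerted = False
            return False
        self.consecutive += 1
        if self.consecutive >= self.threshold and not self.alerted:
            self.alerted = True   # alert once per outage, not on every run
            return True
        return False
```

The fetching job calls record() after each attempt and sends its email (or whatever) only when it returns True.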
Of course this, like most caching strategies, has the drawback that there will now be a delay between the feed updating and the change appearing on your pages - in some circumstances the originators of feeds seem very keen that any changes are visible immediately. In practice, as long as they know what's going on they seem happy to accept a short delay. There's also the danger that you will be fetching (or at least checking) a feed that is no longer used or very rarely viewed. Automatically keeping statistics on how often a particular feed is actually included in a page would allow you to tune the fetching process (automatically or manually) to do the right thing.
If you can't do this, perhaps because you are stuck with a content management system that insists on doing things its way, then one option might be to arrange to fetch all feeds via a local caching proxy. That way the network connections being made for each page view will be local and should succeed. Suitable configuration of the cache should let you avoid hitting the origin server too often, and you may even be able to get it to continue to serve stale content if the origin server becomes unavailable for a period of time.
See also Doing RSS right (2) - including content and Doing RSS right (3) - character encoding.
The obvious way to include the content of a feed in a dynamically-generated web page (such as the 'News' box on the University's current home page) is to include in the code that generates the page something that retrieves the page's feed data, parses it, and then marks it up and includes it in the page.
But this obvious approach comes with some drawbacks. Firstly the process of retrieving and parsing the feed may be slow and may be resource intensive. Doing this on every page load may slow down page rendering and will increase the load on the web server doing the work - it's easy to forget that multiple page renderings can easily run in parallel if several people look at the same page at about the same time.
Secondly, fetching the feed on every page load could also throw an excessive load on the server providing the feed - this is at least impolite and could trigger some sort of throttling or blacklisting behaviour.
And thirdly there's the problem of what happens if the source of the feed becomes unreachable? Unless it's very carefully written the retrieval code will probably hang, waiting for the feed to arrive, probably preventing the entire page from rendering and giving the impression that you site is down, or at least very slow. And even if the fetching code can quickly detect that the feed really isn't going to be available (and doing that is harder than it sounds), what do you then display in your news box (or equivalent)?
A better solution is to separate out the fetching part of the process from the page rendering part. Get a background process (a cron job, say, or a long ruining background thread) to periodically fetch the feed and cache it somewhere local, say in a file, in a database, or in memory for real speed. While it's doing this it it might as well check the feed for validity and only replace the cached copy if it passes. This process can use standard HTTP mechanisms to check for changes in the feed and so only transfer it when actually needed - it's likely to need to remember the feeds last modification timestamp from every fetch to make this work.
That way, once you've retrieved it once you'll always have something to display even if the feed becomes unavailable or the content you retrieve is corrupt. It would be a good idea to alert someone if this situation persists, otherwise the failure might go un-noticed, but don't do so immediately or on every failure since it seems common for some feeds to be at least temporally unavailable. Since the fetching job is parsing the feed it could store the parsed result in some easily digestible format to further reduce the cost of rendering the content into the relevant pages.
Of course this, like most caching strategies, has the drawback that there will now be a delay between the feed updating and the change appearing on your pages - in some circumstances the originators of feeds seem very keen that any changes are visible immediately. In practice, as long as they know what's going on they seem happy to accept a short delay. There's also the danger that you will be fetching (or at least checking) a feed that no longer used or very rarely viewed. Automatically keeping statistics on how often a particular feed is actually included in page would allow you to tune the fetching process (automatically or manually) to do the right thing.
If you can't do this, perhaps because you are stuck with a content management system that insists on doing things its way, then one option might be to arrange to fetch all feeds via a local caching proxy. That way the network connections being made for each page view will be local and should succeed. Suitable configuration of the cache should let you avoid hitting the origin server too often, and you may even be able to get it to continue to serve stale content if the origin server becomes unavailable for a period of time.
See also Doing RSS right (2) - including content and Doing RSS right (3) - character encodings.
Labels: Atom, data feeds, RSS, web technologies
Thursday, 12 July 2012
Cookies and Google Analytics
Recent changes to the law as it relates to the use of web site cookies have focused attention on Google Analytics. If by some freak chance you haven't met Analytics, it's a free tool provided by Google that lets web site managers analyse in depth how their site is being used. It can do lots that simple log file analysis can't, and many web site managers swear by it.
Analytics uses lots of cookies, and there's quite a lot of confusion about it. In the UK, the Information Commissioner has been quite clear that cookies used for this sort of purpose don't fall under any of the exemptions in the new rules (see the final question in 'Your questions answered'):
"The Regulations do not distinguish between cookies used for analytical activities and those used for other purposes. We do not consider analytical cookies fall within the ‘strictly necessary’ exception criteria. This means in theory websites need to tell people about analytical cookies and gain their consent."
However he goes on to say:
"In practice we would expect you to provide clear information to users about analytical cookies and take what steps you can to seek their agreement. This is likely to involve making the argument to show users why these cookies are useful."
and then says:
"Although the Information Commissioner cannot completely exclude the possibility of formal action in any area, it is highly unlikely that priority for any formal action would be given to focusing on uses of cookies where there is a low level of intrusiveness and risk of harm to individuals. Provided clear information is given about their activities we are highly unlikely to prioritise first party cookies used only for analytical purposes in any consideration of regulatory action."
which looks a bit like a 'Get out of jail free' card (or 'Stay out of jail' card) for the use of at least some analytics cookies. The recent Article 29 Data Protection Working Party Opinion on Cookie Consent Exemption seems to have come to much the same conclusion (see section 4.3). They even suggest:
"...should article 5.3 of the Directive 2002/58/EC be re-visited in the future, the European legislator might appropriately add a third exemption criterion to consent for cookies that are strictly limited to first party anonymized and aggregated statistical purposes."
Which is fine, but there's that reference to 'first party cookies' in both sets of guidance, and the reference to "a low level of intrusiveness and risk of harm to individuals".
Now that should be OK, because Google Analytics really does use first party cookies - they are set by JavaScript that you include in your own pages with a scope that means their data is only returned to your own web site (or perhaps sites, but still yours).
But there's a catch. The information from those cookies still gets sent to Google - it rather has to be, because otherwise there's no way Google can create all the useful reports that web managers like so much. But if they are first party cookies, how does that happen?
Well, if you watch carefully you'll notice that when you load a page that includes Google Analytics your browser requests a file called __utm.gif from a server at www.google-analytics.com. And attached to this request are a whole load of parameters that, as far as I can tell, largely include information out of those Google Analytics cookies. __utm.gif is a one pixel image, as typically used to implement web bugs. And the ICO is clear that:
"The Regulations apply to cookies and also to similar technologies for storing information. This could include, for example, Local Shared Objects (commonly referred to as “Flash Cookies”), web beacons or bugs (including transparent or clear gifs)." (emphases mine).
So while the cookies themselves may be first party, the system as a whole seems to me to be more like something that's third party. And a third party using persistent cookies into the bargain (some of the Analytics ones have a 2 year lifetime), and one that gets my IP address on every request.
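If you want to see for yourself what is attached to the __utm.gif request, you can copy the request URL out of your browser's network inspector and pull it apart. The following is a minimal Python sketch; the `beacon_parameters` name is my own, and the example URL, its parameter names and its values are made up for illustration rather than captured from a real request.

```python
from urllib.parse import parse_qs, urlsplit


def beacon_parameters(request_url):
    """Return the host a beacon request goes to and its decoded parameters."""
    parts = urlsplit(request_url)
    params = parse_qs(parts.query)
    # parse_qs gives a list per name; keep the first value of each
    return parts.hostname, {name: values[0] for name, values in params.items()}


# Hypothetical example URL, not a real capture
example = ("http://www.google-analytics.com/__utm.gif"
           "?utmhn=www.example.org&utmp=%2Fnews%2F"
           "&utmcc=__utma%3D1.2.3.4.5.6%3B")
host, params = beacon_parameters(example)
for name, value in sorted(params.items()):
    print(name, "=", value)
```

The point is simply that the host receiving the decoded values is Google's, not yours, even though the cookies the values came from were set for your own site.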
But it's not all bad. There's some suggestion that Google do understand this and are committing not to be all that evil. For example here they explain that they use IP addresses for geolocation, and that "Google Analytics does not report the actual IP address information to Google Analytics customers" (though I note they don't mention what they might do with it themselves). They also say that "Website owners who use Google Analytics have control over what data they allow Google to use. They can decide if they want Google to use this data or not by using the Google Analytics Data Sharing Options." (though the subsequent link seems to be broken - this looks like a possible replacement).
Further, the Google Analytics Terms of Service have a section on 'Privacy' that requires (section 8.1) anyone using Analytics to tell their visitors that:
"Google will use [cookie and IP address] information for the purpose of evaluating your use of the website, compiling reports on website activity for website operators and providing other services relating to website activity and internet usage. Google may also transfer this information to third parties where required to do so by law, or where such third parties process the information on Google's behalf. Google will not associate your IP address with any other data held by Google."
which seems fairly clear (or as clear as anything you ever find in this area).
So what do I think? My current, entirely personal view is that Google Analytics is probably OK at the moment, providing you are very clear that you are using it. It might also be a good idea to make sure you've disabled as much data sharing as possible. But I do wonder if the ICO's view might change in the future if he ever looks too closely at what's going on (or if someone foolishly describes it in a blog post...), so it might be an idea to have a plan 'B'. This might involve a locally-hosted analytics solution, or falling back to 'old fashioned' log file analysis. Both of these could still probably be supplemented by cookies, but they still wouldn't be exempt so you'd still need to get consent somehow. But this should be easier if they were truly 'first party' cookies and the data in them wasn't being shipped off to someone else. Trouble is, most good solutions in this area cost significant money. There is, as they say, no free lunch.

