Source: A Tale of Two Companies
It’s hard to pay attention to the business of journalism without hearing about data journalism or data-driven journalism. But despite all the discussion of the topic, there’s precious little documentation to guide practicing and future journalists in becoming proficient in it. The Data Journalism Handbook aims to fix that, albeit at a high level.
The Data Journalism Handbook effort started at a workshop at the London MozFest 2011 last November. Since then, the handbook has come to represent the work of “an international, collaborative effort involving dozens of data journalism’s leading advocates and best practitioners.” This includes folks from ProPublica, The Washington Post, the BBC, The New York Times and many others.
The result, so far, is an online book that’s just now in beta. Eventually it will also be published in dead tree and e-book form by O’Reilly. However, given the nature of the tome, it’s most useful online. As you’d expect from a title that was born at a Mozilla conference, the text is full of links to online resources. I suspect trying to read the title as an e-book – or especially on paper – would be a little frustrating.
Inside the Handbook
The handbook offers a glimpse into the practice of data journalism, with some guidance on how to get started. You’ll find a slew of case studies, along with sections on getting data, understanding data and delivering data to the public.
The handbook covers topics like open data, data use rights, scraping and crowd-sourcing data, and community engagement. You’ll also find some high-level discussion of tools to work with open data, and how to get that data.
Most importantly, the book makes a resounding case for data-driven journalism. The case studies demonstrate the utility of data-driven journalism and the service that it offers the public. For instance, the OpenSpending.org example should inspire any journalist who covers politics and public funds. The Price of Water case study shows not only the service to the public, but the service of the public in gathering data.
The handbook is not a comprehensive guide to all of the concepts and skills that a journalist needs to practice data journalism. It doesn’t teach the skills necessary for data literacy, though it does provide some links to resources. It also, of course, explains the importance of data literacy. But it certainly doesn’t try to teach journalists how to program and make use of APIs, or how to use tools to create data visualizations.
In short, it’s not Big Data for Journalists or even Programming 101 for Journalists, and more’s the pity. Programming and working with data sets are skills that many journalists would do well to have, but most don’t. To be fair, the handbook doesn’t necessarily advocate that journalists be programmers. It does emphasize being able to work well with programmers, but it would probably be a very good idea for journalists to have at least a fair grasp of basic programming.
Tips and Ideas
If you read just part of the handbook, I’d recommend skipping the case studies and going straight to the meat of the book: the sections on getting data, understanding data and delivering data.
One good example is “Become Data Literate in 3 Simple Steps.” This piece advises journalists, at a high level, on how to approach data. Ask yourself how the data was collected and whether it can be tested. Don’t assume that data handed to you by a source is going to be valid. (And if the data is not valid, that may be a story in itself, or it may defeat the premise of the story.) Question the data, how it was gathered and whether it’s a reliable sample. You see, for instance, many “trend” stories about technology based on a single data set, and that data set may not have a large enough sample size to rely on.
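To put a number on the sample-size question, here is a minimal Python sketch of my own (it is not from the handbook). It uses the standard 95 percent margin of error for a survey proportion and assumes a simple random sample. The function name and the survey figures are made up for illustration.

```python
import math

def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    z=1.96 corresponds to a 95% confidence level (normal approximation).
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= proportion <= 1:
        raise ValueError("proportion must be between 0 and 1")
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Hypothetical survey: 60% of respondents say they use a given gadget.
for n in (50, 500, 5000):
    moe = margin_of_error(0.60, n)
    print(f"n={n}: 60% plus or minus {moe * 100:.1f} points")
```

With only 50 respondents, the uncertainty is roughly 14 points either way, which is enough to sink many “trend” claims. The formula says nothing about how the sample was gathered, so it complements the handbook’s other questions rather than replacing them.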
The section on visualizing data is also useful. The handbook recommends that reporters working with data find a way to visualize it, even if that’s just pulling numbers into a spreadsheet. Visualizing data allows you to find patterns that you might otherwise miss.
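Even without a spreadsheet, a few lines of code can give a rough picture of the numbers. The following Python sketch is my own illustration, not something from the handbook. The helper name `text_bar_chart` and the spending figures are invented for the example.

```python
def text_bar_chart(values, width=40):
    """Return one text line per label, with a bar scaled to the largest value."""
    if not values:
        return []
    largest = max(values.values())
    label_width = max(len(label) for label in values)
    lines = []
    for label, value in values.items():
        # Scale each bar so the largest value fills the full width.
        bar_length = round(value / largest * width) if largest else 0
        lines.append(f"{label.ljust(label_width)} | {'#' * bar_length} {value}")
    return lines

# Made-up figures for illustration only: spending by department, in millions.
spending = {"Schools": 120, "Roads": 45, "Parks": 12, "Police": 90}
for line in text_bar_chart(spending):
    print(line)
```

A crude chart like this is no substitute for a proper visualization tool, but it is often enough to spot an outlier or a lopsided distribution before deciding where to dig.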
In the enthusiasm for working with data, scraping websites or gathering data in other ways, there’s also the small matter of legal restrictions. Whose data is it, and do you have the right to distribute it? The “Using and Sharing Data” section advises reporters to consider the ownership and licensing of data, and when “database rights” might mean that you can’t distribute a data set in its entirety. It also covers various open-data licenses and recommends that news organizations apply those when distributing homegrown data sets.
An Unevenly Distributed Future
What the handbook also does, sadly, is provide a tantalizing picture of what is, and what should be. As William Gibson said, “the future is already here – it’s just not very evenly distributed.” The same can be said for data journalism. We have marvelous tools for doing data journalism, and they’re getting better all the time. In some newsrooms, journalists are producing solid work with in-house or open-source tools, examining everything from public data sets to data curated in-house.
In most newsrooms, however, reporting has not yet been significantly affected by data journalism. In an era of continual layoffs and cutbacks, there’s no budget for the training or tools reporters need to get up to speed on the necessary practices. Most of the case studies describe projects that take weeks or months, a depressing concept for journalists tasked with writing several stories per day.
There’s a deep need for the handbook, and a sequel or two that dive deep into the actual practice of data-driven journalism. (To my friends at O’Reilly, a “programming for journalists” book would be a nifty title.) It’s inspiring and educational material, if less focused on “how-to” than one might like.
Data-driven journalism is in its infancy right now, despite the amount of discussion it’s generating. I suspect that it’s going to be five to 10 years before we’ll see the practices in the handbook becoming mainstream.
Image from the Data Journalism Handbook, which is available under the Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) in its entirety.
The launch of Google Drive this week brings to the forefront the issue of privacy and the use of cloud storage services.
It’s not that Google’s privacy policies are significantly better or worse than competing services, such as Microsoft SkyDrive, Apple iCloud, Dropbox, SugarSync and SpiderOak. It’s more that big-name vendors are making it so darn easy and cheap to store personal photos, documents and audio files that these issues now threaten to affect a huge number of users.
For example, Google offers 5GB of free storage and Microsoft 7GB, so why not take advantage of the convenience of having content in the cloud and being able to share it with anyone? Well, there is no reason not to, as long as you know the risks.
In general, the privacy policies for all the service providers are similar, as The Washington Post points out. Vendors acknowledge they don’t own the data and promise they won’t access it, other than to operate their services. That exception is important to the companies, because they need to copy and move files and folders around their servers in order to provide backup and file sharing and to develop new services.
The services do have some differences, though:
Microsoft SkyDrive: Microsoft’s terms of service have gotten kudos for favoring plain language over legalese. In general, the terms of service give Microsoft the same rights and have the same limitations as Google’s.
Apple iCloud: Apple goes a step further than the rest in censorship. The company has the right to delete – without prior notification – any content stored that it finds “objectionable.” Apple doesn’t say how it decides whether content is fit for iCloud.
Dropbox: Unlike Google, Microsoft and Apple, Dropbox’s business lies only in cloud storage and file sharing. Nevertheless, its terms of service tend to use language that is more vague, which could be interpreted as being more expansive in terms of its rights.
The truly slippery issue, though, isn’t the services’ own policies, but how they deal with law enforcement, government agencies and lawyers in civil cases. All the storage providers say they will hand over files if required to by law, but they don’t commit to telling affected customers. This makes it possible for vendors to follow law enforcement requests to keep their actions secret, but is a red flag for privacy advocates.
“We advocate for these hosts to have a really transparent policy and to notify people when their information is requested,” said Rebecca Jeschke, spokeswoman for the Electronic Frontier Foundation (EFF), a San Francisco-based advocacy group for digital rights.
Encryption is the Key
SpiderOak gets around this dilemma by encrypting data and handing the key to customers. Because SpiderOak can’t decrypt the data, anyone who wants readable files has to go to the customer for the key, so the customer is notified by default. The other vendors listed above also encrypt data, but retain the ability to decrypt it.
Another gray area is in copyright protection. As this year’s demise of file-sharing site Megaupload showed, law enforcement can move quickly to take an operation offline and arrest its founders, if there’s strong evidence that the site is being used to share lots of copyrighted material. When that happens, everyone who stores files on that service is affected, whether or not they are even suspected of copyright infringement.
All the cloud storage providers let users share their content with others, so the possibility of copyright violation is ever-present. The question is whether this could become enough of a problem to draw the attention of the entertainment industry or other groups intent on protecting copyright. Storage providers who lack sufficient mechanisms for preventing copyright violations could meet the same fate as Megaupload, leaving innocent users unable to access their own data. And it’s not entirely clear what measures would be considered sufficient.
May I See Your ID?
To avoid such problems, cloud storage providers could one day implement some kind of identification system to look for copyrighted material, similar to what Google already does on YouTube. The EFF hopes vendors tackle this problem on their own to avoid government requirements that could prove too onerous for startups.
“If YouTube had to have something in place like that [a content ID system] right away, it might never have existed,” Jeschke said. “Requiring all this overhead before companies can innovate would be problematic.”
The bottom line is trust. Whatever the written policies, when selecting a cloud storage provider, consumers and companies should first decide whether they believe the vendor can be trusted to do everything possible to protect customers’ privacy.
Images courtesy of Shutterstock.