data - ReadWriteWeb http://www.readwriteweb.com/feeds/tag/data en Copyright 2012 Richard MacManus readwriteweb@gmail.com Mon, 13 Feb 2012 13:30:00 -0800 http://www.sixapart.com/movabletype/?v=4.35-en http://blogs.law.harvard.edu/tech/rss

Data Privacy: What Bill Gates Said 10 Years Ago

Today is International Data Privacy Day, an event backed by companies like Intel, eBay, Facebook and Microsoft, and dedicated to educating data owners about best practices in protecting the privacy of consumer data.

The need to keep people from being exploited on account of violations of their privacy is clear, well-known, intuitive and amply articulated by highly capable people. The upside of making use of people's data is far less so. The two concerns are closely tied together. That's something Bill Gates is likely very aware of, if his comments 10 years ago are any indication.

The forthcoming era of computing is all about data. Inasmuch as that data is associated with people, it's essential that data owners feel secure in the belief that they can make use of their data in computing without concern it will be misused.

Bill Gates got this about the last era of computing, the first instances of e-commerce and the web. He wrote a famous company-wide memo ten years ago this month, all about the importance of a controversial hardware-based security paradigm called Trusted Computing.

"If we don't do this, people simply won't be willing -- or able -- to take advantage of all the other great work we do. Trustworthy Computing is the highest priority for all the work we are doing. We must lead the industry to a whole new level of Trustworthiness in computing."

Regarding Privacy in particular, the Gates memo put some things in ways we can relate to today, but other things seem antiquated.

"Users should be in control of how their data is used. Policies for information use should be clear to the user. Users should be in control of when and if they receive information to make best use of their time. It should be easy for users to specify appropriate use of their information including controlling the use of email they send."

Users should be in control of when and if they receive information to make best use of their time! Can you imagine that? Info overload as privacy violation. It makes sense, yet it seems hopelessly antiquated too.

"In the past, we've made our software and services more compelling for users by adding new features and functionality, and by making our platform richly extensible," he wrote.

"We've done a terrific job at that, but all those great features won't matter unless customers trust our software.

"So now, when we face a choice between adding features and resolving security issues, we need to choose security. Our products should emphasize security right out of the box, and we must constantly refine and improve that security as threats evolve."

Here's how the International Data Privacy Day organization puts it today.

"In this networked world, in which we are thoroughly digitized, with our identities, locations, actions, purchases, associations, movements, and histories stored as so many bits and bytes, we have to ask - who is collecting all of this data - what are they doing with it - with whom are they sharing it? Most of all, individuals are asking 'How can I protect my information from being misused?' These are reasonable questions to ask - we should all want to know the answers.

"Data Privacy Day promotes awareness about the many ways personal information is collected, stored, used, and shared, and education about privacy practices that will enable individuals to protect their personal information.

Robert Siciliano, an Online Security Evangelist at McAfee, painted a much more negative picture in a blog post yesterday, probably even of the companies participating in International Data Privacy Day. McAfee, though, is owned by Intel, the primary sponsor of the event. Siciliano speaks for many people when he says:

"Lately, it seems that barely a day goes by when we don't learn about a major Internet presence taking steps to further erode users' privacy. The companies with access to our data are tracking us in ways that make Big Brother look like a sweet little baby sister.

"Typically when we hear an outcry about privacy violations, these perceived violations involve some apparently omnipotent corporation recording the websites we visit, the applications we download, the social networks we join, the mobile phones we carry, the text messages we send and receive, the places we go, the people we're with, the things we like and dislike, and so on.

"How do they do this? By offering us free stuff to consume online and infrastructure for the online communities that tie us together. We gobble up their technologies, download their programs, use their services, and mindlessly click 'I Agree' to terms and conditions we haven't bothered to read."

It's a cynical perspective that refers to all the glory of the Interwebs as simply free stuff to consume with mindless clicks.

I think I prefer the description Gates might have offered. The global computer is now rich with features and opportunities, but those will be put at risk if people don't trust the network. Please, Mr. Zuckerberg, don't spoil this opportunity.

http://www.readwriteweb.com/archives/data_privacy_what_bill_gates_said_10_years_ago.php Data Services Sat, 28 Jan 2012 20:46:29 -0800 Marshall Kirkpatrick
Why Facebook's Data Sharing Matters

Facebook has cut a deal with political website Politico that allows the independent site machine-access to Facebook users' messages, both public and private, when a Republican Presidential candidate is mentioned by name. The data is being collected and analyzed for sentiment by Facebook's data team, then delivered to Politico to serve as the basis of data-driven political analysis and journalism.

The move is being widely condemned in the press as a violation of privacy, but if Facebook did this right, it could be a huge win for everyone. Facebook could be the biggest, most dynamic census of human opinion and interaction in history. Unfortunately, the failure to talk prominently about privacy protections, the failure to make this opt-in (or even opt-out!) and the inclusion of private messages all put at risk any remaining shreds of trust in Facebook that could have served as the foundation of a new era of social self-awareness.


We, ok I, have long argued here at ReadWriteWeb that aggregate analysis of Facebook data is an idea with world-changing potential. The analogy from history that I think of is real estate redlining. Back in the middle of the last century, when US Census data and housing mortgage loan data were both made available for computer analysis and cross-referencing for the first time, early data scientists were able to prove a pattern of racial discrimination by banks against people of color who wanted to buy houses in certain neighborhoods. The data illuminated the problem and made it undeniable, thus leading to legislation to prohibit such discrimination.

I believe that there are probably patterns of interaction and communication of comparable historic importance that could be illuminated by effective analysis of Facebook user data. Good news and bad news could no doubt be found there, if critical thinking eyes could take a look.

"Assuming you had permission, you could use a semantic tool to investigate what issues the users are discussing, what weight those issues have in relation to everything else they are saying and get some insights into the relationships between those issues," writes systemic innovation researcher Haydn Shaughnessy in a comment on Forbes privacy writer Kashmir Hill's coverage of the Politico deal. "As far as I can see people use sentiment analysis because it is low overhead; the quickest, cheapest way to reflect something of the viewpoints, however fallible the technique. Properly mined though you could really understand what those demographics care about."

Several years ago I had the privilege to sit with Mark Zuckerberg and make this argument to him, but it doesn't feel like the company has seized the world-changing opportunity in front of it.

Facebook does regularly analyze its own data, of course. And sometimes it publishes what it finds. For example, two years ago the company cross-referenced the body of its users' names with US Census data that tied last names and ethnicity. Facebook's conclusion was that the site used to be disproportionately made up of White people - but now it's as ethnically diverse as the rest of America. Good news!

But why do we only hear the good news? That millions of people are talking about Republican Presidential candidates might be considered bad news, but the new deal remains a very limited instance of Facebook treating its user data like the platform that it could be.

It could be just a sign of what's to come, though. "This is especially interesting in terms of the business relationships--who's allowed to analyze Facebook data across all users?" asks Nathan Gilliatt, principal at research firm Social Target and co-founder of AnalyticsCamp. "To my knowledge, they haven't let other companies analyze user data beyond publicly shared stuff and what people can access with their own accounts' authorization. This says to me that Facebook understands the value of that data. It will be interesting to see what else they do with it."

I've been told that Facebook used to let tech giant HP informally hack at their data years ago, back when the site was small and the world's tech privacy lawyers were as yet unaroused. That kind of arrangement would have been unheard of for the past several years, though. Two years ago, social graph hacker Pete Warden pulled down Facebook data from hundreds of millions of users, analyzing it for interesting connections before planning on releasing it to the academic research community. Facebook's response was assertive and came from the legal department. Warden decided not to give the data to researchers after all. (Disclosure: I am writing this post from Warden's couch.)

"Like a lot of Facebook's studies, this collaboration with Politico is fascinating research, it's just a real shame they can't make the data publicly available, largely due to privacy concerns" bemoans Warden. "Without reproducability, it loses a lot of its scientific impact. With a traditional opinion poll, anyone with enough money can call up a similar number of people and test a survey's conclusions. That's not the case with Facebook data."

"Everyone is going 'gaga' over the potential for Facebook," says Kaliya Hamlin, Executive Director of a trade and advocacy group called the Personal Data Ecosystem Consortium.

"The potential exists only because they have this massive lead (monopoly) so it seems like they should be the ones to do this.

"Yes we should be doing deeper sentiment analysis of peoples' real opinions. But in a way that they are choosing to participate - so that the entities that aggregate such information are trusted and accountable.

"If I had my own personal data store/service and I chose to share say my music listening habits with a ratings service like Neilson - voluntarily join a panel. I have full trust and confidence that they are not going to turn on me and do something else with my data - it will just go in a pool.

"Next thing you know Facebook is going to be selling to the candidate the ability to access people who make positive or negative comments in private messages. Where does it end? How are they accountable and how do we have choice?"

Not everyone is as concerned about this from a privacy perspective. "There are many things in the online world that give me willies for Fourth-Amendment-like reasons," says Curt Monash of data analyst firm Monash Research. "This isn't one of them, because the data collectors and users aren't proposing to even come close to singling out individual people for surveillance."

Monash's primary concern is in the quality of the data. "There's a limit as to how useful this can be," he says. "Online polls and similar popularity contests are rife with what amounts to ballot box stuffing. This will be just another example. It is regrettable that you can now stuff an online ballot box by spamming your friends in private conversation."

It doesn't just have to be about messages, though. Social connections, Likes and more all offer a lot of potential for analysis, if it's done appropriately.

"We need trust and accountability frameworks that work for people to allow analysis AND not allow creepiness," says Hamlin.

Two years ago social news site Reddit began giving its users an option to "donate your data to science" by opting in to have activity data made available for download. Massive programming Question and Answer site StackOverflow has long made available periodic dumps of its users' data for analysis. "You never know what's going to come out of it," StackOverflow co-founder Joel Spolsky says about analysis of aggregate user data.

The unknown potential is indicative not just of how valuable Facebook data is, but potentially of the relationship between data and knowledge generally in the emerging data-rich world.

That's the thesis of author David Weinberger's new book, Too Big to Know. "It's not simply that there are too many brickfacts [datapoints] and not enough edifice-theories," he writes. "Rather, the creation of data galaxies has led us to science that sometimes is too rich and complex for reduction into theories. As science has gotten too big to know, we've adopted different ideas about what it means to know at all."

The world's largest social network, rich with far more signal than any of us could wrap our heads around, could help illuminate emergent qualities of the human experience that are only visible on the network level.

Please don't mess up our chance to learn those things, Mr. Zuckerberg.

http://www.readwriteweb.com/archives/why_facebooks_data_sharing_matters.php Analysis Fri, 13 Jan 2012 19:21:33 -0800 Marshall Kirkpatrick
After Years of Missteps, Facebook's Timeline is an Epic Win

Facebook's new Timeline profile feature is great, even if it is a little strange. It's narcissistic, but that's a big part of the fun of it, and I'm not sure that other people's timelines are nearly as interesting as mine is to me.

It's an incredibly feature-rich new type of social network profile. It's a re-imagination of what a profile can be. It makes me want to use Facebook more, to share more data with Facebook so that it can be preserved and displayed so nicely, years into the future. While other Facebook features have pushed users into posting publicly by default, or posted their activities from other places they didn't understand would become part of the public record, I think Timeline is a genuine value add to incentivize users to share more. I think it's great.

Data is at the heart of the Facebook Timeline, your data - about your life, about your activities as recorded on Facebook and about your social connections. The music you listen to, the places you go and the things you do. Insights and experiences built on top of data are going to be a big part of the future of human/computer interactions. Facebook Timelines are a great first look at that idea for hundreds of millions of people. They are also something that Twitter can never do, for both technical and cultural reasons.

It's one thing to see this data all in a News Feed, as Facebook has long shown it; it's fundamentally different to see Yourself and Others presented like a work of art in this new Timeline layout.

By highlighting the content you've published that has received the most social engagement, in the form of comments and Likes, your Facebook Timeline takes its best shot at presenting your Best Self to the world. The mundane updates are hidden in the background and the highlights of your life, if you posted about them on Facebook, are programmatically discoverable and now displayed in an attractive page layout.
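As a rough illustration of that idea, here is a minimal sketch that ranks posts by engagement and keeps the top few as highlights. The `Post` class, the one-point-per-Like-or-comment scoring and the sample posts are all assumptions of mine; Facebook has not published how Timeline actually picks what to feature.

```python
from dataclasses import dataclass


@dataclass
class Post:
    text: str
    likes: int
    comments: int


def engagement(post):
    # Assumed scoring: one point per Like and one per comment.
    return post.likes + post.comments


def pick_highlights(posts, limit=2):
    """Return the `limit` posts with the highest engagement, best first."""
    return sorted(posts, key=engagement, reverse=True)[:limit]


# Made-up sample posts.
posts = [
    Post("Had coffee", likes=1, comments=0),
    Post("We got married!", likes=120, comments=45),
    Post("New job", likes=60, comments=20),
]
highlights = [p.text for p in pick_highlights(posts)]
```

The mundane coffee update drops out and the wedding rises to the top, which is the effect described above.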

It doesn't work perfectly (my Timeline says that I married my wife 3 times on 3 different dates), but generally speaking it works really well. It looks great on m.facebook.com too.

The Facebook Timeline represents the Instrumentation of Your Life, making things measurable and then building on top of those measurements. It's a big deal in the world of social software.

That Facebook launched such a bold new implementation of every user's data about themselves just months after getting slapped with a 20-year privacy audit requirement from the US government is remarkable.

As Not Seen on Twitter

Meanwhile, over on Twitter, that competing social network can't remember what you did two weeks ago. It does remember, it just won't let you remember. Historical content on Twitter is severely limited.

The company has said officially that's because Twitter is all about the here and now, it's real-time. Unofficially, though, it's said that the root of the problem was in a series of database creation decisions that were made years ago. It would now be super expensive to change that.

There is something about Twitter that's more conversational, more News focused and less conducive culturally to something like Timeline.

For the vast majority of its users, I'd also guess that Twitter accounts post fewer messages and get fewer responses that can be measured to determine highlights than is the case on Facebook.

Facebook also has a lot of structured data in the user's profile and changes to that become events, which social activity swarms around and which then become notable points in your life. You changed your marital status? That's probably going to get a lot of discussion. There is no equivalent on Twitter. Were Twitter to highlight your biggest tweets, they would likely be the wittiest quips you've made over the years, not the real life events.

Twitter is working on convincing people that tweets are great for reading, that it's largely a reading experience. Facebook, on the other hand, has always wanted you to share, share, share.

Many of us are doing things outside of Facebook, though. A lot of that is being shared back into our Newsfeed, but not all of it. I am very impressed with what Facebook has done, but I wish there were some more effective competition out there. There are various startups that have tried to do this, though none anywhere near as well as Facebook's hired and acquired team of world-beating design pros.

I joined Facebook 5 years ago this Fall, according to my Timeline. It's cool to see all that history presented so nicely and it makes me want to put more content into Facebook so I can see it later. I imagine that's the point.

http://www.readwriteweb.com/archives/after_years_of_missteps_facebooks_timeline_is_an_e.php Data Services Fri, 16 Dec 2011 09:05:15 -0800 Marshall Kirkpatrick
It's Carrier IQ's World, We Just Live in It

Somewhere along the complex supply chain of the mobile world's chips, antennas, touchscreens, operating systems and inter-linked cellular networks traveling around the globe - someone has been caught capturing and transmitting more of your data than you'd probably like. There are probably any number of parties doing something similar, but mobile usage data capture service Carrier IQ has been found to have code installed, with the phone companies' blessing, on millions of phones without the knowledge of consumers.

We're all awash in a sea of data, we have been for some time, but as we meet that data we learn that it is made of people. We've met the data tsunami and it is us. That's bound to make a lot of people uncomfortable. If a future based on that data unfolds in the wrong way, it could end up a major hindrance to the quality of human life.

Identity data advocate Kaliya Hamlin warns of "participatory totalitarianism" - a future where freedom of choice and personal expression are squashed by a panopticon we build ourselves using our own technology. It doesn't have to be that way, though. An alternative future can be built based on personal sovereignty and effective policies and standards. The choice is ours, but we need to look beyond the initial fear of being tracked. The Carrier IQ controversy is worth discussing far beyond the actions of this one company alone.

What is Carrier IQ? It's software that delivers data about people's cell phone use to the cellular network carriers. Dropped calls and call quality people can understand. When it comes down to app usage patterns and individual keystrokes, which it was discovered last week the company is tracking and transmitting, that's data many people feel very uncomfortable with.

Apple says it's stopped using Carrier IQ, but millions of Android phones continue to use it. Senator Al Franken has started asking questions.

"Don't Track Me, Bro!"

It's easy to understand why all of this makes people uneasy. I was just thinking about how cool the apparently semi-functional Jawbone Go personal data bracelets were last week when I thought, "but I don't need some futuristic Logan's Run style tracking fashion object around my wrist everywhere I go!" Then I looked down at the hand holding my beloved iPhone.

The future is already here. Our phones pump geo-tagged transaction data into the network at a rate that's 7,000 times the volume of all the blathering in the Twitter Firehose. Data is being understood, according to some leading analysts, as an economic input of equivalent importance to capital and labor. My phone lights up whenever I'm within 50 yards of a historically significant place off-line.

It's awesome and it's terrifying, both.

What's Black, White and Read All Over?

If the future of data is built well, though, then the upside for all of us is huge. The controversy around Carrier IQ runs the risk of throwing a very precious baby out with the bathwater we're uneasily coming to understand. The ultimate question is not whether or not this data will be collected and used - the question is who will control that process? Will it be us, or will it be mysterious corporations we never knew existed?

It's your phone, it's your cloud tablet, it's the invisible framework that keeps the internet accessible and fast - our activities in the networked digital realm are almost always inherently measured and transmitted as a matter of course in delivering the services we love.

But that doesn't mean that all tracking is done right. "It's astounding that a company thinks they can still get away with these always-connected devices," says Scott Kveton (right), CEO of mobile push notification infrastructure and mobile analytics service Urban Airship. "You have to always do the right thing when it comes to your product and services; thinking you can dupe or work outside of the regular rules of engagement is just plain nuts. Do you really need to do key logging to get network performance information? C'mon!"

Kaliya "Identity Woman" Hamlin (left, CC Doc Searls), founder of the Personal Data Ecosystem Consortium, puts it this way.

I think all of this is a huge opportunity for the personal data ecosystem. Because clearly there is value in this data...but you can't get to it if you do it the way Carrier IQ appears to be getting it.

For one thing, it is totally out of alignment with European privacy law. In Europe they have purpose binding so you can collect data for 'a purpose' and you have to tell the user what it is and then keep the purpose with the data. It is illegal to store data without the purpose binding.

The point is though, the data has value. It could be accessed ethically in new market places, oriented around people's control and management - not just this 'opt-in' to us stalking you. Put it in your personal data locker/store/vault/bank and use it as you see fit. Where the user can choose where they store it, who can help them get value from it, and how they are protected from others seeing and poking at it or manipulating and using it for things the user doesn't want.

This is also where accountability frameworks will start to come in - because right now there are really none asserted by people or anyone - but it is reasonable for a carrier to have data on where calls are dropping. So can you have 'frameworks' where that kind of data is available but not the Personally Identifiable Information and tracking bits...and can we audit this?

We want these systems and networks to get better...but 'trust us' isn't really going to work.
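The "keep the purpose with the data" idea Hamlin describes can be sketched in a few lines. This is only an illustration under my own assumptions: the `BoundRecord` class, the purpose labels and the `read_for` check are hypothetical names, and nothing here is a statement of what European law actually requires in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundRecord:
    """A piece of data stored together with the purpose it was collected for."""
    value: str
    purpose: str  # the purpose the user was told about at collection time


class PurposeMismatch(Exception):
    """Raised when data is requested for a purpose it was not collected for."""


def read_for(record, purpose):
    """Return the value only if the requested purpose matches the bound one."""
    if record.purpose != purpose:
        raise PurposeMismatch(
            f"collected for {record.purpose!r}, requested for {purpose!r}"
        )
    return record.value


# Hypothetical example: a dropped-call report collected for diagnostics only.
dropped_call = BoundRecord("call dropped at cell 4411", "network-diagnostics")
diagnostic_view = read_for(dropped_call, "network-diagnostics")
```

Because the purpose travels with the record, a request to reuse the same data for, say, ad targeting fails loudly instead of quietly succeeding. That also gives an auditor something to check.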

Messy, Secret, Private Freedom

Beyond the value of improved network performances and application features, Hamlin also emphasizes the need for people to control their own data and share it selectively in order for us to have the freedom to express different parts of ourselves in different contexts. If our whole lives are thrown into one big data bucket being peered into by robots from all over, that's going to constrict our freedom of movement and action.

Hamlin says that companies in this space are identifying your email, street address, real name and from that are able to "look you up in the databases in the cloud that are tracking everyone and know all about you...without 1) having to ask you 2) respecting your different contexts you may not want linked 3) then they decide they know things about you that are 'inferred' from all that data...(the My TiVo thinks I'm gay problem writ large) and 4) has no sense of decency or relationship that is 'human'."

That all makes sense to me. I know I want the freedom to make decisions without robots lumping all the decisions I've ever made into one giant bucket without my permission. I'll happily share a lot of my data with people I trust and who deliver value to me. But it's not really Carrier IQ's world I live in, this is my life.

http://www.readwriteweb.com/archives/its_carrier_iqs_world_we_just_live_in_it.php Analysis Thu, 01 Dec 2011 22:08:31 -0800 Marshall Kirkpatrick
Spooked By Lax U.S. Data Privacy, European Firms Build Their Own Cloud Services

A few recent legal developments affecting U.S. online privacy have rightfully troubled privacy advocates and civil libertarians on American soil. In addition to the Patriot Act's relaxed regulation of law enforcement's access to private data, recent court rulings have made it clear that U.S. authorities can secretly request data from tech companies without the user ever knowing.

If this seems objectionable from the standpoint of U.S. citizens, imagine how it looks to outsiders who are storing their data there. Some European companies who do business with U.S. technology companies are concerned enough to start looking elsewhere for infrastructure.

Cloudnines and City Network are two Swedish firms that are trying to make the most of European discomfort with the state of online data privacy in the U.S. They're collaborating to build a database-as-a-service solution that is hosted on servers in Sweden, far from the prying eyes of U.S. law enforcement.

The new service allows companies to easily deploy and manage database instances in the cloud while still delivering products to consumers in such a way that complies with EU data protection laws.

A recent survey indicated that 70% of Europeans have concerns about their online data and how well companies secure it. A statement issued by two European politicians said that companies wishing to do business with consumers in Europe, social networks included, should abide by local data privacy laws.

Cloudnines and City Network are pushing the privacy angle when marketing their services, as well as the notion that hosting data nearby (as opposed to across the pond) will improve latency and performance.

Considering growing concern over U.S. privacy developments, some of which is quite reasonable, we can realistically expect to see other firms in Europe and elsewhere follow suit with this type of branding effort.

http://www.readwriteweb.com/archives/spooked_by_lax_us_data_privacy_european_firms_buil.php Cloud Computing Fri, 25 Nov 2011 11:45:11 -0800 John Paul Titlow
100 Years of Dance Music = Data With a Beat

The travel geeks at Thomson have created a data visualization you can dance to. They tracked the top-level dance genres over the past century, and expressed the data as an animated map that moves from parent genre to descendant, proliferating over time.

The mapmakers used data from the books Bass Culture, Last Night a DJ Saved My Life and The All Music Guide to Electronica, as well as Wikipedia. They marked the birth of each genre in five-year periods. Well researched as it might be, however, the exercise wasn't without controversy.

Musical taxonomy is far from an exact science. Everything from your culture and geography to your age and personal tastes can affect how you draw lines of influence from one type of music to another. Thomson acknowledges that, asking for comments on the blog post where they debuted the map. The comments swing wildly back and forth from intriguing to goofy but are definitely worth reading. (If for no other reason than that seeing someone get really mad at the definition of a dance genre is super funny.)

Thomson blogger Osman Khan introduced the map as an incentive for travelers, Thomson's clients.

"Music tourism (visiting a city or town to see a gig or festival) is on the rise. But why stop at gigs and festivals? Why not visit the birthplace of your favourite genre and follow the actual journey various music genres have taken as one style developed into another."

I believe his inspiration was simpler than that. I believe Khan & Co. simply like to boogie-oogie-oogie 'til they just can't boogie no more. Just a theory, of course.

Other sources: okayafrica

http://www.readwriteweb.com/archives/100_years_of_dance_music_data_with_a_beat.php Music Tue, 08 Nov 2011 11:30:00 -0800 Curt Hopkins
New 5 Billion Page Web Index with Page Rank Now Available for Free from Common Crawl Foundation

A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2, was announced today by the Common Crawl Foundation. "It is crucial [in] our information-based society that Web crawl data be open and accessible to anyone who desires to utilize it," writes Foundation director Lisa Green on the organization's blog.

The Foundation is an organization dedicated to leveraging the falling costs of crawling and storage for the benefit of "individuals, academic groups, small start-ups, big companies, governments and nonprofits." It's led by Gilad Elbaz, the forefather of Google AdSense and the CEO of data platform startup Factual. Joining Elbaz on the Foundation board are internet public domain champion Carl Malamud and semantic web serial entrepreneur Nova Spivack. Director Lisa Green came to the Foundation by way of Creative Commons.

The Foundation explains the scope of the project thusly.

"Common Crawl is a Web Scale crawl, and as such, each version of our crawl contains billions of documents from the various sites that we are successfully able to crawl. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition to this, performing data processing operations on a dataset this large requires parallel processing techniques, and a potentially large computer cluster.

"Luckily for us, Amazon's EC2/S3 cloud computing infrastructure provides us with both a theoretically unlimited storage capacity coupled with localized access to an elastic compute cloud."

The organization was formed three years ago but has only now started talking about itself publicly. It believes that free access to all this information could lead to "a new wave of innovation, education and research."

Open Web Advocate James Walker agrees: "An openly accessible archive of the web - that's not owned and controlled by Google - levels the playing field pretty significantly for research and innovation."

http://www.readwriteweb.com/archives/common_crawl_foundation_announces_5_billion_page_w.php http://www.readwriteweb.com/archives/common_crawl_foundation_announces_5_billion_page_w.php Data Services Mon, 07 Nov 2011 15:42:48 -0800 Marshall Kirkpatrick
Government Requests For Google User Data Keep Rising

Google has updated its Government Requests tool with data from the first half of this year. For the first time, the report discloses the number of users or accounts specified, not just the number of requests. Google also made the raw data behind government requests available to the public.

Google launched its interactive transparency report last year. U.S. requests for Google user data have spiked in the past six months, and Google complies 93% of the time. Google's transparency efforts have displeased some governments, but its compliance with requests has upset some civilians, too. In this increasingly weird new world, Google can only err on the side of more transparency while pushing for better laws.

Google has tons of data about us. That's its business. We are its products - good little data-makers - which Google sells to its customers, the advertisers. As a huge storehouse of data about the public, Google is one of many Web services from which the federal government regularly requests user data. Social media investigations are the new wiretap.

How a Web company responds to government requests is a public test of its values, and users should take note. As we saw this month, Google handed over a WikiLeaks volunteer's Gmail contacts and IP address in response to a court order under the Electronic Communications Privacy Act of 1986, which allows the government to demand this information without even notifying the user.

google_government_requests.jpg

Electronic communications have changed a bit since 1986. They form a ubiquitous, always-on fabric of our lives now. Fortunately, Google isn't any happier with the status quo than privacy-aware users are. It's among a number of major Web companies pushing for better laws. And Google and other data-mining companies take their roles in public policy seriously. Google's and Facebook's lobbying efforts both broke records this year.

Do you think Google is doing a good job on transparency? Sound off in the comments.

http://www.readwriteweb.com/archives/google_releases_data_about_gov.php http://www.readwriteweb.com/archives/google_releases_data_about_gov.php Google Tue, 25 Oct 2011 08:37:00 -0800 Jon Mitchell
First Look: The Web's Most Ambitious Personal Data Project, Singly, Goes Live Today

You make data. A lot of it. From Web browsing to link sharing to photos published online, from phone bills to medical records to online banking - almost all of us produce an incredible amount of electronic data that slips right through our fingers - often into the gaping maw of a corporate world without our best interests in mind.

What if you could easily capture all that data yourself, though? What if you could use it like fuel for apps built for you to view, sort and take action based on all that data? What if you could offer selective access to outside parties to that data? That's the vision behind the Locker Project, an open source personal data platform, and Singly, its corporate partner for hosted installs of the data lockers. Singly 1.0 launches to developers today at the Web 2.0 Summit (live at 2:40 PST). It's got financial backing from the leaders of WordPress, TechStars and multiple VC firms and a knock-out team of famous developers building it. What does it look like? Check out the first screen shots below.

Screens at bottom of page.

Rock Stars Seek to Put You on Stage

We first wrote about Jeremie Miller - who created the Instant Messaging technology that almost all IM clients in the world use today - and his team building the Locker Project and Singly in February. (Creator of Instant Messaging Protocol to Launch App Platform for Your Life) Since then, Singly, the corporate front to the non-profit Locker Project (as WordPress.com, or Automattic, is to WordPress.org), has expanded its team and hired a number of other well-known technologists, most notably Matt Zimmerman, former CTO of Ubuntu.

Investors include David Pakman (Venrock partners), True Ventures (Toni Schneider, CEO of WordPress), David Cohen (CEO of TechStars), David Tisch (TechStars NYC), Josh Felser and Dave Samuel (Freestyle Capital), Tim Connors (PivotNorth Capital) and Kal Vepuri. Public advocates of what Singly is doing range from Tim O'Reilly and John Battelle to the CTO of Best Buy and Clay Shirky.

All of these people are advocating that, as John Battelle wrote yesterday about Singly, "we have to start taking control of our own identity and data," because of the "value and benefits that will accrue to us and to society in a culture that values individual control of data."

Here's What's Going Live Today

Singly 1.0 begins rolling out to developers today. Those first users will be able to build apps that search, sort and visualize contacts, links and photos that have been published by their own accounts on various social networks but also by all the accounts they are subscribed to there. Want to search the contents of every link shared by every person you're subscribed to on Twitter (at least as far back as Singly can access)? Want to make a slideshow of all the Instagram photos your contacts have posted that have a certain hashtag in them? Or that were taken on a weekend? Or that meet whatever other criteria you can think of? Those kinds of things are possible now.
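
To make the hashtag and weekend examples concrete, here is a minimal sketch of that kind of filtering in Python. The record layout (`author`, `caption`, `taken`) and the helper names are assumptions made up for illustration; they are not Singly's or the Locker Project's actual data format or API.

```python
from datetime import datetime

# Hypothetical photo records. The field names are assumptions for
# illustration, not the real Singly / Locker Project data format.
photos = [
    {"author": "alice", "caption": "Dinner #portland", "taken": datetime(2011, 10, 15, 19, 30)},
    {"author": "bob", "caption": "Office view", "taken": datetime(2011, 10, 17, 9, 0)},
    {"author": "carol", "caption": "Bridge walk #Portland", "taken": datetime(2011, 10, 18, 12, 0)},
]


def has_hashtag(photo, tag):
    """True if the caption contains the hashtag as a whole word, ignoring case."""
    words = photo["caption"].lower().split()
    return "#" + tag.lower() in words


def on_weekend(photo):
    """True if the photo was taken on a Saturday (5) or Sunday (6)."""
    return photo["taken"].weekday() >= 5


def select(records, *criteria):
    """Keep only the records that satisfy every criterion."""
    return [r for r in records if all(test(r) for test in criteria)]


tagged = select(photos, lambda p: has_hashtag(p, "portland"))
weekend_tagged = select(photos, lambda p: has_hashtag(p, "portland"), on_weekend)

print([p["author"] for p in tagged])
print([p["author"] for p in weekend_tagged])
```

Each criterion is just a function that takes a record and returns true or false, so "whatever other criteria you can think of" amounts to writing one more small function and passing it to `select`.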

The apps will live on Github and will deploy on Singly, for now. You'll be able to use someone else's app, but only to visualize your own data. You can't yet ask for permission to see someone else's data. Below, a dashboard view of data being searched and then a selection of viewing apps for photos. Click the images to view them full screen.

singlypage1.jpg

singlypage2.jpg

What you see above are a few different apps. Right now they all live in Github and are pointed to from a wiki. More and more functionality will be brought directly into Singly in time, though, and by the next quarter the company says that front end developers will be able to write visualization apps directly against the platform with no need to build their own back end processing capabilities.

In the near-term future, app developers will be able to access consumer-grade health tracking data from devices like Fitbit and financial data in the form of emailed receipts. Singly will build connectors, and clean up and structure the data for apps to be built on top of.

Will people care? Will developers be able to build apps so compelling that end users will hook up Lockers to capture their data? It might be too geeky, but a whole lot of very smart people are putting their heads together to make it real.

http://www.readwriteweb.com/archives/singly_platform_launch.php http://www.readwriteweb.com/archives/singly_platform_launch.php Data Services Wed, 19 Oct 2011 14:00:00 -0800 Marshall Kirkpatrick
Report: 7% of U.S. Web Traffic From Handheld Devices

According to new data from comScore, 6.8% of Web traffic in the U.S. comes from "non-computer" devices such as smartphones and tablets. This is an increase from 6.2% in the previous quarter.

Phones account for the majority of non-computer traffic. Mobile phones drive 4.4% of total digital traffic, tablets contribute 1.9%, and other non-computer devices send 0.5% of traffic.
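
As a quick check on those figures, the short Python sketch below adds the three shares back up to the 6.8% total and works out how the non-computer slice itself divides. The percentages come from the report as quoted above; the variable names are only illustrative.

```python
# Shares of total U.S. traffic, in percent, as quoted from comScore.
shares = {"mobile phones": 4.4, "tablets": 1.9, "other devices": 0.5}

# The three categories should add up to the 6.8% headline figure.
total = round(sum(shares.values()), 1)

# Each category's portion of the non-computer slice, in percent.
portion = {name: round(100 * value / total, 1) for name, value in shares.items()}

print(total)
for name, pct in portion.items():
    print(name, pct)
```

On these numbers, phones make up a little under two thirds of non-computer traffic and tablets a little over a quarter.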

Digital-Omnivores_Data-Gem-2_U.S.jpeg

The comScore data come from a recent report entitled Digital Omnivores: How Tablets, Smartphones and Connected Devices are Changing U.S. Digital Media Consumption Habits. The white paper is available for free (with registration) from comScore's website.

We reported earlier this year that worldwide mobile data traffic is expected to increase 26-fold to 75 exabytes per year (!) by 2015. That's 19 billion DVDs, just to give you a sense. To put it another way, that's 75 times the size of the entire Internet in the year 2000. The mobile revolution is underway, and it behooves those who make Web content to get onboard.
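
For a rough sense of scale, the sketch below works backwards from the figures quoted above. It assumes decimal units (one exabyte is 10^18 bytes) and simply divides; the starting level and per-DVD size it prints are implied by the article's numbers, not taken from any separate source.

```python
EXABYTE = 10**18   # bytes, decimal definition (an assumption about the units)
GIGABYTE = 10**9   # bytes, decimal definition

projected_traffic = 75 * EXABYTE   # per year by 2015, as quoted
growth_factor = 26                 # "26-fold" increase, as quoted
dvd_count = 19 * 10**9             # "19 billion DVDs", as quoted

# Starting level implied by a 26-fold increase to 75 exabytes per year.
baseline_eb = projected_traffic / growth_factor / EXABYTE

# Capacity per DVD implied by the 19 billion figure.
dvd_gb = projected_traffic / dvd_count / GIGABYTE

print(round(baseline_eb, 2), "exabytes per year at the starting level")
print(round(dvd_gb, 2), "gigabytes per DVD")
```

The implied starting point is just under 3 exabytes per year, and the DVD comparison works out to roughly 4 GB per disc.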

How do you split up your Web use between desktop/laptop, mobile and tablets? Tell us in the comments.

http://www.readwriteweb.com/archives/report_7_of_us_web_traffic_comes_from_handheld_dev.php http://www.readwriteweb.com/archives/report_7_of_us_web_traffic_comes_from_handheld_dev.php Mobile Mon, 10 Oct 2011 14:30:00 -0800 Jon Mitchell
Bankers Go Bonkers Over Big Data's Future

With quadrillions of dollars on the line, banks and financial institutions pay close attention to the emerging exaflood of available data about their customers and the world around them. Here at the O'Reilly Strata conference on big data, the panel on big data in the banking world was fascinating. It's probably an indication of the way the rest of the world will move in the near future - at least if you believe the predictions of the people on the panel.

Huge opaque markets are about to become transparent because of new regulations and that means a whole lot of new data available for analysis. Scalable processing of that data will require outsourcing, giving birth to new industries. Millions of people will need to be trained to deal with all this. Below, my notes from this fascinating panel discussion.

Can big data help banks avert the next financial crisis? Could regulation resulting from the last crisis yield newly available data that could become new mega-resources for innovation themselves? Those were among the topics discussed.

Big Data In Banking

Moderated by:
Abhishek Mehta (Tresata)
Panelists:
Roy E. Lowrance (New York University), Richie Prager (BlackRock), Allen Weinberg (McKinsey)
11:20am Wednesday, 09/21/2011

My notes...

Richie Prager: The Dodd-Frank Wall Street Reform and Consumer Protection Act changes the OTC derivatives market. That market was the domain of large financial institutions, measured in hundreds of billions of notional dollars.

Everything in that market that was opaque will now have to execute contracts with transparency, all the executions will get turned into data and all the risk is reported to regulators. What was very opaque will now be a transparent market, reams of new data available.

Allen Weinberg (McKinsey): All kinds of businesses will need to learn to work with this data, new providers will need to emerge to serve them. They have to get outside of silos, it will be a dramatic opening up. Banks won't have enough capital to build it all themselves, it will have to be an open source model to build the infrastructure.

"We manage over $3 trillion and we believe in analytics so much that we've created a dedicated research staff. We actively use tools to understand everything from trading cost analytics to capacity of a certain trade, performance metrics for our traders. Now it's about how you create Alpha opportunities to outperform based on that data." - Richie Prager, BlackRock
Roy E. Lowrance (New York University): Data mining projects can take all day to think of an idea, then days or months to run. We need to figure out how to process it efficiently, with a consistent load. The financial data that is becoming newly available will need some smoothing and story extraction - that's something that will be outsourced. Lots of outsourcing will happen.

Abhishek Mehta (Tresata): Do you think our industry will embrace mining of data outside our own walls?

Lowrance: As new data sets become available, everyone will need access to them, cost will become an issue - outsourced providers can scale and you can't.

Prager: Data providers are already here and big, third-party data is already key. It's a growing space.

Weinberg: New interesting companies include data cleansing, data management - and how do we get a single view of the customer? New names you haven't heard of will become household names very quickly just because of the serious need for their services.

Weinberg: Most of the companies in banking grew through silos and thus haven't had a good single view of customer. Even if you had the data in one place - how do we manage then to execute? If we had visibility and transparency into the exact credit on every distinct loan, if you can see the whole customer then it's no longer a mystery and we can avoid the mistaken assumptions of the past.

Abhishek Mehta (Tresata) is moderating this conversation very well and is a model I should follow for moderating panels like this myself.

Prager: For institutional investors, we manage over $3 trillion and we believe in analytics so much that we've created a dedicated research staff. We actively use tools to understand everything from trading cost analytics to capacity of a certain trade, performance metrics for our traders. Now it's about how you create Alpha opportunities to outperform based on that data.

Lowrance: You have to crunch the MBA stuff, figure out how to use advanced number crunchers, make use of them. Capital One had the very best analytics and they believed them. They executed based on that. That's the hard part.

Weinberg: It's well developed to see what the best data is, run it through intelligent analysis, though there is some concern about getting lost in the model. On the retail side, though, people are stuck asking where the value is. You still need to sell a single thing to a single customer. We need the ability to test quickly; small offers, tested quickly, then ramp up. We need to know what the total size of the pie is - then we'll see the money start pouring in. You get sophisticated, expensive tools and a lot of people sign up but don't use them a lot. First people will make a lot of money, then it will become commoditized, but we're at the beginning of that curve.

Lowrance: Key training opportunity is with middle managers. You have to get them trained and aware of the possibilities.

http://www.readwriteweb.com/archives/bankers_go_bonkers_over_big_datas_future.php http://www.readwriteweb.com/archives/bankers_go_bonkers_over_big_datas_future.php Data Services Wed, 21 Sep 2011 10:10:20 -0800 Marshall Kirkpatrick
Life in the Future, With Data: Livestreaming O'Reilly's Strata Conference

"Big data enables new ways to create value, it's going to change the basis of competition," Michael Chui of the McKinsey Global Institute said this morning to kick off O'Reilly's big data conference, the Strata Summit. The next two days are all about the rise of information that has to be dealt with at scale, big data, and its consequences. "It will change the way companies, sectors and economies compete," says Chui.

McKinsey published an exhaustive 150 page report on big data this Spring, which argued that data will soon become an economic input as important as labor and capital. It's not just about pure economics, though. As Edd Dumbill, chair of Strata, put it today, our relationship with big data needs to serve humans - not turn humans into the servants of machines and information overload. "We know that big data can help us, it may be the case that big data has to help us." Below, a live video stream of the next two days' proceedings addressing this mega-opportunity and trend.

http://www.readwriteweb.com/archives/life_in_the_future_with_data_livestreaming_oreilly.php http://www.readwriteweb.com/archives/life_in_the_future_with_data_livestreaming_oreilly.php Data Services Tue, 20 Sep 2011 06:35:39 -0800 Marshall Kirkpatrick
How Delicious Can be Saved

What do you get when you collect and categorize the reading interests and intentions of millions of people exploring around the web? Fans of social bookmarking service Delicious have always believed you get a big win-win: bookmarkers are able to access links of interest to them later, from any computer, and the rest of us get to watch from the outside and discover interesting new links in the wake of all that saving.

Delicious didn't really work out that well in the long run, though, and, five years after it was acquired, then neglected, by Yahoo, it was bought this spring by a team led by YouTube co-founders Chad Hurley and Steve Chen. Jenna Wortham of the New York Times caught up with the new company this weekend and reported on some of the thinking behind the forthcoming rebirth of Delicious. What it needs, I believe, is to be easier to use, more relevant and more attractive in design.

Most everyone agrees that the biggest problem in growing Delicious has been its sparse, utilitarian design; something believed to turn off mainstream users when they come to the site. Chen and Hurley say that mainstreaming the site is one of their primary goals for the next version.

Below: A thing of beauty, but it could use pictures.
delcsa.jpg

In order to make Delicious appealing to a wider variety of people than the web tech tinkerers who have appreciated it to date, Chen and Hurley say they plan on turning it into a destination site with:

  • Topical "stacks" of multimedia content on particular topics, like a big event in the news.
  • Bundles of links curated regarding a particular topic, like planning a vacation to a particular place.
  • Personalized recommendations, hopefully based on aggregate data collected from the site and a user's own behavior. It's interesting; that's roughly related to what Delicious founder Joshua Schachter is now doing with his new site, Jig. It's very algorithm driven, under the covers.

That all sounds good. But I think it may need more. It's all about helping new and non-technical users, maybe users who are less likely to explore a complex website, to capture the network effects of everyone else's bookmarking with minimal work on their part. The old 90/10 rule may be applicable: if only 10% of the visitors to Delicious are actively bookmarking links themselves, and everyone is reading and searching, that could be a great turn of events. Hopefully a growing number of people will come to read and then 10% will convert and the ranks of the taggers will grow.

What I Think Delicious Will Need

I love Delicious. I think I've probably made use of it in ways that few people have (unfortunately) and have captured huge amounts of value from it. I really want it to thrive. Here's what I think the new team behind it ought to consider. In response to this post, former Delicious product manager Simon Davison said to me on Twitter this morning, "All of those [ideas] and 200+ more were included as a part of the internal wiki that came with Delicious."

  1. Delicious is really a search engine, in large part. People are arguably growing disillusioned with the search offerings of Google and Bing and Google is already looking to serve up "what you want before you know to ask for it." That's something Delicious could help with.

    The company has a huge collection of legacy bookmarks, links validated by a human intention to read them and manual assignment of topic categories. That backlog should be made use of. I tell people all the time, when they ask me questions about Web technologies, to go look it up on Delicious. You want to know about the Semantic Web? Go check out http://www.delicious.com/popular/semanticweb You want to know some cool things about Portland, Oregon? http://www.delicious.com/popular/portland is a good place to start. Delicious should build the capacity to find popular links with two tags for more sophisticated structured searches. For example, we should be able to search http://www.delicious.com/popular/portland+coffee and not just http://www.delicious.com/tag/portland+coffee (though that's cool too). That should work without having to look at URLs, which apparently most people can't be allowed to see lest they wet themselves, but being able to keep using those same URL structures if you're a grownup is important as well.

  2. There needs to be some passive tagging enabled. The whole tagging experience should be made smoother and users ought to be able to opt in to having some categorization done automatically. The bulk of Delicious bookmarks already in the archives can help inform an algorithm that does that. Requiring that people bookmark and tag everything is kind of a drag, though, and an unnecessary burden to impose on users. In a world of real-time search and sharing people simply don't go to the trouble of bookmarking links that often; links are easy enough to recall later if they are really important.

    I would happily allow Delicious to automatically bookmark all the links I open and even propose tags for them. How about a subtle little pop-up in the corner of my page that says "page bookmarked and tagged as..." with 2 or 3 tags applied automatically. I can click to remove any bad ones, nuke them all if they are all wrong, or click a button to do it manually and apply my own tags. That would be awesome. If Delicious itself doesn't build that interface, someone else ought to, on top of the API.

  3. Mobile saving and reading should be a much bigger part of Delicious than it is. How often do you find yourself on your phone with a few minutes of free time that could be good for reading links you saved back at home? How about finding links on your phone and saving them for later reading when back home? Both of these things happen all the time now - whereas they didn't happen at all when Delicious was born.
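
The URL structures mentioned in the first idea above are regular enough to sketch in a few lines of Python. This is only an illustration of the pattern: the function name is made up, and the two-tag `/popular/` form is the feature being asked for in this post, not something Delicious is known to serve.

```python
from urllib.parse import quote

BASE = "http://www.delicious.com"


def delicious_url(*tags, popular=False):
    """Build a tag URL in the style quoted in the post (hypothetical helper)."""
    if not tags:
        raise ValueError("at least one tag is required")
    section = "popular" if popular else "tag"
    # Tags are joined with a literal "+", matching the URLs quoted above.
    joined = "+".join(quote(tag, safe="") for tag in tags)
    return BASE + "/" + section + "/" + joined


single_popular = delicious_url("semanticweb", popular=True)
two_tag = delicious_url("portland", "coffee")
# Wished for in the post; treat this form as an assumption, not a live endpoint.
two_tag_popular = delicious_url("portland", "coffee", popular=True)

print(single_popular)
print(two_tag)
print(two_tag_popular)
```

Keeping the structure this predictable is what lets grownups keep typing URLs by hand while everyone else gets a search box on top of the same thing.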

Those are a few of my ideas for how Delicious could be saved. I hope the new team can pull it off, however they go about trying. The idea of mass folksonomic categorization of the web, built on the data of casual web activity and served up for subsequent exploration and subscription is a beautiful, beautiful vision.

http://www.readwriteweb.com/archives/how_delicious_can_be_saved.php http://www.readwriteweb.com/archives/how_delicious_can_be_saved.php Analysis Mon, 12 Sep 2011 08:13:14 -0800 Marshall Kirkpatrick
Visualizing the Local Effects of Recovery Spending on Job Loss [Interactive Map]

In the wake of U.S. President Obama's speech on jobs last night, we present this mapping of Recovery Act spending. Development Seed, the same folks who mapped the famine in the Horn of Africa, have turned their attention to America.

Development Seed has mapped Recovery Act spending on a county-by-county basis and compared it with county unemployment figures over the same time period. So, does government spending have a positive effect on job recovery? That would be telling and we're going to abide by the doctor's prescription not to tell when you can show. The map is after the jump.


The change in unemployment over the last year is reflected in the colors, with red indicating an increase and green indicating a lower unemployment rate, or job growth. Counties that received under $10 million in recovery funds show a white hash pattern. The counties with the most spending - about a third - are shown in solid colors.
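
The styling rules described above can be written down as a small function. The Python below is a sketch of those rules as this post describes them, not Development Seed's actual code; the function name, the "neutral" case for an unchanged rate, and the choice to treat every county at or above $10 million as solid are assumptions.

```python
HATCH_THRESHOLD = 10_000_000  # dollars; below this, the post says counties are hashed


def county_style(change_in_rate, recovery_dollars):
    """Return (color, fill) for a county, following the legend described above.

    change_in_rate: change in the unemployment rate over the last year,
    in percentage points (positive means unemployment went up).
    """
    if change_in_rate > 0:
        color = "red"      # unemployment increased
    elif change_in_rate < 0:
        color = "green"    # unemployment fell, or job growth
    else:
        color = "neutral"  # assumption: the post doesn't say how "no change" is drawn
    # Assumption: anything at or above the threshold is drawn solid.
    fill = "hatched" if recovery_dollars < HATCH_THRESHOLD else "solid"
    return color, fill


print(county_style(1.2, 25_000_000))
print(county_style(-0.8, 3_000_000))
```

Keeping color and fill as two separate outputs mirrors the map's design: one channel shows the jobs trend and the other shows how much spending the county received.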

Dave Cole discussed the results of the mapping on Development Seed's blog.

"Overall, it's impossible to tell for sure how much recovery spending improved the economic situation, because we just don't know how bad things could have been. It may be the case that without spending, this map would have a lot more red. Or maybe not. What's interesting here is the local impact and information we are able to see from processing a few sets of open data."

If you liked this, you may also be interested in the data visualization mapping Audrey Watters covers in her post for O'Reilly's Radar. That one maps U.S. job losses by location since 2004.

http://www.readwriteweb.com/archives/visualizing_the_effects_of_recovery_spending_on_jo.php http://www.readwriteweb.com/archives/visualizing_the_effects_of_recovery_spending_on_jo.php Location Fri, 09 Sep 2011 11:30:00 -0800 Curt Hopkins
In a Bold Move Towards Accountability, Road Casualty Data Published Online in UK

The UK government has published 5 years of nation-wide road safety and casualty data freely online on a map that anyone can view in a web browser. It's a remarkable instance of data-driven public accountability; presumably citizens will use this newly accessible data to apply pressure on government agencies regarding safety improvements. Citizens and researchers will also be able to cross-reference the location of troubled roadways with race and class demographic analysis to illuminate any inequitable allocation of infrastructure resources. It's a bold and enabling action to take online.

The statistics were gathered by independent researchers and put online using eSpatial OnDemand GIS and OpenStreetMap. OpenStreetMap is like the Wikipedia of world and local maps, but it's also a popular data platform that many other applications make use of. Map nerds should watch the OpenStreetMap annual conference, State of the Map, for more exciting map and geodata news. The conference opened this morning in Denver, Colorado.

Geodata industry writer Matt Ball published an eloquent, in-depth explanation this morning of how the geo industry is moving away from domination by legacy commercial software providers and toward a future where extensive value and opportunities are created by open source and open data communities working together on the web.

espatial.jpg

http://www.readwriteweb.com/archives/in_a_bold_move_towards_accountability_road_casualt.php http://www.readwriteweb.com/archives/in_a_bold_move_towards_accountability_road_casualt.php Data Services Fri, 09 Sep 2011 09:14:48 -0800 Marshall Kirkpatrick