Crowdsourcing and Data Validation

I had an engaging conversation yesterday with my colleague Lokman Tsui from Harvard’s Berkman Center. Lokman was interested in learning more about Ushahidi. The conversation touched on the topic of data validation, an ongoing challenge in the field of conflict early warning/response (and indeed in many fields that involve information collection).


I have heard many criticisms about the perceived lack of rigorous data validation in crowdsourcing exercises like Ushahidi’s. I am starting to find the repetitive nature of these criticisms somewhat amusing. (Note that Lokman himself was not criticizing, but rather asking perceptive and informed questions, which is far more conducive to a fruitful conversation.)

What I find amusing about the repeated criticisms of Ushahidi and data validation is how elementary they tend to be. The critics seem to assume that the good folks at Ushahidi haven’t given any thought to the challenges of data quality control. I’d really like to know what the basis for such assumptions is.

When I first got in touch with the team in early January 2008, it was perfectly clear that they were already thinking about data validation. They were equally serious about learning as much as they could from the field of conflict early warning/response in order to improve their future efforts in this regard.

This hasn’t stopped the critics from repeating their disapproving statements about Ushahidi’s approach. The purpose of this post is to try and move the conversation on data validation forward with the hope that the discourse on this topic will cease to sound like a broken record.

A former professor of mine at Columbia University encouraged us to address problems by formulating the following questions: (1) What is the question? (2) Compared to what? (3) According to whom?

What is the question?

How does Ushahidi carry out data validation? When Ushahidi receives a new alert, the team can validate the information with any available reports from the news media. The team can also contact the individual who reported the alert to ask for further details on the event.

Let’s face it: if someone sends in an alert and subsequently gets a call from Ushahidi, determining whether that person is fabricating information isn’t impossible. A few questions asking for specific details will make it apparent whether or not the person is lying.

Compared to what?

How do other initiatives carry out data validation? Take, for example, the Conflict Early Warning and Response Network (CEWARN), a regional inter-governmental initiative in the Horn of Africa that has a hierarchical two-step validation protocol.

Incident reports (alerts) are submitted to Country Coordinators (CCs) by Field Monitors (FMs). If CCs find a report questionable, they will ask FMs for more information to validate the report. CCs then submit the report to CEWARN Headquarters. If analysts at HQ have concerns about the validity of a report, they communicate them to the CC, who will in turn request further information from the FM.

Here’s the catch, though: Field Monitors are based in rather remote and rural locations while Country Coordinators are based in capital cities. Both are only employed part-time. HQ analysts are based in Addis Ababa. The system makes use of neither SMS nor GPS coordinates. Having worked on the implementation of CEWARN, I can confirm that the reporting and data validation process took between two and four weeks.

Surely Ushahidi’s approach is more efficient given the direct link between the person who submitted the alert and the Ushahidi team.

Another example of data validation is that of the mainstream media. Professional journalists are required to have credible sources and to crosscheck their sources before publishing a news article. Is Ushahidi’s approach that different?

According to whom?

Who is asking the question? Or rather, who is doing the criticizing? Those questioning the data quality of Ushahidi alerts are more often than not academics and/or Westerners. These individuals expect a level of data quality that matches what they have in the US and Europe, where data validation processes are more institutionalized (given that they’ve been around for longer).

These critics assume that they are the intended users of the Ushahidi platform. Isn’t that a little egocentric? The purpose of crowdsourcing crisis information a la Ushahidi is to increase the situational awareness of those who find themselves facing escalating social tensions and violent conflict so they can make better decisions about how to get out of harm’s way. (Please see my previous post on Ushahidi-DRC for context).

Let’s place ourselves in their shoes. If an armed rebel group is moving towards our small rural village in Eastern DRC, wouldn’t we want to know even if the information was unconfirmed? Wouldn’t we want to know so we could at least take some precautionary measures or at least think about how to determine whether the alert was credible? I think we would.

Incidentally, Ushahidi should tag their alerts as either “credible” or “unconfirmed” so that end users subscribed to the alerts can at least get a sense of how reliable the alerts are.
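
To make that suggestion a little more concrete, here is a minimal sketch in Python of what such tagging could look like. It is purely illustrative and not Ushahidi’s actual code: the `Alert` class, its fields and the `tag_alert` function are hypothetical names, and the rule that a single passed check is enough to call an alert “credible” is an assumption made for the sake of the example.

```python
from dataclasses import dataclass
from enum import Enum


class Credibility(Enum):
    UNCONFIRMED = "unconfirmed"
    CREDIBLE = "credible"


@dataclass
class Alert:
    """Hypothetical alert record; field names are assumptions, not Ushahidi's schema."""
    text: str
    corroborated_by_media: bool = False  # a matching news media report was found
    confirmed_by_sender: bool = False    # the sender gave consistent details on follow-up


def tag_alert(alert: Alert) -> Credibility:
    # Assumed rule: one passed check is enough to label the alert "credible".
    if alert.corroborated_by_media or alert.confirmed_by_sender:
        return Credibility.CREDIBLE
    return Credibility.UNCONFIRMED


def format_for_subscribers(alert: Alert) -> str:
    # Prefix the alert text with its label so subscribers see the reliability first.
    return f"[{tag_alert(alert).value}] {alert.text}"


if __name__ == "__main__":
    print(format_for_subscribers(Alert("Armed group seen near the village")))
    print(format_for_subscribers(
        Alert("Road blocked north of town", confirmed_by_sender=True)))
```

The point of keeping the label separate from the alert text is that an unconfirmed alert still goes out immediately; the tag simply tells the recipient how much weight to give it.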

None of this implies that Ushahidi is perfect, but then again, no one at Ushahidi claims it is. So I hope the critics give it a rest and join in the constructive brainstorming.


12 responses to “Crowdsourcing and Data Validation”

  1. Great post and I’m so happy someone has done it. Honestly, this has been a concern of mine as well. I know that anonymity is a HUGE priority for the folks at Ushahidi. While I claim to be no expert in this field, like you are, I’m going to take a stab at a bit of critique if I may…

    What’s the Question:
    1. The whole point in this system and crowd sourcing is that there are no media reports yet to check and validate the info.
    2. There’s no way on earth Ushahidi or anyone else using the service will be calling people to validate the info. Crowd sourcing involves a TON of users with a ton of info. You’d have to have an army of people on the phones to scratch the surface. I don’t even think the ‘threat’ of validating the info by a phone call exists.

    Compared to What:
    Personally, I am not a fan of this question. When I called my non-profit and complained that our online giving system was antiquated, they compared it to other organizations that were worse. The worst excuse I’ve ever heard. We’re called to excellence (as Christians and others who believe in doing moral & ethical acts), let’s get it right no matter what others are doing.

    Who’s asking the question:
    I couldn’t agree more. Ushahidi was initially just built for use by Kenyans, but now it’s for everyone, world wide. So to some extent, my western point of view is just as valid as anyone else’s, since it’s for me just as much as anyone else. Who knows where the next incident in the world will happen and who it will involve. Remember when it was used for the California wild fires? The creators and users of the data were all westerners.

    As ‘developing countries’ become more and more aware and able to use the new web 2.0 (like Hamas does) who knows what they will do. Even though we’re on ‘the good guys’ side’ who’s to say they won’t start making virtual war with tools like this? I hope they don’t ever do that, but that’s the question people are asking I think.

    • Hi Taylor,

      Many thanks for reading and for your informative reply, really helps the brainstorming. Some preliminary thoughts regarding your very good points.

      1. Yes, much of the rationale behind Ushahidi is that it provides a platform for documenting human rights abuses that would otherwise go undocumented. But that doesn’t mean there are never any available media reports. Take Kenya and Gaza, for example. I agree that this doesn’t solve the entire problem, but then again I’m not suggesting it does. I’m simply suggesting that the information vacuum is actually not empty all the time. We need to think about such issues more in terms of information ecosystems.

      2. Your argument would suggest that Wikipedia is an impossibility. I’m a big fan of Yochai Benkler’s work, “The Wealth of Networks” and believe that Ushahidi can crowdsource the filter:

      Ushahidi is already replying, at least by text, to try and validate information. For particularly important alerts, I see no reason why Ushahidi could not call the sender back and get more information. The point is not to have a 100% effective filter but “good enough”; and by good enough I mean that relative to other information collection methods. Furthermore, Ushahidi could let users know when they submit an alert that they may be contacted to validate the information. This could provide the implied oversight necessary to discourage individuals from fabricating alerts, ie, as deterrence.

      3. I think “compared to what?” is an extremely important question, not to be used as an excuse but to counter critics who come across as arrogant and knowing everything.

      4. Perhaps I wasn’t very clear in my post. The point is that those judging the quality of the data should be the ultimate end users, ie, those who rely on the data to make personal decisions regarding their safety. Outsiders who are removed from the local reality are obviously entitled to their opinions, but their criticisms are more egocentric.

      5. Yes, of course there are dual uses of Web 2.0 like any other technology. Cyberconflict & cyberterrorism is not new.

  2. I think the biggest problem for most critics regarding the validity of data in Ushahidi is that they think it is something that it is not. Is the purpose of Ushahidi to give 100% accurate information to the masses? Not exactly. Is the purpose of Ushahidi to provide information to directly affected people as fast as possible? You bet! Ushahidi needs to be treated more like Wikipedia than a traditional news agency in the sense that it’s a great way to quickly assess a situation without all the details.

    So, data validation is still important, regardless of how the system should be treated. Patrick, I think you covered an important base in that the administrators for any given Ushahidi installation are able to quickly and easily contact the reporter about anything that needs to be expanded upon or that seems suspect. This doesn’t necessarily have to be a phone call. The system allows admins to very easily respond to reports via text message directly from Ushahidi asking for more details, which doesn’t take any time at all. Because going back and forth to gather all of the little details can take some time in a crisis situation, reports can be approved to be displayed on the site but not verified. Users are able to see if a report has been verified as accurate by the administrators. For the people in life or death situations (like your Eastern DRC example), I can only imagine that they would very much appreciate unverified data.

    To give a simple response to the arguments raised regarding users who intend to use the tool for nefarious means, I say it’s worth the risk. Even with proper moderation and enough users in the system, there very well could be a few false reports. I don’t believe they would diminish the value of the information produced by Ushahidi much during the birth of a crisis.

    Used with a variety of sources, Ushahidi is a great additional resource for anyone, directly involved or not, to glean information about a crisis.

  3. Hi Patrick. Thanks for launching this discussion on crowdsourcing and information reliability. I recently joined Ushahidi as a volunteer, and I have thought lots on this myself. Allow me to elaborate.

    Part of the inspiration for the Ushahidi platform comes from WordPress, and I think that this analogue informs us in part of how we have to look at the Ushahidi platform: It’s a tool for “crowdsourcing crisis-information” just like WordPress is a tool for “blogging”. I think that as a tool, the Ushahidi platform might enable a new flow of information similar to what blogging platforms (WordPress, Movable Type, Blogger and others) have done for blogging. Just as the use of WordPress says nothing about the quality of a particular blog, the Ushahidi platform in itself can never ensure the reliability or relevance of information provided. This is always up to the administrator of the particular installation.

    Years back, there was lots of discussion (and skepticism) about the reliability of blogs versus traditional reporting: “Anyone can start a blog”, “What to do with anonymous blogs?”, “They have no journalism education.” “They might be biased.” “They might have undisclosed interests.” etc etc. Nowadays, many best practices have evolved in blogging. Yet, we also see much diversity in approaches, and while there still might be some convergence, I don’t expect this diversity to go away. The diversity might even grow bigger. We have personal blogs, corporate blogs, group blogs, anonymous blogs. Most bloggers enable comments from anyone, some only from registered users, some don’t allow them at all. Some show comments directly after they’re posted, some let them be added to a moderation queue. Some actively police comments which they deem to be spam, offensive, or misleading.

    The key here is that we don’t have to agree on what actually would constitute best practice: We can let the network decide. We can let all users decide for themselves. Information sources which are deemed to be unreliable won’t be listened to much. On the other hand, information sources which are highly reliable but (as a trade-off) provide almost no information need to be complemented by other sources.

    When it comes to information, not everyone has the exact same needs. The “peer-to-peer” model of blogging fits perfectly with that. Anyone can subscribe to any combination of sources that fit their information needs best. Anyone can become a source of information – a blogger – themselves.

    I hope we’ll see the same thing with Ushahidi instances. In fact, I hope that Ushahidi instances will be somewhat like real-time, automated blogs of structured information, and in particular data of events purportedly taking place (although this is still quite general, depending how you look at it). It’d be wonderful if Ushahidi installs could feed off each other in a way similar to how bloggers use their own blog subscriptions. The notable difference is that – since we’re talking about structured data – the passing on of data between Ushahidi installs can be automated in part. However, the human element in filtering information will be of immense importance.

    We could have many different “admins” run their own Ushahidi install, each trying to serve their “audience” the best, just as we see with blogs. This would work best if people could have access to as much unfiltered data as possible. I think in general, it would be good that at least one Ushahidi install would display virtually all incoming data. But it could be that no end-user would ever want to see that unfiltered stream of incoming reports. It may be far too noisy.

    If there’s a promise of privacy when using the specific mobile phone number, of course this promise must be honored. Yet, no-one says every Ushahidi admin should promise to hold phone numbers (or encoded hashes) private. If there’s both an anonymous and a non-anonymous reporting option, mobile reporters can choose themselves. Non-anonymous reporting can be more easily trusted at first, and also more easily verified (anyone who sees the report could contact the reporter now, instead of only the admin of the Ushahidi install).

    In short, what I hope for is lots of diversity, and lots of different places to get your information from. Empowering everyone with powerful crowdsourcing and information management tools is the exact way to achieve that.

  4. @Meryn Stol

    Why is it that the Ushahidi folks haven’t created the user option of being anonymous or not? It does give the content a level of credibility as you say.

  5. @Patrick

    Are you on the Ushahidi Dev team? I hope we get to meet sometime.

  6. Taylor, I don’t think there is a why. It just hasn’t been implemented yet. 🙂 As soon as people want to use the Ushahidi platform for non-violence settings, where people’s lives wouldn’t potentially be in danger for sending in reports, the option for non-anonymous reporting will be asked for often, I’m sure.

    There’s already lots of non-anonymous reporting happening on Twitter. Some people at Ushahidi are already looking at importing tweets on a specific event, based on hash-tags. Those non-anonymous tweets can have some of their trustworthiness inferred, particularly if the person tweeting has already built up some reputation.

    In countries where Twitter is already popular, I expect that Ushahidi will lean heavily on Twitter. People are already used to reporting on Twitter. The Ushahidi engine could add value by aggregating this, creating visualizations (map, timeline), and by helping users to determine the credibility of reports.

