The Promising Rise of Big Data

Scientists have too much info to sort. They're asking your help. Latest in our Citizen Science series.

Brent Boles, Brendan McConnell and Lola Fakinlede, 3 May 2013, TheTyee.ca

Brent Boles, Brendan McConnell and Lola Fakinlede, journalism students at the University of Western Ontario, are part of a team producing this Citizen Science series for co-publication by The Tyee and Rabble.

Stephen Brabin is very good at folding proteins. He twists and contorts their structures, bending them to his will. He does this not for a living, but for fun.

Brabin, a citizen scientist, is one of the top ranked players of Foldit, an online computer puzzle game that asks users to manipulate virtual proteins.

It might be a game for Brabin, but for the people at Foldit it means real-world medical advances. The results from the game could help develop AIDS medication, among other breakthroughs.

And Brabin's solution is just one of roughly 500,000 that the people behind Foldit will have to wade through. This massive amount of information is 'big data'.

Big Data?

Big data refers to the vast and unprecedented amounts of data that scientists now have access to because of advancements in scientific technologies. In addition, citizen scientists are collecting large amounts of data every day using communications technologies such as smartphones.

Human and natural activities are also generating a lot of information. Everything from cell phone signals and scientific experiments to social media and many other processes is adding to big data.

Andrea Wiggins, of the data archiving company DataOne, describes big data as "a general set of practices and conditions around having way more data than we've ever had before."

But it's not just that we have more data now, she said. We also have a lot more access to it. And "more data is more knowledge," she added.

In cancer research for example, sophisticated technology such as genetic sequencing allows scientists to analyze tumour samples faster and in much more detail than in the past.

However, this generates much more data than researchers can handle.

When the data is analyzed thoroughly, scientists are able to make new and groundbreaking discoveries that would not have been possible without big data. They can also more easily identify trends and gain insights about what the information is saying. And with the help of citizen scientists, they can now do it much more quickly.

Galaxy Zoo is one way the public is helping scientists analyze their big data. It's an interactive website that displays pictures of galaxies and asks questions about shape, size and colour. This allows scientists to create databases of accurately classified galaxies for their research.

With millions and millions of galaxies in a single data set, scientists have to rely on the help of citizen scientists to comb through and identify them one at a time.

Why humans and not machines?

Advancements in scientific technology have made it possible to capture much more data than we can even imagine. But if we have the technology to collect it, why can't we use that same technology to analyze it?

Some scientists have looked at a number of methods -- such as automated algorithms -- to see if they can train a computer to analyze big data faster.

However, computer algorithms simply don't have the same ability to recognize patterns as the human eye, said Dr. Joanna Owens of Cancer Research UK.

A lot of the data that researchers are looking at contains subtle changes and shifts in patterns, or differences in colour. The human eye is great at distinguishing these, but it is difficult to train a computer to do so accurately.

"We believe there's a lot of potential for getting the collective eye of the public to help us analyze research faster and maybe more effectively," said Owens.

Cancer Research UK has an interactive website called Cell Slider which is similar to Galaxy Zoo. But, instead of analyzing galaxies, here citizens help to examine cancer cells.

By simply spotting how many cancer cells are in a sample, how many of them are stained yellow and what proportion of those are very bright, citizens help scientists test how successful a breast cancer treatment has been.

"So we're getting data back that is helping us link what we're seeing under the microscope with the outcome of a woman's treatment for breast cancer," said Owens.

Interactive websites like Cell Slider have easy-to-understand tutorials.

And because the process is structured in a way that harnesses the human brain's natural ability to recognize patterns, the websites are easy to work with, said Kevin Schawinski, founder of Galaxy Zoo.

Getting it right

With so much data coming in, and a lack of experience among those working with it, some people may be concerned about the accuracy of the projects.

In terms of collection, it's about understanding the potential sources of errors, said Wiggins. There are two methods to ensure accuracy: quality assurance and quality control.

Quality assurance is a procedure that happens before the data is collected, said Wiggins.

"It's a specific protocol we follow. We control the data entry fields so that bogus data can't be entered."

She said that citizens follow procedures that produce useful information.

Quality control happens once the data is collected. It is a method that throws out any data that has too many errors or inconsistencies.
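The two steps can be pictured with a short sketch. It is only an illustration of the idea Wiggins describes, not the code behind any real project: the field names, the allowed shapes and the 20 per cent error threshold are all assumptions made up for the example.

```python
# Hypothetical answer choices; a real project defines its own entry fields.
ALLOWED_SHAPES = ("spiral", "elliptical", "irregular")


def make_entry(shape, bright_fraction):
    """Quality assurance: refuse bogus values before they are recorded."""
    if shape not in ALLOWED_SHAPES:
        raise ValueError(f"shape must be one of {ALLOWED_SHAPES}")
    if not 0.0 <= bright_fraction <= 1.0:
        raise ValueError("bright_fraction must be between 0 and 1")
    return {"shape": shape, "bright_fraction": bright_fraction}


def quality_control(batches, max_error_rate=0.2):
    """Quality control: after collection, drop any batch of entries in which
    too large a share has been flagged as erroneous or inconsistent."""
    kept = []
    for batch in batches:
        if not batch:
            continue  # nothing to keep from an empty batch
        flagged = sum(1 for entry in batch if entry.get("flagged", False))
        if flagged / len(batch) <= max_error_rate:
            kept.append(batch)
    return kept
```

In this sketch a call such as make_entry("banana", 0.5) is rejected at the door, while quality_control() later discards whole batches that turned out to be unreliable.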

After the data is collected, a new challenge arises.

When it comes to data analysis, "repeat observation is like a gold mine," said Wiggins.

For instance, with Galaxy Zoo there are so many people analyzing the same data, the probability of getting an accurate result is increased, explained Schawinski.

"One thing we've discovered is that because we have 20 to 40 people independently look at each galaxy, the Galaxy Zoo classifications are very accurate," added Karen Masters, a professional astronomer.
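A rough sketch shows why repeat observation pays off. It is not Galaxy Zoo's actual method: it assumes a simple majority vote, and the 60 per cent agreement threshold and the assumption that volunteers answer independently with the same accuracy are simplifications chosen for illustration.

```python
from collections import Counter
from math import comb


def aggregate_votes(votes, min_agreement=0.6):
    """Return (label, agreement) for the most common answer.

    The label is None when too few volunteers agree on it.
    """
    if not votes:
        return None, 0.0
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label if agreement >= min_agreement else None), agreement


def majority_correct_probability(n, p):
    """Chance that a strict majority of n independent volunteers is right,
    if each one is right with probability p (binomial model)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )
```

Under these assumptions, three volunteers who are each right 70 per cent of the time produce a correct majority about 78 per cent of the time, and the figure keeps climbing as more people look at the same galaxy.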

Big Data, simplified

Only trained professionals have the background necessary to understand scientific data. In order to engage the general public and get useful information in return, developers have to find ways to simplify complex content.

Before launching Cell Slider, the scientists conducted some surveys to determine how the presentation of their data affected people's tendency to come back and play, and how comfortable people felt about looking at cancer cells. They inverted the colours of the cells and added some abstract imagery to make the pictures more beautiful.

"The big challenge for us was getting people to return to science. We wanted to make it exciting and engaging enough so that people keep coming back," said Owens.

And in the case of Foldit, developers turned the complex process of protein manipulation into a video game.

"One of the big issues was basically showing the players only the required information," explained Tamir Husain, one of the game's developers.

The team stripped protein folding down to its core. In the game, proteins appear as 3D cartoon images users can manipulate in space. The problem areas are represented as red balls with spikes called clashes.

To a scientist, the clash represents a complex scientific problem, but to a video game player it's simply a hurdle that needs to be overcome before getting the high score.

Big results

So what happens after you're done folding proteins and classifying galaxies? There are forums and research boards where citizens can get more information about the research they are helping with.

There, they can ask researchers questions and discuss the project with their fellow citizen scientists. It's an opportunity for citizen scientists to stay involved with the project beyond the games.

One of the biggest breakthroughs in terms of citizen science and big data came in 2011, when Foldit players solved the Mason-Pfizer Monkey Virus puzzle.

Scientists had been agonizing over the protein's structure for years, but by crowdsourcing the data to players, they had their solution in a matter of days. Their discovery could have major implications for developing AIDS medication.

In 2007, a citizen scientist stumbled across a unique and potentially groundbreaking astronomical object hidden in a Galaxy Zoo image.

Schawinski said that it's an object that scientists have never seen before and they are still trying to work out exactly what it is.

This is an example of a discovery that likely would have been overlooked by a computer algorithm, and it underscores the benefit of using the human eye as a means of processing big data.

Other scientists credit citizens with cutting their research time in half and, in some cases like Cell Slider, by much more.

"We have sped up the time it would take to carry out that data from 18 months to three months -- freeing up our scientists to carry out more research," said Owens.

Looking ahead

Scientists say that this is just the beginning of an exciting collaboration with the public. "I believe that it's an area that has huge potential. We are already looking at other types of data that we might be able to ask the public to help us analyze," said Owens.

Schawinski believes the value of using citizen scientists to process big data is something that is going to evolve and become more prominent in the future. And he says the way it will be done will likely change. "Humans and machines are going to collaborate more," said Schawinski.

He expects to see platforms with machines that can decide which objects need to be looked at closely by humans. In turn, humans will make sure that the machines are working sensibly and that nothing odd or unique is missed, he said.
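One way to picture that division of labour is a simple triage step. This is a hypothetical sketch, not a description of any existing platform: it assumes the machine attaches a confidence score between 0 and 1 to each of its classifications, and the 0.9 cut-off is an arbitrary choice.

```python
def triage(predictions, threshold=0.9):
    """Split machine classifications into those accepted automatically
    and those sent to human volunteers for a closer look.

    predictions: iterable of (item_id, label, confidence) tuples.
    """
    accepted, needs_human = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            accepted.append((item_id, label))
        else:
            needs_human.append(item_id)  # odd or uncertain objects go to people
    return accepted, needs_human
```

Humans could also be shown a random sample of the automatically accepted items, which is one way of checking that the machine is working sensibly.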

For now, the human mind remains a unique tool for dissecting information. Computers have immense power, but they lack pattern recognition skills that humans are very good at.

"It is one of our goals to try and capture the intuition that players have and turn that back into algorithms that can be run on computers," said developer Jeff Flatten of Foldit.

In the meantime, the data continues to pile up and scientists are relying on normal citizens like Stephen Brabin to keep helping them out.  [Tyee]
