The cure for cancer may already be in our hands. It’s possible that somewhere in the zettabytes of data amassed and archived around the world hides a key sequence of events and outcomes relating to malignant cell growth. That critical insight, however, may be unlinked and out of context, beyond the grasp of the hundreds of thousands of researchers looking for it.
Welcome to the era of big data, in which everything we know is at our fingertips. Unfortunately, that means answers can be clouded by the way we’re asking the questions. Today, the doctor who leads the breakthrough against cancer is just as likely to have a Ph.D. in computer science or big data analytics as in medicine.
Dr. Carolyn McGregor, for example, is a computer scientist and Canada Research Chair in Health Informatics at the University of Ontario Institute of Technology in Oshawa, just east of Toronto. Part of her focus is unlocking the mysteries around premature babies’ survival rates.
What first intrigued her was the sheer volume of data being captured from each child. Approximately 3,600 points of hard data were generated, but treatment often came down to what the nurse on duty would write in the chart, based on observation.
A drop in respiration or heart rate or a spike in temperature was all but forgotten an hour later if the baby’s condition seemed normal. “The nurse or the doctor wouldn’t really look at the device, they would look at the baby,” she said. “If a baby gets sick, they really don’t know why or what happened before.”
The issue was not a lack of care or passion, she said. The volume of data is so great and the understanding of what it all means and how it connects is so limited that relying on human intuition and experience has been the only way to cope. And despite the stream of data collected, only the most recent has actually been captured; nothing has been archived.
Predicting which babies would develop infections or other difficulties is the challenge of premature care, and McGregor’s work with IBM and The Toronto Hospital for Sick Children is to make sense of the complex, to add context to the data and to begin to assemble a picture from the puzzle pieces to better predict outcomes.
There’s no shortage of data: babies are fitted with a plethora of sensors and cables, streaming 1,256 points of data every second. In a ward of 30 babies, that’s 90 million points of data a day. McGregor and her team started capturing the data and analyzing it, then writing queries against the database.
Little by little, they started to see patterns which now allow them to predict events 12 to 24 hours in advance, based on heart-rate patterns. As the project moves from the clinical phase to the practical, the goal is to scale the concept up, allowing data from remote locations to be analyzed centrally.
As in the hospital example, this world of big data is one we created. It exists from our smartphones to our laptops, from tablets to global server farms and out into the cloud.
We create data directly, in digital photos and spreadsheets, and indirectly, when purchasing an item kicks off data streams in inventory and customer management systems, and in credit card databases. And we’re creating more of it every second of every day.
Last spring, IDC said the data-management market will grow from US$3.2 billion in 2010 to US$16.9 billion in 2015, a compound annual growth rate of approximately 40 percent, making data the biggest sector of the ICT market.
By then, more than three billion of us will be online, generating nearly eight zettabytes of data a year.
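The growth rate quoted for the IDC forecast can be checked with the standard compound annual growth rate formula. The sketch below is only an illustration of that arithmetic: the function name `cagr` is made up for this example, and the five-year span assumes the forecast runs from 2010 to 2015 as stated.

```python
def cagr(start_value, end_value, years):
    """Return the compound annual growth rate as a fraction (0.40 means 40 percent)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the IDC forecast, in billions of US dollars.
rate = cagr(3.2, 16.9, 2015 - 2010)
print(f"{rate:.1%}")  # a little under 40 percent
```

The result lands just below 40 percent, which matches the "approximately 40 percent" in the forecast.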
The bad news is that much of this data is untagged and unorganized, all but lost in cyberspace. The paradigm shift in big data, however, is in the combination of hardware and software to get a handle on all that archived data and, at the same time, tap into the raging torrent of incoming data.
First, mediation or middleware software sits on top of the data and “talks” to existing or legacy databases, regardless of formats and search queries. That alone isn’t good enough because hard drives are too slow. What’s needed is a way to access data instantly, and the dropping price and rising capacity of solid-state memory are enabling just that.
This is the so-called “in-memory” process that shifts data to a solid-state medium, just like RAM on your PC. In-memory processing makes real-time data mining on the fly possible, which is exactly what Dr. McGregor and her team are doing, and large-scale enterprises are starting to do as well.
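To make the in-memory idea concrete, here is a minimal sketch of what checking a stream against recent readings held in memory might look like. It is not the method used by McGregor’s team or by any vendor named here: the function name `flag_drops`, the window of five readings, the 20 percent threshold and the sample heart rates are all assumptions chosen for illustration.

```python
from collections import deque


def flag_drops(readings, window_size=5, drop_fraction=0.2):
    """Return the positions of readings that fall well below the recent average.

    Only the last `window_size` readings are kept in memory, so each new
    value is compared without going back to a database on disk.
    """
    window = deque(maxlen=window_size)  # oldest reading is discarded automatically
    flagged = []
    for index, value in enumerate(readings):
        if len(window) == window_size:
            recent_average = sum(window) / window_size
            if value < recent_average * (1 - drop_fraction):
                flagged.append(index)
        window.append(value)
    return flagged


# Hypothetical heart-rate readings, in beats per minute.
heart_rates = [150, 152, 149, 151, 150, 148, 110, 149, 151]
print(flag_drops(heart_rates))  # [6]
```

Because the window lives in memory, each incoming value is judged the moment it arrives, which is the property that separates real-time analysis from querying an archive after the fact.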
Scott Camarotti, vice-president and country manager, Software AG Canada, said big data analytics brings big challenges.
“There are four simultaneous challenges for business,” Camarotti said. “One, there’s a massive amount of data and it’s only going to grow; two, the velocity of the data means [companies] have to increase their ability to read, write and update (from it); third, a lot of that data is unstructured; and finally, they have to be able to extract the value for the business.”
Luckily, vendors are creating new techniques. A case in point, he said, is Visa. It used to take up to 45 minutes to identify fraudulent transactions. With big data in memory — Camarotti calls it Big Memory — and Software AG’s applications, that was reduced to four seconds. “That’s measurable success. You can identify the fraud while there is still some possibility of enforcement at the point of sale.”
Similarly, Goldman Sachs was struggling to ratify trades within a four-hour window, he said. After it deployed solid-state memory to handle 500GB at a time, there was a 300 percent improvement in processing time. The sector went from an unstable part of the business to a high performer, opening doors for more discussions around other applications.
Selling yet another IT solution in the midst of a lingering recession, however, is sometimes a challenge, said Wayne Ingram, managing director, technology at consultancy Accenture in Canada. “[Buyers] fall into two camps. The IT camp gets the technology and what it can do,” he said. The non-IT group is more skeptical, having been sold on technology for years that doesn’t always perform as promised.
Still, he said, the components are mostly in place and the cost is incremental since the hardware and software are relatively cheap compared to the cost of large-scale server farms. The payoff from real-time data mining makes the pursuit worthwhile because “there’s gold in that data. I worked in the utility sector and we saw it in action there,” he said.
“You’d have a generator fail on a system but, instead of just replacing the generator, we knew from the data there would be a series of related failures of other components so, while we’re out there, we replaced those.”
But it all comes down to the question and how you ask. And somewhere, that cure for cancer is waiting for the right question.