Computational Complexity and the Data Quality State

At the end of last week, I afforded myself the treat of attending the second annual Wyoming Biotech Conference. This event is seriously off-topic for me, having nothing to do with my start-up software company’s current focus on building mobile apps for field data collection and problem reporting. But it’s a great thing to get out of the office and hear directly from people going after Big, Hairy, Audacious Goals, especially in fields far away from one’s own. And, to be honest, Wyoming operates a bit like a small company, with our academics and government employees wearing many hats, so even at an ‘off topic’ conference, one can do a lot of useful networking.

The speakers were uniformly excellent and very accessible for follow-up questions during the sessions and breaks. Jay Stender and Forward Sheridan are to be congratulated on pulling together, for the second year in a row, a remarkable event.

What I found the most interesting was that three of the five speakers explicitly addressed a theme I’ve been picking up on lately in my own, strictly amateur, reading about science: computational complexity is getting to be a limiting factor, both in terms of generating original results, and, perhaps more importantly, in terms of practitioners being able to evaluate and depend on each other’s results.

Biologists now have the DNA sequencing firepower to generate the complete genomes not just for a single exemplar of a species, as was the goal of the Human Genome Project, but for multiple individuals within that species. In theory, this should allow us to better understand both everyday genetic variation and harmful/helpful mutations. But the amount of data being produced is researcher-boggling and presents enormous problems in terms of analysis, data storage, data transmission, and quality assurance.
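To get a feel for the scale, here is some back-of-envelope arithmetic in Python. The coverage depth, byte counts, and study size are my own rough assumptions, not figures from any speaker, but the order of magnitude is the point:

# Back-of-envelope arithmetic, not a benchmark: why sequencing many
# individuals quickly becomes a storage and transmission problem.
GENOME_BASES = 3_000_000_000   # roughly 3 billion base pairs in a human genome
COVERAGE = 30                  # an often-quoted sequencing depth (assumption)
BYTES_PER_BASE = 2             # uncompressed reads: ~1 byte per base plus ~1 byte for its quality score (assumption)
INDIVIDUALS = 1_000            # a modest multi-individual study (assumption)

raw_bytes_per_genome = GENOME_BASES * COVERAGE * BYTES_PER_BASE
total_terabytes = raw_bytes_per_genome * INDIVIDUALS / 1e12

print(f"Raw reads per genome: roughly {raw_bytes_per_genome / 1e9:.0f} GB uncompressed")
print(f"For {INDIVIDUALS:,} individuals: roughly {total_terabytes:,.0f} TB before any analysis")

Even with generous rounding, that is getting on for two hundred terabytes of raw reads to store, move, and quality-check before a single scientific question gets answered.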

Harvey Blackburn, Coordinator of the National Animal Germplasm Program, talked a bit about how he expects research funding to shift over the next several years. Historically, the funding has primarily gone to projects that develop techniques for generating data, such as improved DNA sequencing. In the future, he expects more and more of it to go to projects focused on analysis, so we can make sense of the rivers of raw data we are now capable of producing.

In my casual reading, I keep coming across issues in quality assurance. Both within their own labs and when reviewing the work of others, it is getting increasingly difficult for researchers to be sure that the computational techniques they apply to their data are giving them answers they can depend upon.

An article in a recent issue of The Economist, about cancer research results that were retracted well after human trials based on them had begun, included an all-too-typical sentence: “the internal committees responsible for protecting patients and overseeing clinical trials lacked the expertise to review the complex, statistics-heavy methods and data produced by experiments involving gene expression.”

fMRI techniques are now commonly used by neuroscientists to locate processes within the brain by measuring the changes in blood oxygen levels triggered by specific stimuli. The analysis relies on an enormous amount of post-collection data processing for everything from filtering out the noise caused by head movements to normalizing the data from brains of different sizes and shapes. As an NPR piece once put it, “it takes a whole lot of computer processing and human judgment to get from oxygen levels to a snapshot of love in the brain.” A researcher from Dartmouth College once illustrated the problem by using fMRI scans of a dead fish to generate ‘results’ indicating the fish had a strong neural reaction to being shown emotionally charged photographs.
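The dead-fish result is, at bottom, a multiple-comparisons problem: run a significance test on enough voxels and pure noise will clear an uncorrected threshold somewhere. The little Python sketch below is my own toy illustration, with made-up voxel and scan counts and nothing like a real fMRI pipeline, but it shows the shape of the trap:

# Toy illustration of the multiple-comparisons trap: thousands of tests on
# pure noise still produce 'significant' voxels unless the threshold is corrected.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_voxels = 50_000   # pretend each voxel is tested independently (made-up number)
n_scans = 20        # measurements per voxel -- pure noise, no real signal anywhere
alpha = 0.05

data = rng.normal(loc=0.0, scale=1.0, size=(n_voxels, n_scans))

# One-sample t-test per voxel against a true mean of zero.
t_vals, p_vals = stats.ttest_1samp(data, popmean=0.0, axis=1)

print("'Active' voxels, uncorrected:", np.sum(p_vals < alpha))                       # expect ~2,500
print("'Active' voxels, Bonferroni-corrected:", np.sum(p_vals < alpha / n_voxels))   # expect 0

With no correction, roughly five percent of the noise-only voxels light up as ‘activity’; with even a crude Bonferroni correction, none do. Real pipelines are far more sophisticated than this, which is exactly why reviewers need the statistical chops to evaluate them.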

Physics isn’t any better off. Everything from the search for exoplanets to the identification of elementary particles now depends on enormous amounts of computation and statistical analysis that struggle to keep up with the vast quantities of data generated by instruments and experiments.

Back at our conference, the consensus was that there will be a huge need over the coming decades, in all sorts of scientific disciplines, for practitioners who can marry solid domain knowledge with either the mathematical skills to develop new algorithms for analysis or the computational skills to implement those algorithms and ensure we can depend on the results or, ideally, both. Statistics, numerical analysis, and pattern matching have always seemed pretty dry and stodgy to me. No more. Any of them would be a great place to be working right now if you are a young person in college or grad school.

And, for Wyoming, which has not one but two new supercomputer projects under construction, there is huge opportunity for becoming one of the places where big chunks of that raw data get turned into dependable, verifiable results. If we get good enough at it, perhaps we could come to be known, among researchers, not as The Equality State but, rather, The Data Quality State.
