Opening Doors to Transsexual Medical Research
TS-Si is dedicated to the acceptance, medical treatment, and legal protection of individuals correcting the misalignment of their brains and their anatomical sex, while supporting their transition into society as hormonally reconstituted and surgically corrected citizens.
MINE/MIC Tool Detects Novel Patterns In Large Data Sets
SciMed - Horizons
TS-Si News Service
Saturday, 17 December 2011 16:00

Cambridge, MA, USA. A new software suite implements maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying interesting relationships between pairs of variables in large data sets. The tool incorporates the maximal information coefficient (MIC), a measure of dependence for two-variable relationships, to identify known and novel relationships.

Researchers currently use advanced technology to gather big, complex data sets, which may be incredibly useful in enhancing system understanding if the vast amounts of data can be organized so that telling information may be extracted. Ordinarily, such data could take a person hundreds of years to analyze by eye. Sophisticated computer programs can search these data sets with great speed, but fall short when researchers attempt to even-handedly detect different kinds of patterns in large data collections.

What might we be missing in large datasets? If researchers printed on paper each potential relationship in a recent data set containing abundance levels of bacteria in the human gut, the stack of paper would reach to a height of 2.25 km (1.4 mi), 6 times the height of the Empire State Building. Image courtesy of Sigrid Knemeyer.

Investigators at the Broad Institute found that the tool can uncover patterns of multiple, recurring events or sets of data in large data sets, such as complex data on global health, changing gut bacteria, a season of major league baseball, and much more, and do it in a way that no other software program can.

"There are massive data sets that we want to explore, and within them, there may be many relationships that we want to understand," said Broad Institute associate member Pardis Sabeti, senior author of the paper and an assistant professor at the Center for Systems Biology at Harvard University. "The human eye is the best way to find these relationships," said Sabeti, "but these data sets are so vast that we can't do that. This toolkit gives us a way of mining the data to look for relationships."

The researchers tested their analytical toolkit on several large data sets, including one provided by Harvard colleague Peter Turnbaugh, who is interested in the trillions of microorganisms that live in the gut. The findings appear in the journal Science. Working with Turnbaugh, the research team harnessed MINE to make more than 22 million comparisons and narrowed in on a few hundred patterns of interest that had not been observed before.

"The goal of this statistic is to take data with a lot of different dimensions and many possible correlations and pick out the top ones," said Michael Mitzenmacher, a senior author of the paper and professor of computer science at Harvard University. "We view this as an exploration tool; it can find patterns and rank them in an equitable way."

One of the tool's greatest strengths is that it can detect a wide range of patterns and characterize them according to a number of different parameters a researcher might be interested in. Other statistical tools work well for searching for a specific pattern in a large data set, but cannot score and compare different kinds of possible relationships. MINE, which stands for Maximal Information-based Nonparametric Exploration, is able to analyze a broad spectrum of patterns.

"Standard methods will see one pattern as signal and others as noise," said David Reshef, a co-first author of the paper who is currently a graduate student in the Harvard-MIT Health Sciences and Technology program and also worked on this project as a graduate student in the department of statistics at the University of Oxford. "There can potentially be a variety of different types of relationships in a given data set. What's exciting about our method is that it looks for any type of clear structure within the data, attempting to find all of them."

Not only does MINE attempt to identify any pattern within the data, but it also attempts to do so with an eye toward capturing different types of patterns equally well. "This ability to search for patterns in an equitable way offers tremendous exploratory potential in terms of searching for patterns without having to know ahead of time what to search for," said David Reshef.

David Reshef and Yakir Reshef (brothers) developed the maximal information coefficient (MIC), a measure of dependence for two-variable relationships, to identify known and novel relationships, under the guidance of professors from Harvard University and the Broad Institute. Image courtesy of ChieYu Lin.

MINE is especially powerful in exploring data sets with relationships that may harbor more than one important pattern. As a proof of concept, the researchers applied MINE to social, economic, health, and political data from the World Health Organization (WHO) and its partners. When they compared the relationship between household income and female obesity, they found two contrasting trends in the data. Many countries follow a parabolic trend, with obesity rates rising with income but peaking and tapering off after income reaches a certain level. But in the Pacific Islands, where female obesity is a sign of status, countries follow a steep trend, with the rate of obesity climbing as income increases.

"Many data sets will contain these types of complicated relationships that are guided by multiple drivers," said Sabeti. MINE is able to identify these. "This greatly extends our capability to find interesting relationships in data."

Researchers can use MINE to generate new ideas and connections that no one has thought to look for before. "Our tool is a hypothesis generator," said Yakir Reshef, a co-first author of the paper and a graduate student at the Weizmann Institute of Science. "The standard paradigm is hypothesis-driven science, where you come up with a hypothesis based on your personal observations. But by exploring the data, you get ideas for hypotheses that would never have occurred to you otherwise."

In addition to testing the ability of the suite of tools to detect patterns in biological and health data, the researchers examined data collected from the 2008 baseball season. "One question that we thought would be particularly interesting would be to see what things were most strongly associated with salary," said David Reshef. The researchers generated a list of relationships, finding that the strongest associations with salary were hits, total bases, and an aggregate statistic that reflects how many runs a player generated for a team. "Given the stakes, baseball is so well documented. We're curious to see what can be done in this realm with tools like MINE."

Researchers from many different fields, including systems biology, computer science, statistics, and mathematics, all contributed to this project. "People are getting better at combining data from different sources, and in some ways, this project is in the spirit of that," said Yakir Reshef. "The project brought together authors from many disciplines. It symbolizes the kind of collaborations that we hope people will use this for in the future."

Funding
Funding for this work was provided by the Packard Foundation, Marshall Aid Commemoration Commission, National Science Foundation (NSF), European Research Council (ERC), and the National Institutes of Health (NIH).

Participation
Other authors who contributed to this work include Hilary Finucane, Sharon Grossman, Gilean McVean, and Eric Lander.

Citation
Detecting novel associations in large data sets. David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, Pardis C. Sabeti. Science 2011; 334(6062): 1518-1524. doi:10.1126/science.1205438

Abstract
Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
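
As a rough illustration of the idea in the abstract, the sketch below scores a pair of variables by laying grids over their scatterplot, computing the mutual information of each grid, normalizing it, and keeping the best score, then ranks every pair of variables by that score. It is a simplified stand-in, not the published MIC algorithm or the MINE software: it only tries equal-frequency bins rather than searching for the best grid placement, and the function names (`grid_dependence_score`, `rank_pairs`) and the `exponent=0.6` grid-size budget are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins of roughly equal size, by rank.

    Simplification: tied values may land in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def mutual_information(bx, by):
    """Mutual information, in bits, between two sequences of bin labels."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px = Counter(bx)
    py = Counter(by)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def grid_dependence_score(xs, ys, exponent=0.6):
    """Best normalized mutual information over small grids (assumed sketch).

    Tries every kx-by-ky equal-frequency grid with kx * ky <= n ** exponent
    and returns the largest value of MI / log2(min(kx, ky)), which lies
    between 0 and 1.
    """
    n = len(xs)
    budget = max(4, int(n ** exponent))
    best = 0.0
    for kx in range(2, budget // 2 + 1):
        bx = equal_frequency_bins(xs, kx)
        for ky in range(2, budget // kx + 1):
            by = equal_frequency_bins(ys, ky)
            score = mutual_information(bx, by) / math.log2(min(kx, ky))
            best = max(best, score)
    return best


def rank_pairs(columns):
    """Score every pair of named columns; return (score, a, b), best first."""
    scored = [
        (grid_dependence_score(columns[a], columns[b]), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    xs = [(i - 99.5) / 99.5 for i in range(200)]
    data = {
        "x": xs,
        "line": [2 * x + 1 for x in xs],      # linear in x
        "parabola": [x * x for x in xs],      # non-monotonic in x
        "noise": [rng.random() for _ in xs],  # unrelated to x
    }
    for score, a, b in rank_pairs(data):
        print(f"{a:>8} vs {b:<8} {score:.3f}")
```

In this toy data, the linear and parabolic relationships should both score near the top while pairs involving the noise column score low, which is the kind of even-handed ranking across different pattern shapes that the article describes. A plain correlation coefficient, by contrast, would rate the symmetric parabola close to zero.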
Last Updated on Saturday, 17 December 2011 16:10



The TS-Si News Service is a collaborative effort by TS-Si.org editors, contributors, and corresponding institutions. Sources can include the cited individuals and organizations, as well as TS-Si.org staff contributions. Articles and news reports do not necessarily convey official positions of TS-Si, its partners, or affiliates. We welcome your comments. Use the form below to leave a public comment or send private correspondence via the TS-Si Contact Page. We will not divulge any personal details or place you on a mailing list without your permission.
The TS-Si News Service and the TS-Si Research Service are collaborations of TS-Si officials, staff, contributors, and corresponding institutions. The contents do not necessarily convey official positions of TS-Si or its owners, participants, partners, or affiliates.