Crowdsourcing chemistry: Gift cards serve as incentive in Kaggle competition for data-driven discoveries

Machine learners from across the globe accept the challenge to develop models to predict molecular photostability
Amazon gift cards were used as incentive for an online competition to develop machine learning methods for predicting the photostability of nine unreported molecules.

Inspired by their own recent chemical discoveries for solar energy development using machine learning, a research team at the University of Illinois Urbana-Champaign was curious what other data-driven chemical discoveries might be possible if they could draw on a broader range of machine learning approaches.

The team, which included chemistry professors Martin D. Burke and Nicholas E. Jackson, chemical and biomolecular engineering professors Charles M. Schroeder and Ying Diao, and recent chemistry PhD graduate David Friday, came up with a unique crowdsourcing idea.

With three $50 Amazon gift cards as incentive for top finishers, the Illinois researchers sponsored a competition on Kaggle – an online community of more than 24 million machine learners – and challenged them to develop machine learning methods for predicting the photostability of nine unreported molecules that were synthesized by the research team just for this contest. 

“With access to this huge community on Kaggle, we thought, ‘If we want new ideas, why not just ask the whole community?’” said Friday, who took the lead in organizing the competition.

Burke said it was the first Kaggle competition to interface AI and small molecule discovery. And, Jackson added, it provided a new paradigm for community engagement, because the team synthesized a “test set” of molecules specifically for the broader, crowdsourced community to engage with in the competition.

From left: Martin D. Burke, Nicholas E. Jackson, and David Friday

“This means that the public is not just following up on our work, but is actively engaged in new research,” Jackson said.

The contest drew 500 entrants from all over the globe, 174 of whom completed the challenge and submitted an official entry. Each participant could submit up to five entries, for a total of 729 submissions, each containing a unique method for solving the problem. The youngest participant, an 18-year-old in India, landed among the top 20 finishers. Participants’ backgrounds ranged from no chemistry training to expertise in both chemistry and computer science.

“By posing our chemistry problem as a data science problem, we captured a wide community and engaged with a very broad audience, most of whom would never have read our original paper,” said Friday.

Friday said he had hoped for just 20 submissions, so the researchers were thrilled with the level of participation from around the world.

“The feedback we got (from participants) was this was a really interesting data set, and this was a challenge they hadn’t seen before and was far more open ended and unique,” Friday said. “The hope is that this might inspire new ways for chemists and computer scientists to collaborate on solving real chemistry research challenges.”

The top machine learning models generated in the contest can be employed to predict the photostability of molecules, which is critical for developing commercially viable organic solar cells. But Friday said these models could also be useful for predicting other valuable properties of small molecules, such as the ability to penetrate the blood-brain barrier or reactivity in a particular coupling reaction.

“The methods generated are easily transferable to a variety of tasks,” Friday said. “Accessing the creativity of the millions of people on Kaggle was the key to generating these useful models.”

The idea for the competition stemmed from a recent research project led by Burke, Jackson, Schroeder, and Diao that revealed some key chemistry principles for improving molecules for the development of organic photovoltaics. Commercialization of organic solar cells has been hindered by problems with photostability because many materials degrade when exposed to light. 

As the researchers explained to Kaggle participants, a common challenge in experimental chemistry is trying to infer causality from small datasets. Chemists synthesize molecules, measure an important property, and then try to infer which molecular features are important in determining that property. In their recent project, published in Nature, the Illinois researchers found that a purely data-driven ML approach to discovering and enhancing molecular photostability for organic solar cells could identify important molecular features that experts had not previously recognized and could generate a regression model for photostability that was validated on new molecules. Their research produced light-harvesting molecules four times more stable than the starting point and revealed crucial new insights into what makes them stable.
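For readers who want a feel for that workflow, the following is a minimal Python sketch of the general idea: fit a regression model on a small set of measured molecules, rank the features by how strongly the model relies on them, and check the predictions on nine held-out molecules. It is not the Illinois team's method or data. The molecules are synthetic random numbers, the feature names are hypothetical, and ridge regression fitted by gradient descent is simply one convenient stand-in for a regression model.

```python
import math
import random

# Hypothetical descriptor names; the real study's features are not reproduced here.
FEATURE_NAMES = ["feature_a", "feature_b", "feature_c"]

def make_molecules(n, seed):
    """Generate n synthetic 'molecules': feature vectors plus a noisy target."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in FEATURE_NAMES] for _ in range(n)]
    # Assumption for illustration only: feature_a alone drives the target.
    y = [2.0 * row[0] + rng.gauss(0, 0.1) for row in X]
    return X, y

def predict(w, b, X):
    """Linear prediction w . x + b for every row of X."""
    return [sum(wj * xj for wj, xj in zip(w, row)) + b for row in X]

def fit_ridge(X, y, lam=0.01, lr=0.05, steps=2000):
    """Minimize mean squared error + lam * ||w||^2 by plain gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        errors = [p - t for p, t in zip(predict(w, b, X), y)]
        grad_w = [
            2 * sum(e * row[j] for e, row in zip(errors, X)) / n + 2 * lam * w[j]
            for j in range(d)
        ]
        grad_b = 2 * sum(errors) / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# A small training set, and nine held-out molecules that play the role of a test set.
X_train, y_train = make_molecules(30, seed=0)
X_test, y_test = make_molecules(9, seed=1)

w, b = fit_ridge(X_train, y_train)

# Rank features by the magnitude of their learned weights.
ranked_features = [
    FEATURE_NAMES[j]
    for j in sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
]

# Validate on the held-out molecules with root-mean-square error.
test_predictions = predict(w, b, X_test)
rmse = math.sqrt(
    sum((p - t) ** 2 for p, t in zip(test_predictions, y_test)) / len(y_test)
)

print("Features ranked by importance:", ranked_features)
print("Held-out RMSE:", round(rmse, 3))
```

Because the synthetic target was built from a single feature, the ranking step should recover that feature. With real measurements and only a few dozen molecules, a ranking like this is a hypothesis to test in the lab rather than proof of causality, which is the difficulty the researchers describe.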

“By using machine learning models, we could predict which molecules would degrade the fastest, which has been a very difficult prediction task,” Friday explained. “Our ML models found one very unique molecular feature that hadn't been identified before, and then experimentally demonstrated its importance in enhancing photostability. We were excited to see that machine learning, without any human biasing, had identified this new feature that was important for understanding photostability.”

In post-project discussions, the researchers wondered whether other machine learning methods might outperform theirs, or might differ from it yet still prove useful in other molecular design and development scenarios.

“The Kaggle idea was very much a synthesis of Marty (Burke) wanting to push the chemical boundaries and Nick’s (Jackson) expertise in the machine learning world and knowledge of what resources are available. When we identified the opportunity, I ran with it,” said Friday, who added that he could have spent years trying to replicate what this worldwide competition did in one month. “This isn’t normal research. It’s crowdsourcing. But it worked.”

In addition to providing contestants with data from their research paper, the team selected nine new molecules, synthesized them, calculated their features, and made the full molecular dataset available to the competitors. The competitors then used their data science skills and resources within the Kaggle community to develop machine learning models that predicted properties of the new molecules, and those predictions were compared with measurements made in the lab.

The Illinois researchers are now working on a journal paper about their Kaggle competition and hope this approach will be employed more in the future.

“This contest tested a lot of ideas in a short amount of time,” Friday said. “And gave people direct access to solving cutting-edge chemical problems and being part of the solution to forward progress in chemical research.”

News Source

Tracy Crane, Department of Chemistry
