Predictive modelling in the cloud with scikit-learn and IPython

Speaker: Olivier Grisel, INRIA Saclay


IPython, with its notebook interface, is an interactive programming environment that is particularly well suited to data exploration, modelling, and sharing of analysis results, notably via

Scikit-learn is a versatile Machine Learning library for Python that blends well with the NumPy and SciPy ecosystem. It is used by a growing user base of academic researchers as well as data scientists and engineers in the tech industry.

Together, the two projects offer a productive environment for building and evaluating predictive models from data. In particular, IPython's distributed computing capabilities make it possible to offload computationally intensive Machine Learning tasks to clusters of tens or hundreds of nodes without breaking the interactive experience.

The goal of the presentation is to showcase how to set up an ad hoc data modelling environment using a cluster provisioned in a public cloud, and to use it to perform common predictive modelling operations such as:
- cross-validated model assessment and automated search for the best parameters for common feature extraction and machine learning algorithms,
- parallel training of out-of-core text classification models for sentiment analysis,
- parallel training of large randomized ensembles of decision trees (a.k.a. Random Forests).
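
As a rough, single-machine illustration of the three operations listed above, here is a minimal sketch using scikit-learn only. It is not the material of the talk: the toy corpus, the parameter grid, the batch size and the helper iter_minibatches are all hypothetical, and the cluster distribution through IPython is left out entirely.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy sentiment corpus (1 = positive, 0 = negative).
docs = [
    "great movie", "terrible movie", "loved the plot", "hated the plot",
    "wonderful acting", "awful acting", "really enjoyable", "really boring",
    "a great success", "a terrible failure", "loved every minute",
    "hated every minute",
]
labels = [1, 0] * 6

# 1. Cross-validated search over feature extraction and model parameters.
pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-2],
}
# Setting n_jobs on GridSearchCV would spread the fits over local cores.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)
print("best parameters:", search.best_params_)


# 2. Out-of-core text classification: a stateless hashing vectorizer plus
#    incremental learning, so the full corpus never has to sit in memory.
def iter_minibatches(texts, targets, size):
    """Yield successive (texts, targets) chunks; hypothetical helper."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size], targets[start:start + size]


hasher = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)
online_clf = SGDClassifier(random_state=0)
for batch_docs, batch_labels in iter_minibatches(docs, labels, size=4):
    X_batch = hasher.transform(batch_docs)
    # All class labels must be declared, since a batch may not contain them all.
    online_clf.partial_fit(X_batch, batch_labels, classes=[0, 1])
predictions = online_clf.predict(hasher.transform(["great acting", "boring plot"]))

# 3. Randomized ensemble of decision trees, with the trees fitted in parallel
#    on all local cores (n_jobs=-1). Synthetic data stands in for a real set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
forest.fit(X, y)
print("trees in the forest:", len(forest.estimators_))
```

On a real cluster, each of these steps would be dispatched to remote engines rather than to local cores; the sketch only shows the scikit-learn side of the workflow.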