__Practical Data Science with Jupyter: Explore Data Cleaning, Pre-processing, Data Wrangling, Feature Engineering and Machine Learning.__

## Data Science with Jupyter:

The Python environment for data science: the richness of an open-source language lies in the packages developed by specialists, and Python is particularly well served in this field. It is sometimes said, only half-jokingly, that Python is the second-best language for every task, which makes it the best language overall. Python's flexibility means it can be approached very differently depending on whether one is a sysadmin, a web developer or a data scientist; it is this last profile that interests us here.

A data scientist must have many strings to their bow, and this is reflected in a data-science ecosystem that is quite fragmented. This fragmentation is not unique to Python: R offers even more packages than Python, where a number of standardized frameworks limit the splintering of the ecosystem. This proliferation is also a real opportunity, since it allows packages to specialize in a field where they are most effective, and package authors to dare to implement new methods, which is essential if the language is to keep pace with rapid changes in research and technology.

*(Figure: Python packages essential for the coursework and daily life of data scientists.)* The post from which the figure is taken summarizes most of the packages useful to a data scientist or an economist/sociologist. We limit ourselves here to those used daily.

**numpy**: numpy handles all matrix computation. Python itself is one of the slower languages, so fast numerical code is not written in Python but in C, C++ or even Fortran. This is the case for numpy, which is essential as soon as speed matters. The scipy package is an extension providing, among other things, statistical and optimization functions.

**pandas**: Above all, a good data scientist must be able to take hold of and manipulate data quickly. For this, pandas is unavoidable. It supports most data formats and, to be efficient, its critical paths are also implemented in compiled code.
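To illustrate why numpy's compiled implementation matters, here is a minimal sketch comparing a vectorized operation with its pure-Python equivalent (the array size is an arbitrary choice for the example):

```python
import numpy as np

# One million values; large enough that loop overhead becomes visible.
a = np.arange(1_000_000, dtype=np.float64)

# Pure-Python equivalent (slow): sum(x * 2.0 for x in a), which pays
# interpreter overhead on every element.
doubled = a * 2.0          # one C-level pass over the whole array
total = doubled.sum()      # the reduction also runs in compiled code

print(total)               # → 999999000000.0
```

The vectorized version expresses the computation as whole-array operations, which is the idiom numpy is built around.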
The package is fast when you use its pre-implemented methods on data of reasonable size relative to the available RAM; with large amounts of data, however, you must be more careful. As a rule of thumb, a dataset requires about three times as much memory as it occupies on disk.

**matplotlib and seaborn**: matplotlib has existed for some twenty years to provide Python with graphical capabilities. It is a very flexible package offering many features. In recent years, however, seaborn has emerged to simplify the creation of the standard charts of data analysis (histograms, bar charts, etc.). The success of seaborn does not eclipse matplotlib, since it is often necessary to fall back on matplotlib to finalize the customisation of a seaborn-produced graphic.

**scikit-learn**: scikit-learn is the most popular machine-learning module, for three reasons: it relies on an extremely consistent API (the fit, transform and predict methods, used respectively to learn from data, apply transformations and predict on new data); it makes it possible to build reproducible analyses through data pipelines; and its documentation is a model to follow. INRIA, a French research institute, is one of the driving forces behind the creation and maintenance of scikit-learn.
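The memory caveat above can be checked directly: pandas reports its in-memory footprint via `DataFrame.memory_usage`. A minimal sketch, using a small in-memory CSV to stand in for a file on disk:

```python
import io
import pandas as pd

# A small CSV held in a string stands in for a file on disk.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))
on_disk_bytes = len(csv_text.encode("utf-8"))

df = pd.read_csv(io.StringIO(csv_text))

# deep=True includes the real cost of object (string) columns,
# not just the size of the pointers.
in_memory_bytes = int(df.memory_usage(deep=True).sum())

print(f"on disk: {on_disk_bytes} B, in memory: {in_memory_bytes} B")
```

The exact ratio depends on the dtypes involved (strings inflate far more than numeric columns), which is why the three-times figure is only a rule of thumb.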
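The consistent fit/transform/predict API and the pipeline mechanism described above can be sketched in a few lines. The dataset here is synthetic, generated purely for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 200 samples, 5 features, fixed seed for reproducibility.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# A Pipeline chains transformers (fit/transform) with a final estimator
# (fit/predict); a single fit call learns every step in order, which is
# what makes the analysis reproducible.
pipe = Pipeline([
    ("scale", StandardScaler()),    # learns mean/std in fit, applies them in transform
    ("model", LogisticRegression()),
])

pipe.fit(X, y)             # learn the scaling parameters, then the model
preds = pipe.predict(X)    # scale incoming data, then classify it
print(preds.shape)         # → (200,)
```

Because every estimator follows the same API, any transformer or model can be swapped into the pipeline without changing the surrounding code.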