Building Data Science Applications with FastAPI: Develop, manage, and deploy efficient machine learning applications with Python

Steps to use Python in data science:
Thanks to an extremely rich ecosystem of APIs and libraries, Python makes it possible to process data of a wide variety of types (SQL as well as NoSQL) and to drive advanced processing tools, in particular Spark, through PySpark, for the massively parallel processing of so-called big data.
In this article, my goal is to present a fairly standard process for developing data processing tools in Python, introducing the appropriate libraries along the way.
Python has taken the lead over many other languages thanks to three factors:
- The simplicity of the language: for an object-oriented programming language, Python has a very fast learning curve. A few days are enough to acquire the basics and become operational.
- The multitude of libraries (also called packages): installing a library in Python is extremely simple, which has encouraged research teams to publish many specialized libraries.
- The impressive number of APIs to other programs and environments: it is extremely easy to connect Python to other environments.
These three points make Python the language of choice for many projects, especially in data processing and data science.
Setting up a Python project in data science:
Whatever your project, big data, IoT (internet of things) or "classic" data processing, a number of questions must be asked when setting it up.
1- The I/O (input/output): This means defining the inputs and outputs in the broadest sense of the word: the input information and the objectives. In an IoT project, for example, the input will be the readings taken by the device, and the output will be either to display those readings, to display decisions to be taken, or to take a decision directly. Once these inputs and outputs are defined at the global level, they must be specified at the local level, that is, the format of the input and output data. Is the data retrieved in real time, stored in databases, stored in files? For the outputs, in the same way, we have to decide on the format of the data to be returned.
Should the results be stored in databases, or transmitted to objects in the form of commands? The answer to these questions will allow you to define the tools to use in your Python data science program.
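To make this concrete, here is a minimal sketch of such an I/O pipeline, assuming pandas and the standard sqlite3 module; the file names, table names and column names are purely illustrative assumptions, not a prescribed setup.

```python
# Minimal I/O sketch with pandas and the standard sqlite3 module.
# File names, table names and column names are illustrative assumptions.
import sqlite3

import pandas as pd

# Input 1: data stored in a flat file (CSV).
readings = pd.read_csv("sensor_readings.csv")

# Input 2: data stored in a classic SQL database (here SQLite).
with sqlite3.connect("iot.db") as conn:
    devices = pd.read_sql_query("SELECT device_id, location FROM devices", conn)

# A simple processing step: average reading per device.
result = (
    readings.merge(devices, on="device_id")
            .groupby(["device_id", "location"], as_index=False)["value"]
            .mean()
)

# Output 1: store the results back in the database.
with sqlite3.connect("iot.db") as conn:
    result.to_sql("daily_averages", conn, if_exists="replace", index=False)

# Output 2: export the results as a file for another program.
result.to_csv("daily_averages.csv", index=False)
```

Depending on your answers to the questions above, only the boundaries of this sketch change: the read_csv/read_sql_query calls on the input side, or the to_sql/to_csv calls on the output side, would be swapped for the appropriate connector (real-time stream, NoSQL store, message to a device, and so on).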
2- Volumetrics: Today there is a lot of talk about big data, and for many of you it is still a vague concept. The question that arises for each project is: should I use so-called "big data" technologies, or can I keep the technologies I currently use? The answer, of course, depends on the context. We generally distinguish four cases:
- My data is already structured in databases (SQL type) and I do not foresee any rapid growth in volume. In this case, classic technologies such as MySQL or SQLite are suitable; on the Python side, this means importing libraries that let you run SQL queries and then processing the results with Python tools.
- My data is structured, not very large, and I have no strong time constraints (daily or weekly runs, no real time). In this case, we will simply import the data with Python tools from databases or flat files.
- The mass of data is very large and I need to perform heavy analytical processing with machine learning algorithms. In this case, you will have to set up massively parallel computations using so-called "big data" tools. Apache Spark, relying on Hadoop clusters to store your data, will be preferred. Python lets you manage the whole process, in particular through the PySpark API. I will not go into detail about this kind of case in this article; I will soon publish an article about Apache Spark and all its specificities.
- The amount of data is large but the processing is light. In this case, we can use Hadoop clusters for storage and MapReduce-style processing for analysis. Python can also be used here through a MapReduce API.
Python will serve you in all four cases; in this article I focus on the first two, with the use of machine learning algorithms (see the sketch below).
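As an illustration of the first two cases, the sketch below queries a small SQL database (SQLite here; the database, table and column names are assumptions made for the example) and hands the result to a scikit-learn machine learning algorithm; any other estimator would fit the same pattern.

```python
# Sketch of cases 1 and 2: moderately sized, structured data queried from a
# classic SQL database, then fed to a machine learning algorithm.
# The database, table and column names are illustrative assumptions.
import sqlite3

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Case 1: data already structured in a SQL database (SQLite here; MySQL would
# work the same way through a suitable connector).
with sqlite3.connect("customers.db") as conn:
    data = pd.read_sql_query("SELECT age, income, churned FROM customers", conn)

# Case 2 would simply replace the query with a flat-file import, e.g.:
# data = pd.read_csv("customers.csv")

X = data[["age", "income"]]
y = data["churned"]

# Classic train/test split followed by a machine learning algorithm.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The same structure applies whether the data comes from MySQL, SQLite or a flat file: only the loading step changes, while the machine learning part stays identical.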