# Basic Statistics & Regression for Machine Learning in Python

A quick and easy guide to statistical regression for machine learning.

Hello and welcome to my new course, Basic Statistics and Regression for Machine Learning.

You know, there are mainly two kinds of ML enthusiasts.

The first kind fantasize about Machine Learning and Artificial Intelligence, thinking that it’s some kind of magical voodoo. Even if they are into coding, they will just import the library, use the class and its functions, and rely on those functions to do the magic in the background.

The second kind are curious people. They want to learn what’s actually happening behind the scenes of those classes and functions. Even if they don’t want to go deep into all the mathematical complexities, they still want to understand what’s going on, at least from a layman’s perspective.

In this course, we are focusing mainly on the second kind of learners.

That’s why this is a special kind of course. Here we discuss the basics of Machine Learning and the mathematics of statistical regression, which underpins many Machine Learning algorithms.

We will work through the regression exercises by hand, using plain mathematical calculations, and then compare the results with the ones we get from ready-made Python functions.

Here is the list of contents that are included in this course.

In the first session, we will set up the computer for the basic machine learning Python exercises. We will install Anaconda, a Python distribution, and then discuss the components included in it. For the manual method, a spreadsheet program like MS Excel is enough.

Before we proceed, for those who are new to Python, we have included a few sessions covering the very basics of the Python programming language. We will learn assignment, flow control, lists and tuples, dictionaries, and functions. We will also take a quick peek at NumPy, a Python library for array and matrix calculations that is very useful for machine learning, and have an overview of Matplotlib, a Python plotting library used for drawing graphs.

In the third session, we will discuss the basics of machine learning and different types of data.

In the next session, we will learn a statistical technique called central tendency analysis, which finds a single central value that best describes a set of data. In statistics, the three common measures of central tendency are the mean, median, and mode. We will find all three, both by manual calculation and with Python functions.
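As a minimal sketch of what this comparison could look like, the snippet below finds the mean, median, and mode by hand and then with Python's built-in `statistics` module. The `scores` list is made-up illustration data, not the course dataset.

```python
import statistics

# Hypothetical sample data, made up for illustration (not the course dataset)
scores = [2, 4, 4, 5, 7, 9, 11]

# Manual calculation
ordered = sorted(scores)
mean_manual = sum(ordered) / len(ordered)
# There is an odd number of values, so the median is simply the middle one
median_manual = ordered[len(ordered) // 2]
# The mode is the value that occurs most often
mode_manual = max(set(ordered), key=ordered.count)

# The same three measures from the standard library
mean_lib = statistics.mean(scores)
median_lib = statistics.median(scores)
mode_lib = statistics.mode(scores)

print("manual :", mean_manual, median_manual, mode_manual)
print("library:", mean_lib, median_lib, mode_lib)
```

For a list with an even number of values, the manual median would be the average of the two middle values instead.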

After that, we will look at the statistical measures called variance and standard deviation. The variance of a dataset measures how far the numbers are spread out from their mean: it is the average of the squared deviations from the mean. The standard deviation is the square root of the variance, so it expresses that spread in the same units as the data. We will first calculate the variance and standard deviation manually, using plain mathematical calculations. After that, we will write a Python program to find both values for the same dataset and verify the results.
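Here is a small sketch of that check, assuming NumPy is available. The data values are hypothetical, and the calculation uses the population formulas (dividing by n), which is also NumPy's default.

```python
import numpy as np

# Hypothetical sample data, made up for illustration
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Manual calculation: average of the squared deviations from the mean
mean = data.sum() / data.size
variance_manual = ((data - mean) ** 2).sum() / data.size
std_manual = variance_manual ** 0.5

# Same values from NumPy (population variance by default, i.e. ddof=0)
variance_np = np.var(data)
std_np = np.std(data)

print("manual:", variance_manual, std_manual)
print("numpy :", variance_np, std_np)
```

If the sample formulas (dividing by n - 1) are wanted instead, `np.var(data, ddof=1)` and `np.std(data, ddof=1)` give those values.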

Then comes a simple yet very useful technique called the percentile. In statistics, a percentile is a score below which a given percentage of scores in a distribution falls. For easy understanding, we will first try an example with a manual calculation of the percentile on a raw dataset, and later we will do it with the help of Python functions. We will then double-check the results.
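The sketch below shows one way the manual and Python results could be compared. Several slightly different percentile definitions exist; this example assumes the linear-interpolation rule, which is NumPy's default. The `scores` data and the helper name `percentile_manual` are hypothetical.

```python
import numpy as np

# Hypothetical exam scores, made up for illustration
scores = [55, 60, 62, 68, 70, 71, 75, 80, 85, 90, 95]


def percentile_manual(values, q):
    """Percentile by linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100  # fractional index into the sorted data
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


p75_manual = percentile_manual(scores, 75)
p75_numpy = np.percentile(scores, 75)

print("manual:", p75_manual)
print("numpy :", p75_numpy)
```

A spreadsheet or textbook that uses a different rank rule may give a slightly different number for the same data, so it is worth checking which definition is in use before comparing.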

After that, we will learn about distributions. A distribution describes how the samples in a dataset are grouped, or how densely they fall across the range of values. We will look at two types: the normal distribution, where the probability of x is highest at the centre and lowest at the ends, and the uniform distribution, where the probability of x is constant. We will explore both distributions by visualizing data, using manual calculation methods and also Python.
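To make the difference concrete, here is a rough sketch that draws random samples from both distributions with NumPy and counts how many fall into each of ten equal-width bins. The sample size, seed, and distribution parameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the run is repeatable

# 100,000 samples from each distribution (parameters chosen arbitrarily)
normal_samples = rng.normal(loc=5.0, scale=1.0, size=100_000)
uniform_samples = rng.uniform(low=0.0, high=10.0, size=100_000)

# Count the samples falling into 10 equal-width bins
normal_counts, _ = np.histogram(normal_samples, bins=10)
uniform_counts, _ = np.histogram(uniform_samples, bins=10)

print("normal :", normal_counts)   # counts peak in the middle bins
print("uniform:", uniform_counts)  # counts are roughly equal in every bin
```

Passing the same sample arrays to Matplotlib's `plt.hist` would draw these counts as the familiar bell shape and flat block.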

There is a value in statistics called the z-score, or standard score, which tells us where a value lies in the distribution, measured as the number of standard deviations from the mean. For the z-score, we will first try the calculation using Python functions. Later, we will calculate the z-score manually and compare the results.
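As a minimal sketch, the z-score of a value is its distance from the mean divided by the standard deviation. The example below assumes NumPy and the population standard deviation, and uses made-up data.

```python
import numpy as np

# Hypothetical sample data, made up for illustration
data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

mean = data.mean()
std = data.std()  # population standard deviation (ddof=0)

# z-score of every value: how many standard deviations it lies from the mean
z_scores = (data - mean) / std

# Manual check for the single value 9
z_of_nine = (9 - mean) / std

print(z_scores)
print(z_of_nine)
```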

Those were the cases of a single-valued dataset, that is, a dataset containing only a single column of data. For a multi-variable dataset, we have to calculate the regression, or the relation between the columns of data. At first, we will visualize the data and analyse its form and structure using a scatter plot.

Then, as the first type of regression analysis, we will start with an introduction to simple linear regression. At first, we will find the coefficient of correlation by manual calculation and store the results. After that, we will use those results to find the slope equation, that is, the equation of the regression line. Then, using the slope equation, we will predict future values. This prediction is a basic and important feature of Machine Learning systems: we give the input variable, and the system predicts the value of the output variable.

Then we will repeat all of this using Python NumPy library methods, do the future value prediction, and compare the results. We will also discuss the scenarios in which we can consider a linear relationship strong or weak.
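The steps just described can be sketched as follows: find the correlation coefficient r, derive the slope and intercept from it, predict a new value, and then compare with NumPy. The x and y values are hypothetical, and `np.polyfit` and `np.corrcoef` are just one possible choice of NumPy functions, not necessarily the ones used in the course.

```python
import numpy as np

# Hypothetical paired data, made up for illustration
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Manual method: correlation coefficient from the deviations
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# Slope and intercept of the regression line, derived from r
slope_manual = r_manual * y.std() / x.std()
intercept_manual = y.mean() - slope_manual * x.mean()

# Predict y for a new x value using the line equation
x_new = 6.0
y_pred = slope_manual * x_new + intercept_manual

# Same quantities from NumPy
slope_np, intercept_np = np.polyfit(x, y, 1)
r_np = np.corrcoef(x, y)[0, 1]

print("manual:", r_manual, slope_manual, intercept_manual, y_pred)
print("numpy :", r_np, slope_np, intercept_np)
```

A value of r close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests a weak one.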

Then we will see another type of regression analysis technique called polynomial linear regression, which is best suited to cases where the relation between the independent variable x and the dependent variable y is not a straight line.

For simple linear regression, the regression line in the graph will be a straight line with a slope; for polynomial linear regression, it will be a curve.

In the coming sessions, we will have a brief introduction to polynomial linear regression and a visualization of the modified dataset with x and y values. Using Python, we will then find the polynomial regression coefficients and the r-squared (r2) value, and do future value prediction using the NumPy library.
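A rough sketch of this exercise is shown below. It assumes a second-degree polynomial and made-up data, and it computes the r-squared value directly from its definition rather than with any particular library helper.

```python
import numpy as np

# Hypothetical data that roughly follows a curve (made up for illustration)
x = np.arange(7, dtype=float)
y = np.array([3.1, 1.9, 3.2, 5.8, 11.1, 18.0, 26.9])

# Fit a degree-2 polynomial; coefficients come back highest power first
coefficients = np.polyfit(x, y, 2)
model = np.poly1d(coefficients)

# r-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = ((y - model(x)) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# Predict y for an x value beyond the observed range
y_future = model(8.0)

print("coefficients:", coefficients)
print("r-squared   :", r_squared)
print("prediction  :", y_future)
```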

Then we will repeat the same using the plain old manual calculation method. At first, we have to manually find the standard deviation components. Later, we will substitute these components into the equations to find the a, b and c values. Using these a, b and c values, we will then write the final polynomial regression equation. This equation will enable us to do a manual prediction for future values.
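One way to mirror the manual method in code is to build the three least-squares normal equations from plain sums and solve them for a, b and c. This sketch assumes the equation has the form y = a·x² + b·x + c, which may not match the course's naming, and it works from raw sums rather than the course's own intermediate components. The data is made up.

```python
import numpy as np

# Hypothetical data, made up for illustration
x = np.arange(7, dtype=float)
y = np.array([3.1, 1.9, 3.2, 5.8, 11.1, 18.0, 26.9])
n = x.size

# The sums that a spreadsheet would hold in its helper columns
sum_x, sum_x2, sum_x3, sum_x4 = x.sum(), (x**2).sum(), (x**3).sum(), (x**4).sum()
sum_y, sum_xy, sum_x2y = y.sum(), (x * y).sum(), (x**2 * y).sum()

# Normal equations for y = a*x^2 + b*x + c (assumed naming)
left = np.array([
    [sum_x4, sum_x3, sum_x2],
    [sum_x3, sum_x2, sum_x],
    [sum_x2, sum_x,  n],
])
right = np.array([sum_x2y, sum_xy, sum_y])
a, b, c = np.linalg.solve(left, right)

# Manual prediction using the final equation
x_new = 8.0
y_pred = a * x_new**2 + b * x_new + c

print("a, b, c   :", a, b, c)
print("prediction:", y_pred)
print("polyfit   :", np.polyfit(x, y, 2))  # should agree with a, b, c
```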

And after that comes multiple regression. In this type of regression, we can consider multiple independent x variables and one dependent y variable. We will have an introduction to this type of regression and make the necessary changes to our dataset to match the multiple regression requirements.

Since our dataset is getting more complex with the introduction of multiple independent variable columns, a plain array may no longer be enough to manage it. We will use a CSV (comma-separated values) file to save the dataset. We will have an exercise to read data from a CSV file and load it into data frames. Once we have the data imported into our Python program, we will visualize it using a new library called seaborn, which is built on top of Matplotlib.
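As a minimal sketch of the CSV exercise, the snippet below assumes the data frames come from the pandas library, which the description does not name explicitly. The column names `x1`, `x2` and `y` and the values are hypothetical, and the CSV text is held in a string so that the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk
csv_text = """x1,x2,y
1,2,13
2,1,12
3,4,23
4,3,22
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Split the columns into independent variables and the dependent variable
features = df[["x1", "x2"]]
target = df["y"]

print(df.head())
```

From here, a call such as seaborn's `pairplot(df)` could be used to draw scatter plots of every pair of columns.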

Using the Python NumPy and scikit-learn libraries, multiple regression can be done very easily: just call the method and pass in the required parameters, and the library does the rest. We will create the regression object and then use it to predict future values.
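To keep the sketch dependency-light, the example below uses NumPy's least-squares solver instead of a scikit-learn regression object; scikit-learn's `LinearRegression` class, with its `fit` and `predict` methods, fits the same kind of model. The data is hypothetical and deliberately built from a known equation so that the result is easy to check.

```python
import numpy as np

# Hypothetical data built from y = 5 + 2*x1 + 3*x2 so the answer is known
x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5], dtype=float)
y = 5 + 2 * x1 + 3 * x2

# Design matrix: a column of ones for the intercept, then one column per variable
design = np.column_stack([np.ones(x1.size), x1, x2])

# Least-squares fit; the first returned item holds [intercept, coef_x1, coef_x2]
coefficients, _, _, _ = np.linalg.lstsq(design, y, rcond=None)

# Predict y for a new pair of inputs (x1=7, x2=8)
y_pred = np.array([1.0, 7.0, 8.0]) @ coefficients

print("intercept, coef_x1, coef_x2:", coefficients)
print("prediction:", y_pred)
```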

But with manual calculation, things start getting complex. It’s a lengthy calculation that needs to be done in multiple steps. In the first step, we will have an introduction to the equations that we are going to use in the manual method, and we will find the mean values. In the second step, we will find the components that are required to find the a, b and c values. In the third step, we will find the a, b and c values. And in the final step, using the a, b and c values, we will write the multiple regression equation and use it to predict future values for our dataset. We will also try to get the value of the coefficient of regression.
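The four steps above can be sketched in code for the two-variable case. This is an illustration under assumptions: the equation is taken to be y = intercept + b1·x1 + b2·x2, the variable names are hypothetical (the course's a, b and c may map differently), and the data is built from a known equation so the result can be checked.

```python
import numpy as np

# Hypothetical data built from y = 5 + 2*x1 + 3*x2 so the answer is known
x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5], dtype=float)
y = 5 + 2 * x1 + 3 * x2

# Step 1: mean values
mean_x1, mean_x2, mean_y = x1.mean(), x2.mean(), y.mean()

# Step 2: sums of squares and cross-products of the deviations from the means
d1, d2, dy = x1 - mean_x1, x2 - mean_x2, y - mean_y
s11, s22, s12 = (d1 * d1).sum(), (d2 * d2).sum(), (d1 * d2).sum()
s1y, s2y, syy = (d1 * dy).sum(), (d2 * dy).sum(), (dy * dy).sum()

# Step 3: the two slopes and the intercept
denominator = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / denominator
b2 = (s11 * s2y - s12 * s1y) / denominator
intercept = mean_y - b1 * mean_x1 - b2 * mean_x2

# Step 4: use the regression equation to predict y for x1=7, x2=8
y_pred = intercept + b1 * 7 + b2 * 8

# R-squared of the fit (taken here as the "coefficient of regression": an assumption)
r_squared = (b1 * s1y + b2 * s2y) / syy

print("intercept, b1, b2:", intercept, b1, b2)
print("prediction:", y_pred)
print("r-squared :", r_squared)
```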

That’s all about the popular regression methods included in the course. Now we can go ahead with a very important topic in data preparation for machine learning. Many machine learning algorithms work better when their input values are scaled to a standard range. We will learn a technique called data normalization, in which all the different ranges of data values are scaled to fit within a range of 0 to 1. Strictly speaking, standardization is a related but different technique that rescales values to a mean of 0 and a standard deviation of 1. For algorithms that are sensitive to the scale of their inputs, scaling can noticeably improve performance compared to a non-scaled dataset.

For normalization too, just like the regression examples, we will first try it using Python code, which makes it very easy to generate the values. Later, we will repeat it with plain old-school mathematical calculations.
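As a minimal sketch, scaling to the 0 to 1 range can be written with the min-max formula (x - min) / (max - min). The example assumes NumPy and made-up values; the course itself may use a different library helper.

```python
import numpy as np

# Hypothetical column of values, made up for illustration
values = np.array([10, 20, 30, 40, 50], dtype=float)

# Min-max normalization: the smallest value maps to 0, the largest to 1
normalized = (values - values.min()) / (values.max() - values.min())

# The same formula applied by hand to the single value 20
manual_20 = (20 - 10) / (50 - 10)

print(normalized)
print(manual_20)
```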

In the final session, we will discuss more resources that you can follow to go further from what we have learned.

That’s all about the topics currently included in this quick course. The code, notepad and Jupyter notebook files used in this course have been uploaded and shared in a folder. I will include the link to download them in the last session or the resource section of this course. You are free to use the code in your projects with no questions asked.

Also after completing this course, you will be provided with a course completion certificate which will add value to your portfolio.

So that’s all for now, see you soon in the classroom. Happy learning and have a great time.