Info for your AI

Randy Taylor
7 min read · Feb 7, 2021

Software is modeled after the real world. Much of the time you can draw strong comparisons between the two: a class in Python can have children that inherit “traits” from their parent, a software tree can have a root node and many “leaf” nodes, and so on. Artificial Intelligence follows this pattern of software being modeled after the real world. So first let’s ask: what makes a person intelligent? This is a broad question and hard to strictly define. However, we can all agree that intelligence has a “knowing” component and a “predicting” component. If you have knowledge of things and are able to make what I will call a “good guess” about things, that shows intelligence.

Artificial Intelligence, or AI, does exactly these two things. However, for a computer program to have AI it needs a source of information to supply the “knowing” component. This gives the AI something to analyze, process, and form a working “mental model” from.

Exploratory Data Analysis (EDA) is a means to examine this information and pre-process it so it is ready for the machine. We will call this information “data” from here on out, and through EDA we will convert raw data into something more like information: usable and meaningful.

As we have talked about, Machine Learning is a subclass of, and even fundamental to, AI. The data in EDA has two classifications: structured and unstructured. Structured data lives inside a structure like an Excel spreadsheet, relational table, or dataframe. The other type, unstructured data, typically comes in a media format: videos, pictures, audio files (mp3, mp4), and so on.

With structured data, also known as “tidy data,” there are specific concepts to understand that will help us process it. For example, there is the idea of central tendency: this data typically hovers around the mean, or arithmetic average, where we simply add all the values and divide by the number of values added. This series on central tendency is exceptionally well explained if you want a deeper dive on the topic. However, the mean can be inappropriate and misleading, because outliers may skew it and make it unrepresentative of the data. If this is the case, then the median (middle value) or mode (value appearing most often) can be used instead.
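A quick sketch of these three measures with pandas (the values here are made up, with one obvious outlier):

```python
import pandas as pd

# A small made-up sample; 1000 is an outlier that drags the mean upward
values = pd.Series([2, 3, 3, 4, 5, 1000])

print(values.mean())     # 169.5 -- badly skewed by the outlier
print(values.median())   # 3.5   -- resistant to the outlier
print(values.mode()[0])  # 3     -- the most frequent value
```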

We should also consider dispersion: the magnitude of how much our data differs from the measure of central tendency. This gives us information on how much variability there is between values within a dataset. Specifically, measurements of dispersion give us the range (max − min), the standard deviation, the interquartile range, and the variance.

Standard deviation (SD) is the square root of the variance; conversely, the variance is the standard deviation squared. Variance tells us how far values are, on average, from the mean: it is the average of the squared distances from the mean, so a variance of zero tells us every value equals the mean. Because the SD is expressed in the same units as the data itself, it gives us the most interpretable big-picture view of the spread.
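A minimal NumPy sketch of that relationship (the numbers are made up):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

variance = data.var()  # mean of the squared distances from the mean
std_dev = data.std()   # square root of the variance

print(variance)                      # 4.0
print(std_dev)                       # 2.0
print(np.sqrt(variance) == std_dev)  # True
```

Note that NumPy defaults to the population variance (dividing by n), while pandas’ .var() defaults to the sample variance (dividing by n − 1).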

Quartiles are a way to divide the data into four parts. This is helpful to protect your analysis from outliers that would skew it. Quartile 2, or Q2, is the median: 50% of the data lies above it and 50% below. Q1, the lower quartile, is the median of the lower half of the values, and Q3, the upper quartile, is the median of the upper half, so 25% of the total values lie above Q3. We can use these values to measure how spread out our data is with the interquartile range, IQR = Q3 − Q1, which is resistant to extreme values. The IQR is a good measurement of dispersion.
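A sketch with pandas (again with made-up values, including one extreme outlier):

```python
import pandas as pd

data = pd.Series([1, 3, 5, 7, 9, 11, 13, 15, 100])  # 100 is an outlier

q1 = data.quantile(q=0.25)
q2 = data.quantile(q=0.50)  # the median
q3 = data.quantile(q=0.75)

print(q1, q2, q3)       # 5.0 9.0 13.0
print("IQR:", q3 - q1)  # 8.0 -- the outlier at 100 has no effect on it
```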

Another measure of dispersion is the coefficient of variation. This measurement is calculated by dividing the standard deviation by the mean and then multiplying by 100. It gives us an idea of the magnitude of variation relative to the size of the values themselves, which makes it useful for comparing datasets on different scales.
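For example (made-up values again):

```python
import numpy as np

data = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# Coefficient of variation: standard deviation relative to the mean, in percent
cv = data.std() / data.mean() * 100
print(round(cv, 1))  # 20.2
```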

Further, we can look at an individual value within a dataset and get an idea of how unusual it is with a z-score. This is calculated by subtracting the mean from the value and then dividing by the standard deviation, and it tells us how many standard deviations above or below the mean the value is. We can also look at the five-number summary: the min, Q1, median (Q2), Q3, and max. Skewness tells us about the shape of the dataset. If the median is not the same as the mean, the data is skewed. If the median is lower than the mean, the data has positive (right) skewness, which suggests outliers pulling the mean to the right.
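A sketch of both ideas with pandas (made-up values):

```python
import pandas as pd

data = pd.Series([2, 3, 3, 4, 5, 6, 7, 8, 50])

# z-score: how many standard deviations is a value from the mean?
value = 50
z = (value - data.mean()) / data.std()
print(round(z, 2))  # ~2.64 standard deviations above the mean

# Five-number summary: min, Q1, median (Q2), Q3, max
print(data.describe()[["min", "25%", "50%", "75%", "max"]])
```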

We just covered a lot of statistical math for our data science. Now let’s turn to the keyboard and work with the topics above.

A boxplot is a graphical representation of dispersion, quartiles, and skewness. A boxplot’s whiskers conventionally cut off at 1.5 times the interquartile range beyond the quartiles; values past that point are drawn as individual outliers.
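A minimal Matplotlib sketch (the data is made up so the outlier is visible):

```python
import matplotlib.pyplot as plt

data = [2, 3, 3, 4, 5, 6, 7, 8, 50]  # 50 will appear as an outlier point

plt.boxplot(data)  # whiskers extend at most 1.5 * IQR past the quartiles
plt.title("Boxplot of a small sample")
plt.show()
```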

We can expand these ideas to multiple variables and see if we can find any correlation. First we can calculate the covariance between two variables: a positive value indicates the two move together, while a negative value tells us the two are inversely related. Because covariance depends on the units of the variables, we normalize it into the correlation coefficient (Pearson’s r) by dividing the covariance by the product of the two standard deviations, which gives a value between −1 and 1.
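Here is a sketch of that formula in NumPy, checked against NumPy’s own np.corrcoef (the x and y values are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Covariance: average product of the deviations from each mean
cov = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson's r: covariance divided by the product of the standard deviations
r = cov / (x.std() * y.std())

print(round(r, 3))                        # 0.853
print(round(np.corrcoef(x, y)[0, 1], 3))  # 0.853 -- same value from NumPy
```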

Now that we have talked about the mathematics important to data science, we can look at the built-in methods given to us by the data science libraries NumPy, pandas, Matplotlib, and Seaborn.

Matplotlib offers a histogram method that shows the ideas we have been discussing. A histogram is a graph of numeric data that lets you observe the frequency of different values. Calling matplotlib.pyplot.hist(x=”your_column”, data=your_data) will give you the desired graph. You can do EDA visually with a histogram.
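For example, with a small made-up DataFrame:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"area": [0, 0, 0, 1, 2, 2, 5, 10, 40]})  # made-up values

# The data keyword lets you pass column names as strings
plt.hist(x="area", data=df, bins=10)
plt.xlabel("area")
plt.ylabel("frequency")
plt.show()
```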

I have been using this great repository of Machine Learning datasets at the University of California, Irvine. I pulled the CSV file for forest fires from the repo.

Import the libraries in a Jupyter notebook:
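A sketch of that setup; the download URL is my assumption of where the UCI repository hosts the forest-fires CSV, so verify it before running:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Forest-fires dataset from the UCI Machine Learning Repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv"
fire = pd.read_csv(url)

plt.hist(x="area", data=fire, bins=50)
plt.xlabel("area burned (hectares)")
plt.ylabel("frequency")
plt.show()
```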

As you can see from the histogram, the most common area burned is 0. Not very exciting, but great for the earth.

A pandas DataFrame has methods built in, such as DataFrame.mean(), .mode(), .median(), and .quantile(q=0.50), where q is the quantile you want. As you can see above, we took the mean and median from our fire DataFrame and plotted them on the histogram in red and yellow.
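A sketch of how that overlay can be produced, continuing with the fire DataFrame loaded above:

```python
import matplotlib.pyplot as plt

# Continuing with the fire DataFrame loaded earlier
fire["area"].hist(bins=50)
plt.axvline(fire["area"].mean(), color="red", label="mean")
plt.axvline(fire["area"].median(), color="yellow", label="median")
plt.legend()
plt.show()
```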

With quantile you can calculate the interquartile range with DataFrame.quantile(q=0.75) − DataFrame.quantile(q=0.25) for all columns of your DataFrame. However, you can also do this visually with a Seaborn boxplot, which shows the IQR as the box and the 1.5 × IQR cutoffs as whiskers. A Seaborn boxplot graphically shows us the quartiles.
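Roughly like this, continuing with the fire DataFrame:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Continuing with the fire DataFrame loaded earlier.
# IQR for every numeric column at once:
iqr = fire.quantile(q=0.75, numeric_only=True) - fire.quantile(q=0.25, numeric_only=True)
print(iqr)

# The same information visually, for a single column
sns.boxplot(x=fire["temp"])
plt.show()
```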

However, pandas also has a boxplot method you can use on an entire DataFrame.
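For example, continuing with the same DataFrame:

```python
import matplotlib.pyplot as plt

# One box per numeric column, all on the same axes
fire.boxplot(figsize=(12, 6))
plt.show()
```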

There are built-in methods for variance and standard deviation, such as DataFrame.var() and DataFrame.std().

There are also DataFrame.cov() and DataFrame.corr() methods that will give us the covariance and correlation matrices.
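A quick sketch of all four methods, continuing with the fire DataFrame:

```python
# Variance and standard deviation of a single column
print(fire["temp"].var())
print(fire["temp"].std())

# Covariance and correlation matrices across all numeric columns
print(fire.cov(numeric_only=True))
print(fire.corr(numeric_only=True))
```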

But these are just numerical representations of the data. We can also plot it graphically. Seaborn has a pairplot with a regression option that will draw a linear-regression line for each pair of variables.
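Something like the following, using a subset of columns so the grid stays readable (the column names come from the forest-fires dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# kind="reg" draws a linear-regression line on every pairwise scatter plot
sns.pairplot(fire[["FFMC", "DMC", "DC", "ISI", "temp"]], kind="reg")
plt.show()
```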

Just by looking at the linear regression lines you can gauge what is related to what. I can tell you that in the graph above the Fire Weather Index (FWI) and the Fine Fuel Moisture Code (FFMC) show a strong regression line.

Even better, we can generate a heatmap that shows, in graphical color, which areas are “hot,” or correlated, with:
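A sketch of that heatmap, continuing with the fire DataFrame:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# annot=True prints each correlation value inside its cell
sns.heatmap(fire.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```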

As you can see in the graph, the diagonal is always perfectly correlated (1) because each variable is paired with itself on the other axis. But you can see that FFMC and DMC show a 0.38 correlation and temp and DC show 0.5. These are comparatively high correlations.
