Let's get real about AI: breaking it down into simple parts.

Randy Taylor
6 min read · Feb 15, 2021

Full stack development is a beautiful thing. As a full stack developer I love to start at the database, then build a back-end to process that data, and finally finish with a front-end where users can interact with it. This has been the best way for me to build full stack systems: typically I start at the back and move forward. However, you can go in reverse, starting at the front and moving backward to end with a database.

The idea is that many different computer languages and/or frameworks come together to make the full technical stack of a program, working together to create something fully functional. It's important to understand that these languages would not be very useful alone. A PostgreSQL database would be useless without a backend program like Ruby on Rails to put entries into the relational tables and retrieve them. Full stack is really the idea of developing from the storage database all the way to the person who interacts with the program, and all the stops in between.

AI, not surprisingly, has a full-stack likeness to it. Let me explain. You can probably guess that the metaphorical "database" of an AI program is the training dataset it is built upon. AI has many different layers, but the majority of these systems rest on a Machine Learning component. The 'database' in our analogy is a large training dataset where the machine "learns" about the past. In this respect the data will always come from past events; data represents the past history of the world. Fortunately we can use the past to make predictions about the future. That is how people in the real world get smart: we learn from the past to make better predictions about the future. That is exactly what Machine Learning does.

The initial step of creating an AI system is preparing our training dataset. We talked about EDA (exploratory data analysis) in the last post, and we will review it here. To get this data prepared for Machine Learning you need to start with visualization. Ironically, AI still needs human eyes to 'eyeball' the data and give it that broad, general-intelligence analysis. We can use libraries such as matplotlib and seaborn to simply graph the data.

Regardless of what our visualization shows us, we still have to quantify it. This is a cool resource for data science and looking at the big picture of the datasets you will need to observe, and here is another resource on getting the mean of a scatter plot dataset. I am posting these because, as I mentioned, we need to visualize the data first. Then we try to quantify properties like the mean, median, skew, variance, standard deviation, covariance, correlation, and interquartile range. These mathematical measurements can tell us a lot about our data without misleading us; a correlation we think we see in a graph might be nothing more than an optical illusion. We can deal with outliers and NaN/NA values, and determine whether there is a positive linear relationship, a negative one, or perhaps no relationship at all. It's important to use mathematical calculations to determine this, as humans can read bias into data. For example, an outlier is mathematically defined as a value more than 1.5 times the interquartile range above the third quartile or below the first quartile.

We are also looking for what is called "noise" in the data. Noise can be thought of as data that is not representative of the past and thus not relevant to predicting the future. For example, a house dataset with price, location, and number of bedrooms might contain a house that is very cheap yet has a very large number of bedrooms. It is very possible this house does not actually exist and was put in the dataset by accident, or a slick realtor manipulated the data to get calls into his real-estate office. Either way, the entry is not representative of the real world and can be discarded. We will talk more about dealing with outliers and noise in the future. It is important not to let these outliers and noise drive the training, as they will end up giving us bad predictions about the future; a model that fits the noise is called an "overfitting" model. A model that fails to capture enough of the real pattern in the training data is "underfitting," and again will result in bad predictions for the future.
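As a rough sketch of that first pass, here is what the eyeballing and quantifying steps might look like in Python. The DataFrame house_df and the file houses.csv are hypothetical stand-ins for whatever dataset you are exploring.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical dataset of houses with price, location, and bedrooms.
house_df = pd.read_csv('houses.csv')

# Eyeball the data first with a simple graph.
sns.scatterplot(x='bedrooms', y='price', data=house_df)
plt.show()

# Then quantify it: mean, median, standard deviation, variance, covariance.
print(house_df['price'].mean(), house_df['price'].median())
print(house_df['price'].std(), house_df['price'].var())
print(house_df[['bedrooms', 'price']].cov())

# Flag outliers: values more than 1.5 times the interquartile range
# above the third quartile or below the first quartile.
q1, q3 = house_df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = house_df[(house_df['price'] < q1 - 1.5 * iqr) |
                    (house_df['price'] > q3 + 1.5 * iqr)]
print(outliers)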

So we are attempting to identify what is referred to as information, and to exclude noise, when training our machine. Once we have cleaned our data, we start with the most basic form of AI: a linear relationship between two variables. If we can find a clear relationship, then knowing one variable lets us predict the other. The great thing about such a direct relationship is that you can use it to create a model, and from the model you can predict one variable with reasonable accuracy whenever you know the other.

Now we try to determine if there is correlation between the two variables (you can read about correlation in depth here). We will be using linear regression, one of the regression techniques.
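Continuing the hypothetical house example, a quick correlation check might look like this (pandas uses the Pearson correlation by default):

# Correlation between the two variables; values near +1 or -1 suggest a
# strong linear relationship, values near 0 suggest little or none.
print(house_df[['bedrooms', 'price']].corr())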

Now we need a measure of how good our prediction is. Start with an introduction to residuals and the sum of squared errors; Khan Academy then breaks down the exact math (that you will not have to do by hand) for finding the least squares regression line.
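To make residuals and the sum of squared errors concrete, here is a small sketch with made-up numbers; np.polyfit stands in for the by-hand math Khan Academy walks through.

import numpy as np

# Made-up example points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit with deg=1 finds the least squares regression line for us.
slope, intercept = np.polyfit(x, y, deg=1)
predictions = slope * x + intercept

residuals = y - predictions              # how far each point is from the line
sum_of_squares = np.sum(residuals ** 2)  # the quantity least squares minimizes
print(slope, intercept, sum_of_squares)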

Much of this is done via the computing power of libraries like NumPy, but it's important to understand what is happening.

For example, we can tell the quality of the regression line by finding the standard error of the regression, which is essentially the standard deviation of the actual data around the regression line.

We can also use the root-mean-square deviation, better known as the root-mean-square error, as a measurement of how good our prediction model is.
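Reusing the made-up points and predictions from the sketch above, the root-mean-square error is just one line with scikit-learn:

from sklearn.metrics import mean_squared_error

# Square root of the mean squared residual; lower means the line fits better.
rmse = np.sqrt(mean_squared_error(y, predictions))
print(rmse)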

A note on missing data: we want to avoid simply deleting data when it is missing, because there is still a lot of value in it. If a row is missing a single column value we should not just delete the row; that would be a last resort. We should instead replace the missing value with a best guess as to what it would be. Replacing NaN or NA values in our data with the column's median, or even its mean, is the usual suggestion.
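A minimal sketch of that, again on the hypothetical house_df:

# Replace missing prices with the column's median rather than deleting rows.
house_df['price'] = house_df['price'].fillna(house_df['price'].median())

# Or fill every numeric column with its own median in one pass.
house_df = house_df.fillna(house_df.median(numeric_only=True))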

Much ado about data; let's get to the Machine Learning!

FreeCodeCamp.org is an amazing resource with many tutorials, including this wonderful FREE video tutorial on Machine Learning with scikit-learn.

We are building a linear regression model to predict a 'dependent' variable. We will split our data and then train on part of it. We drop the dependent variable from the collective data and store it in a y variable, so that we have X (the rest of the dataset) to predict y (the dependent variable).

As an aside, we can also use linear regression with categorical data, for example ethnicity, car make, or city/country. This is not numeric, so Machine Learning cannot process it directly, but it CAN use a trick and turn it into binary columns, so to speak. By replacing these categories with numerical vectors we can process them in our computational regression. The predominant technique is called 'one hot encoding', and pandas and sklearn have methods built in to handle it for you, even for categorical data that has a logical order to it. You may use pandas.get_dummies() to the same effect, though sklearn has additional features that will be useful.
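Here is a sketch of one hot encoding both ways; the tiny 'origin' column is made up for illustration.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column.
cars = pd.DataFrame({'origin': ['usa', 'europe', 'japan', 'usa']})

# pandas: the column is replaced by 0/1 indicator columns
# (origin_europe, origin_japan, origin_usa).
encoded = pd.get_dummies(cars, columns=['origin'])
print(encoded)

# sklearn: same idea, with extras such as ignoring categories unseen at fit time.
encoder = OneHotEncoder(handle_unknown='ignore')
one_hot = encoder.fit_transform(cars[['origin']])
print(one_hot.toarray())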

So we will use this tool to help us build a linear regression model:

from sklearn.model_selection import train_test_split

This will be the input in your Python 3 environment, preferably a Jupyter notebook. With this tool imported, you will take your pandas DataFrame, drop the column of the variable you want to predict, and set the result equal to X.

X = autoCVS_DataFrame.drop(columns='mpg')

Now you will pull out this variable and assign it to y.

y = autoCVS_DataFrame[['mpg']]

Now we will call train_test_split with our X and y.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

This gives us sub-DataFrames for training our Machine Learning model and separate ones for testing it.

from sklearn.linear_model import LinearRegression

Now we can use LinearRegression() from the sklearn library, create a new instance of it, and fit it with our data:

RE = LinearRegression()
RE.fit(X_train, y_train)

This will give us a "best fit" line that accommodates all of the data we fed into the instance of sklearn's LinearRegression.

Let's look at the intercept with intercept = RE.intercept_[0], which will give us -18.283451116372067.

Now from this we can print out our coefficients and piece together our model:

Coefficients used to build the model
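The coefficients in the original post were shown as a screenshot; here is a sketch of how they can be printed next to their column names (the rounding is mine):

# RE.coef_ has one row because y was a one-column DataFrame.
for name, coef in zip(X_train.columns, RE.coef_[0]):
    print(name, round(coef, 3))
print('intercept:', RE.intercept_[0])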

Our model looks to be: mpg ≈ -0.3(cylinders) + 0.02(displacement) - 0.02(horsepower) - 0.007(weight) + 0.06(acceleration) + 0.83(model) - 18.283451116372067

And there we have it. Our Linear Regression model.
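As a final sketch, the held-out test split from train_test_split can tell us how the model does on data it never saw:

import numpy as np
from sklearn.metrics import mean_squared_error

predictions = RE.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))  # typical size of the error
r_squared = RE.score(X_test, y_test)                     # 1.0 would be a perfect fit
print(rmse, r_squared)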

More FREE resources for linear regression
