A CART to pull through more complicated problems.

Randy Taylor
12 min read · Mar 21, 2021

Linear Regression, the simplest of Machine Learning techniques, is beautiful and works really well. Based on continuous numeric inputs it is easy and understandable, like a fundamental truth. Logistic Regression (classification) is similarly simple and works equally well by substituting a category for a 1 or a 0. We can no longer model the data with a line; instead we use a sigmoid function (which looks like an S on the Cartesian coordinate system) that pushes the prediction toward one of the two numbers in question. How very computer of it to do so. It reminds you that everything is a 1 or a 0 to the computer. However, there has to be more to the mystery of the sphinx. Don’t we need a more complex model to solve exceedingly complicated problems? As we dive into deeper and ever more complicated problems, our model approaches the abyss of complexity. Why yes, sphinx, I see you now. How will we solve this mystery?

Let’s step back and review. Linear/Logistic Regression fits a line/plane/sigmoid to the data. In turn you can use this line/plane/sigmoid to predict what would be (e.g. price, cooking time) based on what is (e.g. house size, kind of meat). Meaning if you had a 1 lb cut of beef and data on cooking times for different types of meat, you could predict how long to cook it. For more sophisticated problems this 1:1 relationship no longer sufficiently describes the real world. As too many variables (what we call features in Machine Learning) clunk up the data processing, this simple model fails. Let’s look at a classic problem that is far too complex for Linear Regression and far too important to leave unanswered: if a person has a tumor, can we predict whether it is malignant or benign? I’d like to digress for a moment to thank Dr. Savard for being one of the hardest professors I encountered in college (I was one of 3 who got an A in his class), and also for stating that if you “really want to help people, go into Engineering.” I’m paraphrasing, and I’m sure he has passed away by now, so RIP.

With all of the features such as tumor size, density, the person’s age and gender, whether they were a smoker, etc., a line/plane/sigmoid ceases to be predictive. How do all of these features contribute to the end target? Decision Trees are one way to conquer this issue, and they are easy enough to understand. With binary decision trees you divide the data into two groups based on a criterion, all the way down, and each split is based on a different criterion. It could be male/female, tumor size bigger than X, or the density of said tumor. Whichever split gives us the best initial improvement is where we start to divide, and that becomes our root node. From this boolean value you split the data into two parts: the tree has a True branch and a False branch. Now we have two subsets of the data. We split these subsets again and again on different criteria, all the way down. We will need a measure of what counts as an improvement at each step down.

The classic measure is Gini impurity (this blog post is excellent on Gini). If the subgroups’ Gini impurity is better (lower) than the Gini impurity of the original dataset, the split was an improvement. This measure is to decision trees what the banking system is to financial markets: the method needs it to even exist. Without a way to measure whether bifurcating the data into two subgroups was beneficial for prediction, this method would not be possible.
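To make this concrete, here is a minimal sketch (my own helper functions, not from any library) of how Gini impurity can be measured for a group of labels and for the two subgroups a split creates:

import numpy as np

def gini(labels):
    # Gini impurity = 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def weighted_gini(left, right):
    # Impurity of a split: each subgroup's Gini weighted by its share of the samples
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy example: the split is an improvement because the weighted impurity drops
parent = np.array([0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 1]), np.array([1, 1, 1])
print(gini(parent))                # ~0.44
print(weighted_gini(left, right))  # ~0.22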

We do a lot of setup in this blog. I think that reflects something I have noticed: if you get your ideas together first, there are fewer errors once you actually start coding. This is also much like the EDA and subsequent data cleansing that Machine Learning needs. Now that we’ve talked about the metaphysics, let’s get to the good stuff. HOW?

Let’s import a dataset on heart attacks from the UC Irvine Machine Learning Repository, clean it up, do some EDA visualization, and then split the data for testing and training.

read the csv dataset

This is a dataset on people who had heart attacks. A target of 1 indicates they had one and a 0 indicates they did not.
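The loading code isn’t captured above, but it would look roughly like the sketch below. The file name heart.csv, the target column name, and the 70/30 split are assumptions, not necessarily what was used:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('heart.csv')      # assumed file name for the UCI heart data
X = df.drop(columns=['target'])    # features: age, sex, cp, chol, ...
y = df['target']                   # 1 = had a heart attack, 0 = did not

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=1)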

This is the initial decision tree

The first decision tree produced by sklearn’s DecisionTreeClassifier did okay, at 74% accuracy on the test data. However, the 100% accuracy on the training data tells us a lot.

It tells us the model fits the training data so tightly that it overfits.
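The code for that first tree isn’t shown above; a minimal sketch (the random_state is an assumption) looks like this:

from sklearn.tree import DecisionTreeClassifier

# An unconstrained tree keeps splitting until it has essentially memorized the training data
myTree = DecisionTreeClassifier(random_state=1)
myTree.fit(Xtrain, ytrain)

print('Training accuracy:', myTree.score(Xtrain, ytrain))  # 1.00 => overfitting
print('Test accuracy:', myTree.score(Xtest, ytest))        # ~0.74 in this run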

Let’s graph the tree just to look at it before we modulate hyperparameters.

from sklearn import tree
import matplotlib.pyplot as plt

# Plot the fitted tree with labeled nodes and black arrows between them
fea = list(X.columns)
feature_names = fea
plt.figure(figsize=(20,30))
out = tree.plot_tree(myTree, feature_names=fea, filled=True, fontsize=9, node_ids=True, class_names=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()

This is a very deep decision tree, and overfitting, as we can guess from the 100% accuracy on training data. So deep we couldn’t even capture all of it here. We could export it as its own file (a sketch of that follows the printed tree below). However, we can also just print it:

print(tree.export_text(myTree,feature_names=fea,show_weights=True))#=>
|--- cp <= 0.50
| |--- exang <= 0.50
| | |--- ca <= 0.50
| | | |--- thal <= 2.50
| | | | |--- thalach <= 96.50
| | | | | |--- weights: [1.00, 0.00] class: 0
| | | | |--- thalach > 96.50
| | | | | |--- weights: [0.00, 17.00] class: 1
| | | |--- thal > 2.50
| | | | |--- restecg <= 0.50
| | | | | |--- weights: [2.00, 0.00] class: 0
| | | | |--- restecg > 0.50
| | | | | |--- age <= 41.00
| | | | | | |--- weights: [1.00, 0.00] class: 0
| | | | | |--- age > 41.00
| | | | | | |--- weights: [0.00, 3.00] class: 1
| | |--- ca > 0.50
| | | |--- trestbps <= 109.00
| | | | |--- chol <= 233.50
| | | | | |--- weights: [0.00, 2.00] class: 1
| | | | |--- chol > 233.50
| | | | | |--- weights: [1.00, 0.00] class: 0
| | | |--- trestbps > 109.00
| | | | |--- weights: [16.00, 0.00] class: 0
| |--- exang > 0.50
| | |--- thalach <= 166.50
| | | |--- thalach <= 106.50
| | | | |--- oldpeak <= 1.00
| | | | | |--- weights: [0.00, 1.00] class: 1
| | | | |--- oldpeak > 1.00
| | | | | |--- weights: [4.00, 0.00] class: 0
| | | |--- thalach > 106.50
| | | | |--- weights: [48.00, 0.00] class: 0
| | |--- thalach > 166.50
| | | |--- chol <= 238.00
| | | | |--- weights: [0.00, 1.00] class: 1
| | | |--- chol > 238.00
| | | | |--- weights: [2.00, 0.00] class: 0
|--- cp > 0.50
| |--- oldpeak <= 2.10
| | |--- trestbps <= 176.00
| | | |--- age <= 56.50
| | | | |--- trestbps <= 119.00
| | | | | |--- age <= 46.50
| | | | | | |--- weights: [0.00, 8.00] class: 1
| | | | | |--- age > 46.50
| | | | | | |--- age <= 50.00
| | | | | | | |--- weights: [3.00, 0.00] class: 0
| | | | | | |--- age > 50.00
| | | | | | | |--- weights: [0.00, 3.00] class: 1
| | | | |--- trestbps > 119.00
| | | | | |--- weights: [0.00, 52.00] class: 1
| | | |--- age > 56.50
| | | | |--- sex <= 0.50
| | | | | |--- thal <= 2.50
| | | | | | |--- weights: [0.00, 12.00] class: 1
| | | | | |--- thal > 2.50
| | | | | | |--- weights: [1.00, 0.00] class: 0
| | | | |--- sex > 0.50
| | | | | |--- chol <= 253.00
| | | | | | |--- ca <= 0.50
| | | | | | | |--- age <= 65.50
| | | | | | | | |--- weights: [0.00, 9.00] class: 1
| | | | | | | |--- age > 65.50
| | | | | | | | |--- trestbps <= 154.00
| | | | | | | | | |--- weights: [1.00, 0.00] class: 0
| | | | | | | | |--- trestbps > 154.00
| | | | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | | |--- ca > 0.50
| | | | | | | |--- thalach <= 148.00
| | | | | | | | |--- weights: [0.00, 2.00] class: 1
| | | | | | | |--- thalach > 148.00
| | | | | | | | |--- weights: [2.00, 0.00] class: 0
| | | | | |--- chol > 253.00
| | | | | | |--- trestbps <= 119.00
| | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | | |--- trestbps > 119.00
| | | | | | | |--- weights: [6.00, 0.00] class: 0
| | |--- trestbps > 176.00
| | | |--- weights: [2.00, 0.00] class: 0
| |--- oldpeak > 2.10
| | |--- chol <= 239.50
| | | |--- weights: [6.00, 0.00] class: 0
| | |--- chol > 239.50
| | | |--- age <= 64.50
| | | | |--- weights: [0.00, 3.00] class: 1
| | | |--- age > 64.50
| | | | |--- weights: [1.00, 0.00] class: 0
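If you would rather export the tree as its own file, one option (a sketch, assuming the graphviz tool is available to render the .dot file) is export_graphviz:

from sklearn.tree import export_graphviz

# Writes the tree to a .dot file; render it with e.g. `dot -Tpng myTree.dot -o myTree.png`
export_graphviz(myTree, out_file='myTree.dot', feature_names=fea,
                filled=True, class_names=['0', '1'])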

At every split we divide the data on a different feature. We can grab the feature importances from our model and parse out the relative importance of each one.

import numpy as np

# Sort the features by importance and plot them as a horizontal bar chart
importances = myTree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
cp (chest pain) was the biggest factor, and the root split was made on it

This is the classical problem with decision trees, and we will have to pre-prune and post-prune to improve its predictability.

Let’s pre-prune our tree and add some hyperparameters to limit how big it gets. This will prevent overfitting and thus make the tree more predictive once we take the training wheels off.

Pre-prune tree
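The pre-pruned tree’s code isn’t captured above; the idea is sketched below. The specific values of max_depth and min_samples_leaf here are illustrative assumptions, not the exact ones used:

from sklearn.tree import DecisionTreeClassifier

# Cap how large the tree can grow instead of letting it memorize the training data
prunedTree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=1)
prunedTree.fit(Xtrain, ytrain)

print('Training accuracy:', prunedTree.score(Xtrain, ytrain))
print('Test accuracy:', prunedTree.score(Xtest, ytest))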

Accuracy is the measure of what the model got right out of everything it predicted.

Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP is true positives (predicted positive and actually positive), TN is true negatives (predicted negative and actually negative), FP is false positives (predicted positive but actually negative), and FN is false negatives (predicted negative but actually positive).

Recall is the fraction of what was actually positive that the model predicted as positive: Recall = TP / (TP + FN). In the pre-pruned model our recall was better too.
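As a quick hypothetical illustration of the difference between the two measures, say a model produces 40 true positives, 35 true negatives, 5 false positives, and 10 false negatives:

TP, TN, FP, FN = 40, 35, 5, 10               # made-up counts, just for illustration

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 75 / 90 ≈ 0.83
recall = TP / (TP + FN)                      # 40 / 50 = 0.80
print(accuracy, recall)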

Our accuracy improved on the test set (about 2% better), and the training accuracy is no longer a perfect 100%, so you can see the model is not overfitting.

Let’s optimize.

What is nice is that sklearn provides a built-in method that lets us supply ranges of hyperparameters. With GridSearchCV we can test all combinations of hyperparameters against each other and return the best one. It’s important to note we are training a decision tree for every possible combination, so the work grows multiplicatively with every parameter we add: with the grid below that is 6 × 3 × 4 × 3 × 2 × 2 = 864 combinations, each fit 5 times for cross-validation.

from sklearn import metrics
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth': np.arange(4, 10),
              'min_samples_leaf': [7, 10, 15],
              'max_leaf_nodes': [10, 15, 20, 25],
              'min_impurity_decrease': [0.0001, 0.001, 0.01],
              'criterion': ['entropy', 'gini'],
              'splitter': ['best', 'random'],
              }
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(myTree, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(Xtrain, ytrain)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
XX = estimator.fit(Xtrain, ytrain)

This took about a minute to train and now we will test it and print out the confusion matrix.
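The evaluation code isn’t shown, but a sketch of it, using the estimator the grid search found, would be something like:

from sklearn import metrics

ypred = estimator.predict(Xtest)

print('Test accuracy:', metrics.accuracy_score(ytest, ypred))
print('Test recall:', metrics.recall_score(ytest, ypred))
print(metrics.confusion_matrix(ytest, ypred))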

Improvement on recall and about the same on accuracy.

It’s important to note that different measures of correctness have different implications. Recall is very important when you are predicting something critical, such as whether a tumor is cancerous or whether a violent offender will reoffend. Optimizing for recall ensures that most of the actually-True values are labeled True.

print(tree.export_text(XX, feature_names=fea, show_weights=True))
|--- cp <= 0.36
| |--- exang <= 0.92
| | |--- sex <= 0.53
| | | |--- weights: [3.00, 11.00] class: 1
| | |--- sex > 0.53
| | | |--- ca <= 0.15
| | | | |--- weights: [4.00, 10.00] class: 1
| | | |--- ca > 0.15
| | | | |--- weights: [14.00, 1.00] class: 0
| |--- exang > 0.92
| | |--- restecg <= 0.71
| | | |--- weights: [31.00, 0.00] class: 0
| | |--- restecg > 0.71
| | | |--- weights: [23.00, 2.00] class: 0
|--- cp > 0.36
| |--- sex <= 0.62
| | |--- weights: [1.00, 35.00] class: 1
| |--- sex > 0.62
| | |--- oldpeak <= 0.72
| | | |--- chol <= 219.78
| | | | |--- weights: [0.00, 11.00] class: 1
| | | |--- chol > 219.78
| | | | |--- weights: [6.00, 24.00] class: 1
| | |--- oldpeak > 0.72
| | | |--- weights: [15.00, 21.00] class: 1

We pre-pruned our tree to get a minimal increase in accuracy/recall. We will now try post-pruning with cost-complexity-pruning.

Cost-complexity pruning takes a different approach: we intentionally build an overfitting tree and then start to remove sub-trees within it. Each sub-tree we remove increases the error of the tree, because the overfitting tree was perfect on the training data. We compare the error of the new tree (with the sub-tree removed) against the original tree, and we do this every time a sub-tree is removed. When we remove a sub-tree we record what is called alpha, which is the amount of error introduced into the tree. We then remove the sub-tree that adds the most value (the least increase in error), and recursively start from the top and do it all over.

This lets us build a path of decision trees and their alpha values. We can plot this path against either accuracy or recall, and then find the optimum alpha where the increase in error to the tree buys the best increase in accuracy or recall.

treeT = DecisionTreeClassifier(random_state=1)
path = treeT.cost_complexity_pruning_path(Xtrain, ytrain)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
alpha vs gini impurity
clfs = []
# Fit one tree per alpha along the pruning path
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(Xtrain, ytrain)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
train_scores = [clf.score(Xtrain, ytrain) for clf in clfs]
test_scores = [clf.score(Xtest, ytest) for clf in clfs]
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ',best_model.score(Xtrain, ytrain))
print('Test accuracy of best model: ',best_model.score(Xtest, ytest))
recall_train = []
# Recall on the training set for every candidate tree
for clf in clfs:
    pred_train3 = clf.predict(Xtrain)
    values_train = metrics.recall_score(ytrain, pred_train3)
    recall_train.append(values_train)

recall_test = []
# Recall on the test set for every candidate tree
for clf in clfs:
    pred_test3 = clf.predict(Xtest)
    values_test = metrics.recall_score(ytest, pred_test3)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
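The tree text below was presumably produced the same way as before, with export_text on the best model:

print(tree.export_text(best_model, feature_names=fea, show_weights=True))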
Best fitting tree for recall
|--- cp <= 0.50
| |--- exang <= 0.50
| | |--- ca <= 0.50
| | | |--- weights: [4.00, 20.00] class: 1
| | |--- ca > 0.50
| | | |--- weights: [17.00, 2.00] class: 0
| |--- exang > 0.50
| | |--- weights: [54.00, 2.00] class: 0
|--- cp > 0.50
| |--- weights: [22.00, 91.00] class: 1
