Discussion on Feature Engineering + Model Tuning

In Feature Engineering and Model Tuning we start with EDA, to ensure our data is structured in a way our machine learning algorithms can usefully learn from. There are things we should look for in order to pick out the most beneficial features and ignore the non-useful ones. You must consider the curse of dimensionality: as features are added, the data points either become sufficiently dense or sufficiently spread out, leaving large gaps between them, such that it becomes impossible to draw meaningful relationships. For example, if records are far apart from each other in a 10-feature space, what relationship should be drawn between them? Is it a parabola, a line, some other curve? It might not be clear at all, and an automated algorithm will most likely default to a straight line or plane. Further, what if we only have a handful of records spanning that space? This is what is known as the curse of dimensionality. Let's set up a game plan to aid in feature selection:

  1. We use visualizations and calculations to identify both outliers and missing values, since mishandling these values will skew the data. Missing values are usually best not dropped, as the remainder of the row still holds useful data. Typically the median or mode should be used as a replacement; the mean can also be used, but be careful that the mean is not skewed by outliers. Outliers must also be addressed. The easiest way is to 'trim' (drop) them from the data, which is typical for outliers that are obvious errors in data collection or entry. Capping can be used to replace outliers with a cutoff value, e.g. the value a set number of standard deviations from the mean; however, if there are too many outliers, you will change the distribution, which will no longer be normal after the change. The other option, if you do not want to lose the data, is winsorization: here you replace each outlier with the nearest value that is not an outlier. There are options for replacing missing values in Sklearn. SimpleImputer will replace all missing values with a prescribed mean, median, or mode. The more advanced option is the KNN imputer, where k, the number of neighbors, is a hyperparameter to tune. A good starting value for k is the square root of n (n being the number of records); however, as n becomes large, the time complexity grows and slows down training, so 10 is a safe default. KNN will take the k nearest records, average them, and generate the synthetic data for the replacement. Lastly, if a feature has a very high percentage of missing values, you could use that as a feature itself: create a new binary True/False column showing that the value is missing, letting the algorithm know.
  2. If a feature has too much variance and you have many outliers, it is possible you have multiple Gaussians. This means the dataset should be divided into multiple sub-datasets. For example, if you are building a model on vehicle data and you have trucks, SUVs, minivans, and two- and four-door sedans, your dataset may be too diverse to describe with one model, and you will need a separate model for each sub-dataset. A good indication of this is both too many outliers and too much variance in your dataset. Conversely, if a feature has too little variance, it will not help your model learn and you should just drop it.
  3. Through EDA, if you see a very high correlation between two features, you can drop one, because such features are essentially duplicates. I personally would put the threshold at a correlation of 0.92 to be high enough to drop. They might be describing the same thing with a different label; for example, I have seen age and years of experience correlated this highly, providing the same information in a different way.
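Steps 1 and 3 above can be sketched with scikit-learn and pandas. This is a minimal illustration on a toy frame; the column names and values are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy frame with missing values; columns are illustrative only.
df = pd.DataFrame({
    "age":        [25.0, 32.0, np.nan, 47.0, 51.0, 38.0],
    "experience": [2.0, 8.0, 10.0, 22.0, 27.0, np.nan],
    "salary":     [40e3, 55e3, np.nan, 90e3, 99e3, 70e3],
})

# Record missingness as its own binary feature before imputing.
df["salary_missing"] = df["salary"].isna().astype(int)

# SimpleImputer: replace NaNs with the column median (robust to outliers).
df[["salary"]] = SimpleImputer(strategy="median").fit_transform(df[["salary"]])

# KNNImputer: average the k nearest rows to synthesize each replacement.
df[["age", "experience"]] = KNNImputer(n_neighbors=2).fit_transform(
    df[["age", "experience"]]
)

# Step 3: drop one of any feature pair whose correlation exceeds 0.92.
if "experience" in df.columns and df["age"].corr(df["experience"]) > 0.92:
    df = df.drop(columns=["experience"])
```

After this runs, the frame has no missing values, and the `salary_missing` flag preserves the information that a value was originally absent.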

K-fold cross-validation is a technique that splits the data into K folds. In each iteration the model is trained on K-1 folds and tested on the remaining held-out fold; you iterate over all K folds so that each one serves as the test set exactly once. With this technique you average the accuracy across all K iterations.
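A minimal sketch of this with scikit-learn, using the bundled iris dataset and a logistic regression purely as placeholder choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: each pass trains on 4 folds and tests on the held-out fold,
# so every record is used for testing exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# The reported metric is the average accuracy across the 5 folds.
mean_accuracy = scores.mean()
```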

If you find imbalance in your dataset you must address it. Machine learning aims to learn from a dataset in order to produce a model that will be predictive for a target class. Unfortunately, more often than not our dataset is imbalanced, making our training less effective. This means that if we represent our target feature as a boolean, it is most often False. If we want to predict whether someone is a terrorist and the vast majority of our data is comprised of people who are not terrorists, we have fewer examples to learn what a terrorist looks like. Upsampling and downsampling are two techniques used to overcome this limitation. These techniques artificially suppress the majority class or synthesize artificial data for the minority class. Downsampling is limited by training on less data, and thus losing valuable information. Upsampling is limited by attempting to create data from the existing data, which may artificially strengthen correlations. Imblearn is a package that offers both upsampling and downsampling functionality. With Imblearn you can use SMOTE (Synthetic Minority Oversampling Technique), which uses KNN to artificially create new minority-class data points. The Tomek links (T-link) method is an alternative: it finds pairs of nearest neighbors belonging to opposite classes and drops the majority-class member of each pair. Conversely, we can use Imblearn for undersampling, which shrinks the majority class to balance with the minority. The main issue with these is that when we have less data to train on, we lose predictive ability. This means we could improve prediction of the minority class while simultaneously decreasing the ability to predict the majority class.
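As a dependency-free sketch of upsampling, the following balances a toy dataset by resampling the minority class with replacement using scikit-learn. Note this is plain duplication-based upsampling, not SMOTE: imblearn's `SMOTE().fit_resample(X, y)` goes further and interpolates brand-new synthetic points between minority neighbors.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Imbalanced toy data: 90 majority (class 0) vs 10 minority (class 1) rows.
X_majority = rng.normal(0.0, 1.0, size=(90, 2))
X_minority = rng.normal(3.0, 1.0, size=(10, 2))

# Naive upsampling: draw minority rows with replacement until balanced.
X_minority_up = resample(X_minority, replace=True, n_samples=90, random_state=0)

X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.array([0] * 90 + [1] * 90)
```

The trade-off described above applies directly: the duplicated minority rows carry no new information, so correlations present in those 10 original rows are artificially strengthened.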

If you find yourself in the curse of dimensionality, you are probably experiencing overfitting: linear models that perfectly fit the training data and then fail in testing. We can add a new parameter to the model, called the regularization parameter, that aims to shrink the effects of features on the overfit model. Once added, we will have less accuracy in training but much better accuracy in testing. There are two distinct types: Ridge, which penalizes the squared magnitude of the coefficients, pushing them toward zero, and Lasso, which penalizes the absolute magnitude of the coefficients, so that some become exactly zero and thus have no effect.
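The Ridge/Lasso contrast can be seen directly in scikit-learn. This is a synthetic example where, by construction, only the first two of ten features drive the target; the `alpha` values are illustrative choices, not tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
# Only the first two of ten features actually drive the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks every coefficient toward zero
lasso = Lasso(alpha=0.1).fit(X, y)  # L1: zeroes out unhelpful coefficients

# Ridge leaves all ten coefficients small but nonzero; Lasso sets
# several of the eight irrelevant ones to exactly zero.
n_zeroed = int(np.sum(lasso.coef_ == 0.0))
```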