
Feature engineering for Machine Learning

Published: at 08:58 AM

Feature engineering plays a crucial role in machine learning algorithms. Feature engineering methods can be applied to numerical, categorical, and textual data. Some of the key benefits of feature engineering are:

  1. Improved performance: through feature engineering, you can help the ML algorithm detect patterns more easily. For example, in a model that tries to predict the prison sentence of a guilty person given the crime committed, a good feature could be a flag identifying adults (age > 17), since juveniles receive reduced sentences for the same crime (see the sketch after this list);

  2. Improved model interpretability: by transforming features or creating new ones, you can capture meaningful patterns and relationships in the data. This can support the understanding of the underlying data;

  3. Handling categorical and textual data: the majority of popular machine learning algorithms require numerical input, while in real life many features are categorical or textual. Feature engineering enables us to use such data;

  4. Handling non-linear relationships: feature engineering can help capture non-linear relationships.
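
As a minimal sketch of the first point, assuming a pandas DataFrame with an illustrative age column, the adult flag from the example could be derived like this:

import pandas as pd

# toy data; the column name "age" is illustrative
df = pd.DataFrame({"age": [15, 17, 18, 34, 70]})

# binary feature flagging adults (age > 17)
df["is_adult"] = (df["age"] > 17).astype(int)
print(df)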

Numerical features

Discretization / Binning

Discretization is the method of dividing continuous numerical variables into discrete intervals or bins. This method is great for dealing with non-linear relationships or outliers. An example of discretization is identifying a person’s generation based on their age.
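
As a minimal sketch, assuming a pandas Series of ages and arbitrary, illustrative bin edges, binning can be done with pd.cut:

import pandas as pd

ages = pd.Series([12, 25, 41, 67, 80])

# the bin edges and labels below are illustrative, not canonical generations
bins = [0, 18, 35, 55, 120]
labels = ["child", "young adult", "middle-aged", "senior"]
age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group)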


Scaling and Normalization

Scaling and normalization methods ensure that numerical features are on a similar scale. This is extremely important when you are using algorithms that rely on distances between data points or on gradient-based optimization, such as k-nearest neighbors, support vector machines, linear and logistic regression, k-means clustering, PCA, and neural networks. Tree-based models, on the other hand, are invariant to feature scaling.

Common methods for feature scaling are:

  * Z-score standardization: x_scaled = (x - mean) / std

  * Min-max scaling: x_scaled = (x - min) / (max - min)

  * Max-abs scaling: x_scaled = x / max(|x|)

  * Robust scaling (IQR): x_scaled = (x - median) / IQR

Feature scaling is a step that can greatly improve the performance of machine learning models. However, it is important to note that not all machine learning models require feature scaling (tree-based models, for example). The point of scaling the numerical features to a similar range is that features with larger ranges don’t dominate the decision making simply because of their scale.
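
As a minimal sketch, assuming scikit-learn’s built-in scalers and a toy DataFrame, the four methods above map to StandardScaler, MinMaxScaler, MaxAbsScaler, and RobustScaler:

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler

df = pd.DataFrame({"age": [22, 35, 47, 63], "hours-per-week": [20, 40, 45, 80]})

# each scaler is fitted on the data and returns the rescaled values
print(StandardScaler().fit_transform(df))   # z-score standardization
print(MinMaxScaler().fit_transform(df))     # min-max scaling to [0, 1]
print(MaxAbsScaler().fit_transform(df))     # divides by the max absolute value per column
print(RobustScaler().fit_transform(df))     # (x - median) / IQR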

Feature encoding

Feature encoding is the process of transforming categorical or textual data to numerical representations so that they can be utilized by machine learning algorithms. We will go through the most common feature encoding methodologies.

Categorical data

The most common methods for encoding categorical data are:

  * One-hot encoding: each category becomes its own binary column;
  * Label encoding: each category is mapped to an arbitrary integer;
  * Ordinal encoding: categories are mapped to integers that respect a natural ordering (e.g. education level);
  * Frequency (count) encoding: each category is replaced by how often it appears in the data;
  * Target encoding: each category is replaced by a statistic of the target variable (e.g. its mean) within that category;
  * Hashing encoding: categories are hashed into a small, fixed number of numeric columns.
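
As a minimal sketch of the two methods used later in this post, assuming a toy DataFrame, one-hot and ordinal encoding can be done directly in pandas:

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "L", "M"]})

# one-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# ordinal encoding: map categories to integers that respect their order
size_order = {"S": 0, "M": 1, "L": 2}
df["size_encoded"] = df["size"].map(size_order)

print(pd.concat([df, one_hot], axis=1))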

Textual data

There are multiple methods for transforming text data into a numerical representation. We will go from the simplest to the most complex methods. It is important to note that we don’t go through all of them, because this is an active area of research and there are many methods one could use.

  * Bag-of-words (count vectors): each document becomes a vector of word counts;

  * TF-IDF: word counts down-weighted for words that appear in many documents, so that rare, informative words stand out;

  * Word embeddings (e.g. Word2Vec, GloVe): each word is mapped to a dense vector that captures semantic similarity;

  * Contextual embeddings (e.g. BERT and other transformer-based models): each word’s vector depends on the context it appears in.

If you are interested in investigating text representation further, you could examine character-level representation (where each character is one-hot encoded) and subword-level representation.
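
As a minimal sketch of the first two methods, assuming scikit-learn’s text vectorizers and a toy corpus:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]

# bag-of-words: raw word counts per document
counts = CountVectorizer().fit_transform(corpus)

# TF-IDF: counts reweighted by how rare each word is across documents
tfidf = TfidfVectorizer().fit_transform(corpus)

print(counts.toarray())
print(tfidf.toarray())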

Python implementation

In this section we will go through a practical example illustrating how feature transformations can be applied and how they can impact the performance of our model. We will use the Adult (census income) dataset, where the task is to predict whether a person earns more than $50K per year; it can be downloaded from the UCI Machine Learning Repository.

We will start by importing the required modules:

import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegression

The next step is to download the data and keep the features we think are useful:

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# the raw file has 15 columns; name them all, then keep the ones we think are useful
all_columns = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
               "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
               "hours-per-week", "native-country", "income"]
column_names = ["age", "workclass", "education", "sex", "race", "occupation", "hours-per-week", "income"]
data = pd.read_csv(url, header=None, names=all_columns)
data = data[column_names]
data["income"] = (data["income"].str.strip() == ">50K").astype(int)  # binary target: 1 if income > 50K

Let’s write some functions to help us evaluate the transformations faster:

def set_train_test(data):

    # Split the dataset into features and target variable
    X = data.drop('income', axis=1)  # Features
    y = data['income']  # Target variable

    # Split the dataset into training and testing subsets
    return train_test_split(X, y, test_size=0.2, random_state=42)

def evaluate_solution(data):
    
    train_x, test_x, train_y, test_y = set_train_test(data)
    
    model = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges on the one-hot encoded data
    model.fit(train_x, train_y)

    # predict probabilities on the test set
    probs = model.predict_proba(test_x)

    # adjust the predicted labels based on a custom decision threshold
    custom_threshold = 0.3
    y_pred = (probs[:, 1] >= custom_threshold).astype(int)

    # Calculate evaluation metrics
    f1 = f1_score(test_y, y_pred)
    precision = precision_score(test_y, y_pred)
    recall = recall_score(test_y, y_pred)

    # Print the evaluation metrics
    print("F1 score:", f1)
    print("Precision:", precision)
    print("Recall:", recall)

Having the data loaded, let’s take a high-level look at it to see which transformations we can apply.

data.info()

We can see we have only two numeric variables and the rest are categorical.

For the first experiment we will transform the categorical variables using one-hot encoding to form a baseline solution.

one_hot = ["workclass", "education", "race", "occupation", "sex"]
# apply one-hot encoding to the categorical variables
one_hot_data = pd.get_dummies(data[one_hot], prefix=one_hot)
encoded_data = pd.concat([data, one_hot_data], axis=1).drop(one_hot, axis=1)
evaluate_solution(encoded_data)

Let’s apply a transformation to the numeric variables and see if we can achieve better results.

# apply a Yeo-Johnson power transform to reduce skew in the numeric features
scale_negative = ["age", "hours-per-week"]
for s in scale_negative:
    scaled_data, _ = stats.yeojohnson(data[s].values)
    encoded_data[s] = pd.Series(scaled_data.flatten())
    
evaluate_solution(encoded_data)

It turns out there is some benefit from scaling our data: recall gets a slight improvement.

Finally, let’s apply one more transformation to the categorical data and encode one of the features, education, as ordinal.

# ordinal encoding for education level (the leading spaces match the raw values in the dataset)
mapping_encoding = {' Preschool': 0,
                     ' 1st-4th': 1,
                     ' 5th-6th': 2,
                     ' 7th-8th': 3,
                     ' 9th': 4,
                     ' 10th': 5,
                     ' 11th': 6,
                     ' 12th': 7,
                     ' HS-grad': 8,
                     ' Prof-school': 9,
                     ' Assoc-acdm': 10,
                     ' Assoc-voc': 11,
                     ' Some-college': 12,
                     ' Bachelors': 13,
                     ' Masters': 14,
                     ' Doctorate': 15}

one_hot_ordinal = ["workclass", "race", "occupation", "sex"]


one_hot_data = pd.get_dummies(data[one_hot_ordinal], prefix=one_hot_ordinal)
encoded_data_ordinal = pd.concat([data, one_hot_data], axis=1).drop(one_hot_ordinal, axis=1)

for s in scale_negative:
    scaled_data, _ = stats.yeojohnson(encoded_data_ordinal[s].values)
    encoded_data_ordinal[s] = pd.Series(scaled_data.flatten())
    
encoded_data_ordinal["education"] = encoded_data_ordinal["education"].map(mapping_encoding)

evaluate_solution(encoded_data_ordinal)

You should note that the examples above are meant to illustrate the implementation and impact of transformations; you should not apply transformations and evaluate performance this naively. In a real setting, after applying the transformations you should also perform feature selection, hyperparameter tuning, threshold selection, model selection, and sensitivity analysis before concluding on the best model. Another note: transformations can be tricky. When you apply them to a new, unseen serving set, you should not call the “fit_transform” method but only the “transform” method (using sklearn lingo here; similar methods exist in other modules that conduct data transformations). A nice article illustrating the difference can be found here.
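
As a minimal sketch of that last point, assuming scikit-learn’s StandardScaler and toy training and serving DataFrames:

import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"age": [25, 38, 47, 52], "hours-per-week": [40, 50, 60, 20]})
new_data = pd.DataFrame({"age": [30, 61], "hours-per-week": [45, 35]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # fit AND transform on the training data
new_scaled = scaler.transform(new_data)     # only transform new / serving data with the already-fitted scaler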

Conclusion

In summary, data transformation plays a crucial role in optimizing model performance by addressing issues related to data representation, scaling, and distribution. Moreover, data transformation enables us to utilize categorical and textual data in our machine learning models. By utilizing appropriate data transformation techniques tailored to the specific characteristics of the dataset, we can enhance the effectiveness and reliability of our machine learning models, ultimately driving better insights and decision-making in various domains.