
Traditional ML vs. LLMs in Loan Applications: A Comparative Analysis

By Ayoola Fakoya

Published on: February 3, 2024

Introduction:

The dominance of Large Language Models (LLMs) has reshaped industries, and the financial sector is no exception. In this exploration, we'll delve into the realm of loan applications, comparing a traditional machine learning model, HistGradientBoostingClassifier, with language-model tooling: DistilBERT, OpenAI's ChatGPT, and LangChain. Brace yourselves for an insightful journey through ethical considerations and the application of cutting-edge open-source tools.

Disclaimer: The discussion of Large Language Models (LLMs) in loan approval decisions is for educational purposes and raises ethical concerns.

Building a Predictive Loan Model:

Let's embark on a journey to construct a predictive loan model using open-source tools like SQLite, BERT, and LangChain, alongside ChatGPT. To access the code, visit the GitHub repository: Loan App Experiment.

Understanding the tools: Gain insights into the tools at our disposal:

BERT (Bidirectional Encoder Representations from Transformers): BERT is a machine learning model developed by Google for natural language processing tasks. Unlike models that read a sentence in one direction only, BERT takes into account the full context of a word by looking at the words that come both before and after it. This bidirectional approach allows BERT to capture the nuances of language more effectively. BERT has been pre-trained on a large corpus of text and can be fine-tuned for specific tasks such as sentiment analysis, question answering, and named entity recognition, making it a versatile tool in the field of AI.
DistilBERT: DistilBERT is a smaller, faster, cheaper, and lighter version of BERT. Developed by Hugging Face, it is trained with knowledge distillation: a compact student model learns to reproduce the behaviour of the larger BERT teacher while predicting masked words (BERT's next sentence prediction objective is dropped). Despite being 40% smaller, DistilBERT retains over 95% of BERT's performance on the GLUE benchmark, making it more efficient for deployment in production environments or on devices with limited computational resources. It's particularly useful for tasks that require real-time processing, such as on mobile devices or in web applications. DistilBERT, like BERT, can be fine-tuned for a wide range of natural language processing tasks.
LangChain: LangChain is a framework that simplifies the development of applications using large language models (LLMs). It provides a standard interface for chains and a variety of integrations with other tools. LangChain allows developers to build applications that can dynamically respond to data.
Agents are systems that use a language model to interact with other tools. They can be used for tasks such as grounded question/answering, interacting with APIs, or taking action. Unlike predetermined chains of commands, agents in LangChain can dynamically adapt their behavior, making them suitable for a wide range of tasks and interactions. They have access to a suite of tools that they can utilize to perform tasks. The core idea of agents is to use an LLM to choose a sequence of actions to take. In chains, a sequence of actions is hardcoded (in code).
HistGradientBoostingClassifier: A scikit-learn gradient boosting classifier that bins feature values into histograms, which makes it fast on large datasets, and that supports missing values natively.
ChatGPT: Developed by OpenAI, a powerful language model adept at generating human-like text based on context and past interactions.

The Role of LLMs in Financial Organizations:

Large Language Models (LLMs) offer numerous advantages within financial institutions, including handling unstructured data, enhancing customer service, efficient document processing, risk assessment, regulatory compliance, and improving productivity.

Predictive Analytics with LLMs:

LLMs can be powerful tools in predictive analytics for loans: they can help score applicants, identify patterns, and deliver personalized product recommendations in ways that complement traditional models.

Challenges of LLM: Overcoming Cost, Bias, and Data Volume:

Implementing LLMs for loan prediction presents challenges such as high costs, bias concerns, and massive data requirements. A pragmatic approach involves leveraging model services or open-source models initially, with subsequent data accumulation over time. Direct Preference Optimization (DPO) emerges as an effective methodology to align LLMs with human preferences, simplifying the process for efficiency and stability.

What is DPO:

Direct Preference Optimization (DPO) is a method for fine-tuning large language models (LLMs) to align with human preferences. Unlike the traditional approach, reinforcement learning from human feedback (RLHF), which involves training a reward model and then using a reinforcement learning algorithm to align the language model's output with human preferences, DPO simplifies this process. It uses a relationship between reward functions and optimal policies, allowing the constrained reward maximization problem to be solved in a single policy training phase. The resulting algorithm is stable, efficient, and computationally lightweight compared with RLHF.
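
To make the single-phase idea concrete, here is a minimal sketch of the per-example DPO loss in plain Python. The function names, argument names, and the default beta of 0.1 are illustrative assumptions rather than part of the experiment's code. Each argument is the total log-probability that the trainable policy or the frozen reference model assigns to the preferred ("chosen") or dispreferred ("rejected") response.

```python
import math


def _softplus(x):
    # log(1 + exp(x)) computed in a way that avoids overflow for large |x|
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio)))."""
    # How much more (in log space) the policy likes each response than the reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin)
    return _softplus(-margin)


# Illustrative values only: the policy favours the chosen response more than the reference
example_loss = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

When the policy matches the reference exactly, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability towards the chosen response relative to the reference.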

A related method, Direct Preference-based Policy Optimization, also learns directly from preference data without the need for any reward modeling. This is particularly useful when it's challenging to formulate a reward function. These methods provide a more efficient and stable approach to fine-tuning, making models more adaptable and responsive to user preferences.

Loan Prediction Experiment:

Now, let's delve into a practical experiment. We start by importing necessary packages and libraries, including pandas, Hugging Face Transformers, seaborn, and LangChain.

We load our loan data, which includes features such as gender, marital status, education, credit history, loan amount, and property area, along with the loan approval status.

import pandas as pd

# Load data
df = pd.read_csv("loan_data.csv")

The data contains missing values, but instead of dropping rows, we opt for a tree-based algorithm capable of handling missing values effectively.

Next, we drop the irrelevant "Loan_ID" column and separate features and target variables:

# Drop Loan_ID and separate the features from the target variable
X = df.drop(["Loan_ID", "Loan_Status"], axis=1)
y = df["Loan_Status"]

To gain insights into the data, we create visualizations like heatmaps and scatterplots to understand the correlation between different features. Calculating the correlation matrix is crucial for comprehending the relationships between numeric features in the dataset.

# Create a heatmap of the correlation matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Correlate numeric columns only; string columns cannot be correlated
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.show()

Visualizations play a pivotal role in the data analysis and machine learning pipeline. The correlation matrix shows how each numeric feature relates to the others. A high correlation between two features signals potential redundancy, so one of them can often be removed during model development. Plotting the matrix as a heatmap makes these patterns easy to spot and informs decisions about feature engineering and model selection.

As you can see from the above heatmap output, there is a slight positive correlation between ApplicantIncome and LoanAmount. Let us explore it further.

Let's build a scatter plot:

# Create the scatter plot
plt.figure(figsize = (12, 4))
plt.title("Applicant Income vs. Loan Amount ")

plt.grid()
plt.scatter(X["ApplicantIncome"], X["LoanAmount"], c = "c")
plt.xlabel("Applicant Income")
plt.ylabel("Loan Amount")

plt.show()

Analyzing individual loan applicants, the majority fall within the income range of 0 to approximately 30,000, seeking loans up to about 200. While a few data points extend to higher incomes, there's no discernible pattern suggesting a direct correlation between increased income and higher loan amounts.

Let's delve deeper by generating a pair plot using seaborn to explore numerical features related to loan applicants. The pair plot combines scatter plots and histograms, offering a compact view of how the numeric features relate to one another.

Let's interpret the results of the pair plot:

Applicant Income Histogram:
- The histogram unveils the distribution of applicant income.
- Most applicants exhibit incomes below 20,000, with blue bars indicating approved loans ('Y') and orange bars representing denied loans ('N').
Applicant Income vs Coapplicant Income:
- A scatter plot comparing applicant income with coapplicant income.
- Data points cluster at the lower end, suggesting lower incomes for both applicants and coapplicants, without a clear correlation.
Applicant Income vs Loan Amount:
- The scatter plot suggests a mild increase in loan amount with higher applicant income.
- However, the correlation between these variables is not notably strong.
Coapplicant Income Histogram:
- Depicts the distribution of coapplicant income.
- Most coapplicants have incomes below 10,000.
Coapplicant Income vs Loan Amount:
- Scatter plot displaying loan amount against coapplicant income.
- Widespread scattering implies no discernible trend or correlation.
Loan Amount Histogram:
- Illustrates the distribution of loan amounts.
- The majority of loans fall below 200.

Feature Engineering and Model Training

In our quest for a robust loan prediction model, we navigate the essential steps of feature engineering and model training. Let's delve into the code:

Feature Engineering:

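from sklearn.preprocessing import LabelEncoder

# Identify categorical columns
categorical_cols = [
    "Gender",
    "Married",
    "Dependents",
    "Education",
    "Self_Employed",
    "Property_Area"
]

# Apply label encoding to binary categorical variables
label_encoder = LabelEncoder()
for col in categorical_cols:
    # nunique() ignores NaN, so a binary column with missing values still qualifies
    if X[col].nunique() == 2:
        # Encode only the known values; missing entries stay NaN for the model to handle
        known = X[col].notna()
        encoded = pd.Series(float("nan"), index=X.index)
        encoded.loc[known] = label_encoder.fit_transform(X.loc[known, col])
        X[col] = encoded

# Apply one-hot encoding to multi-category categorical variables
X = pd.get_dummies(X, columns=["Dependents", "Property_Area"], drop_first=False)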

The categorical_cols list serves as our compass, guiding us to identify the categorical columns in the dataset. Utilizing the LabelEncoder from sklearn.preprocessing, we encode binary categorical variables, converting non-numerical labels into numerical labels. Additionally, we employ pd.get_dummies for one-hot encoding, creating binary columns for each possible value in multi-category columns.

Model Training:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Fixed seed for reproducibility (value assumed here; any integer works)
SEED = 42

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Initialize and train a HistGradientBoostingClassifier model
model = HistGradientBoostingClassifier(random_state=SEED)
model.fit(X_train, y_train)

With our features engineered, we move on to training our model. We split the dataset into training and testing sets using train_test_split, holding out 30% of the rows for testing. Then, we introduce the HistGradientBoostingClassifier model, known for handling missing values natively, supporting categorical features, and capturing complex non-linear relationships.

This combination of tweaking features and training the model sets the stage for a thorough exploration of our loan prediction model. As we go forward, we expect to discover patterns and details in the data, laying the foundation for a strong and insightful financial forecasting tool.

Using my model, I can identify the features that play a significant role in decision-making. Model explainability is crucial to understanding why your model makes specific decisions. Here's an example:

The above shows the feature importance of various factors in a dataset related to loan approval or similar financial assessments. Credit History is by far the most crucial factor in this context, followed by Loan Amount and Applicant Income as significant but lesser factors. Other elements like property area type, education, marital status, number of dependents, gender and employment status have relatively low impact according to this representation. Reading feature importance directly from a HistGradientBoostingClassifier model is not possible, because it does not expose the feature_importances_ attribute like some other tree-based models do. However, you can use permutation importance as a workaround. Permutation importance works by shuffling one feature at a time and observing the effect on model performance.

Integrating DistilBERT and HistGradientBoostingClassifier: Bringing Clarity to Loan Approval Decisions

In the final stretch of our experiment, we seamlessly integrate DistilBERT and HistGradientBoostingClassifier to provide a comprehensive summary and rationale for loan approval decisions.

DistilBERT Integration:

from transformers import DistilBertForMaskedLM, DistilBertTokenizer, pipeline

# Initialise the DistilBERT tokenizer and masked-language model
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
bert_model = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# Define a fill-mask pipeline (it predicts the token hidden behind [MASK])
text_generator = pipeline("fill-mask", model=bert_model, tokenizer=tokenizer)

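Here, we set the stage by initializing the DistilBERT tokenizer and model. The text_generator pipeline uses the masked language model to complete text: given an input containing a [MASK] token, it returns the most likely tokens for that position rather than generating free-form text.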

Generating Loan Application Summary using BERT:

# Generate a loan application summary using BERT
input_text = f"""
    Applicant: Mr Ayoola Flames
    Income: ${sample_features['ApplicantIncome']}
    Credit History: {sample_features['Credit_History']}
    Loan Amount: ${sample_features['LoanAmount']}
    Property Area: {property_area_value} [MASK]
"""
bert_generated_text = text_generator(input_text)

Combining Model Predictions and BERT-Generated Text:

# Combine model predictions and BERT-generated text for loan approval decision
# (y_pred is assumed to hold the classifier's test-set predictions, e.g. model.predict(X_test))
if y_pred[sample_index] == "Y":
    loan_approval_decision = "Approved"
else:
    loan_approval_decision = "Denied"

Together, these snippets create a detailed loan application summary using BERT, incorporating essential applicant details and a masked value for the model to fill in, and pair it with the classifier's decision, as you can see in the output:


Applicant: Mr. Ayoola Flames
Income: $5417.0
Credit History: 0.0
Loan Amount: $143.0
Property Area: Urban Area

Model Prediction ('Y' = Approved, 'N' = Denied): N
Loan Approval Decision: Denied

The final piece of the puzzle combines ChatGPT and LangChain to build an agent that acts on natural language instructions. This results in a comprehensible loan approval decision.

Langchain and OpenAI ChatGPT Integration:

# Initialize an Agent with instructions for loan approval using Langchain and ChatGPT
agent_executor.run(
"""
    Use the loans_train table to build a loan approval model.
    The Loan_Status column is either 'Y' for 'Approved' or 'N' for 'Denied'.
    Use your model to determine if the following loan would be approved:
    Applicant: Mr Ayoola Flames
    Income: 5417.0
    Credit History: 0.0
    Loan Amount: 143.0
    Property Area: Semiurban
    Limit your reply to a short statement including the name of the customer, amount approved, income, and credit history.
"""
)

In this section, an Agent is activated through LangChain, guiding it to model the data and furnish specific details for a loan application. Behind the scenes, ChatGPT is employed to automatically generate SQL queries from natural language queries and then retrieve the necessary results. While the results are accurate, it's crucial to highlight that the Agent's response lacks detailed reasoning for the decision; it functions more as a black box, offering no explanation of why it arrived at its decision.
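
To give a feel for the SQL step, the sketch below loads a few made-up rows into an in-memory SQLite table named loans_train and runs the kind of aggregate query an agent might generate. The query text is a hypothetical example, not output captured from the agent, and the toy rows are illustrative only.

```python
import sqlite3

import pandas as pd

# Made-up rows for illustration; the real table holds the full training data
toy_loans = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
})

conn = sqlite3.connect(":memory:")
toy_loans.to_sql("loans_train", conn, index=False)

# Hypothetical agent-style query: approval rate for each credit history value
query = """
    SELECT Credit_History,
           AVG(CASE WHEN Loan_Status = 'Y' THEN 1.0 ELSE 0.0 END) AS approval_rate,
           COUNT(*) AS n
    FROM loans_train
    GROUP BY Credit_History
    ORDER BY Credit_History
"""
approval_by_history = pd.read_sql_query(query, conn)
conn.close()

print(approval_by_history)
```

Inspecting the queries an agent actually runs is one practical way to recover some of the reasoning that its final natural language answer leaves out.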

The following outputs illustrate the steps and actions taken to reach its decision:

Additionally, a natural language output decision is produced using Python:

'The loan for Ms. Ayoola with an income of $5000.0, a credit history of 0.0, and a loan amount of $103.0 in the Semiurban property area would be denied.'

In conclusion, our experiment showcases the powerful combination of machine learning models, language models like DistilBERT, and conversational agents like ChatGPT. While the agent's response might lack explicit reasoning, the holistic approach employed in this experiment emphasizes the potential of Large Language Models (LLMs) and advanced machine learning techniques in the intricate landscape of loan approval decisions. As you embark on your own experiments, let the results speak for the remarkable capabilities of LLMs.
