This Loan Default Risk Analysis project is an end-to-end machine learning solution that predicts the likelihood of a loan applicant defaulting. It simulates a real-world lending decision by taking key personal and financial attributes of a borrower and applying a classification model to evaluate risk. Built with Python and widely used data science libraries (Pandas, NumPy, Scikit-learn, Streamlit, and FPDF), the project covers the full ML lifecycle, from dataset creation and preprocessing to model training, evaluation, and interactive deployment as a web app.

The workflow begins with generating and loading a synthetic loan applicant dataset, followed by cleaning, feature preparation, and training a Random Forest Classifier. This model was chosen for its robustness, its ability to capture nonlinear relationships, and its built-in feature importance scores. To keep the model transparent, its predictions are exposed through an interactive Streamlit interface where users enter hypothetical applicant data (e.g., age, annual income, credit score, loan amount, loan term) and receive a real-time risk prediction. The app also generates a downloadable PDF report summarizing the user's input and the model's prediction, which is useful for documentation or sharing with stakeholders.
This system simulates a practical credit scoring pipeline that could be used by lending institutions, microfinance organizations, or credit analysts to:
Pre-qualify applicants
Identify high-risk borrowers
Automate parts of the loan screening process
The model performance is evaluated using key metrics such as accuracy, precision, recall, and F1-score, giving a rounded view of how well the classifier distinguishes between defaulters and non-defaulters.
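As a reference for how these four metrics relate, the sketch below computes them from raw confusion-matrix counts, treating default (1) as the positive class. The function name `classification_metrics` and the sample labels are illustrative assumptions, not taken from the project's code; scikit-learn's `classification_report` reports the same quantities.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = default, 0 = no default
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this made-up example accuracy is 0.7 while recall is only 0.5, meaning half of the actual defaulters are missed. That gap is why accuracy alone can be misleading when defaults are the minority class.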
By combining a predictive backend with a user-facing interface, this project demonstrates a practical application of data science and machine learning to financial risk assessment, showing how parts of a decision pipeline can be automated while keeping the model interpretable and interactive.
Ultimately, this project not only highlights the power of machine learning in making informed loan decisions but also serves as a portfolio-ready showcase of technical skills in:
Data analysis
Model training and evaluation
PDF reporting
Streamlit app deployment
GitHub documentation and version control.
Challenges Faced:
-> Ensuring the model handled imbalanced class distributions effectively, where defaults are typically less frequent than non-defaults.
-> Avoiding overfitting in tree-based models like Random Forest due to a small, synthetic dataset.
-> Generating realistic synthetic data while maintaining variability and meaningful feature relationships.
-> Handling compatibility issues across Python, NumPy, and scikit-learn versions during local testing and packaging.
-> Maintaining modularity across data generation, model training, reporting, and UI components.
Data Imbalance Considerations
In real-world financial data, default cases are often underrepresented. To simulate this behavior:
-> The dataset was synthetically generated with a 75:25 split between non-defaulters and defaulters.
-> This mimics practical class imbalance and tests how well the model generalizes to the minority class.
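A minimal sketch of how a dataset with an exact 75:25 split might be generated, using only the standard library. The column names match the dataset description in this README; the value ranges, the distributions, and the function name `generate_applicants` are assumptions for illustration and may differ from what `generate_dummy_data.py` actually does.

```python
import random


def generate_applicants(n_rows=1000, default_rate=0.25, seed=42):
    """Return a list of synthetic applicant dicts with an exact default rate."""
    rng = random.Random(seed)
    n_default = round(n_rows * default_rate)
    labels = [1] * n_default + [0] * (n_rows - n_default)
    rng.shuffle(labels)

    rows = []
    for label in labels:
        # Assumed relationship: defaulters skew toward lower credit scores
        # and lower incomes. These distributions are illustrative only.
        score_mean = 580 if label else 700
        income_mean = 40_000 if label else 65_000
        credit_score = min(850, max(300, round(rng.gauss(score_mean, 60))))
        income = max(10_000, round(rng.gauss(income_mean, 15_000)))
        rows.append({
            "Age": rng.randint(21, 65),
            "AnnualIncome": income,
            "LoanAmount": rng.randint(1_000, 50_000),
            "CreditScore": credit_score,
            "LoanTerm": rng.choice([12, 24, 36, 48, 60]),
            "Default": label,
        })
    return rows


rows = generate_applicants()
share = sum(r["Default"] for r in rows) / len(rows)
print(f"{len(rows)} rows, default share = {share:.2f}")
```

Fixing the label counts first and then shuffling guarantees the class ratio exactly, rather than only approximately as per-row random sampling would.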
Note: In future versions, advanced techniques such as SMOTE (Synthetic Minority Over-sampling Technique) or cost-sensitive learning could be introduced.
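Cost-sensitive learning, one of the options named in the note, is commonly implemented by weighting each class inversely to its frequency. The sketch below is a hand-rolled version of that heuristic, using the same formula scikit-learn documents for `class_weight="balanced"`. The helper name `balanced_class_weights` is hypothetical, and this README lists the technique only as a future option.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# 75:25 split between non-defaulters (0) and defaulters (1)
labels = [0] * 750 + [1] * 250
weights = balanced_class_weights(labels)
print(weights)
```

With a 75:25 split, a misclassified defaulter is weighted three times as heavily as a misclassified non-defaulter (2.0 versus roughly 0.67).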
Future Improvements
-> Integrate SHAP or LIME for model interpretability and feature attribution.
-> Add a live database backend (e.g., SQLite, Firebase, or PostgreSQL) to log all predictions.
-> Incorporate email integration to send PDF reports directly to applicants.
-> Add user authentication for secure multi-user access.
-> Expand model training with hyperparameter tuning using GridSearchCV or Optuna.
-> Add a multi-page Streamlit app structure (sidebar navigation).
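For the hyperparameter tuning item above, a sketch of what a `GridSearchCV` run over a Random Forest could look like. The data is a stand-in from `make_classification` with roughly the same 75:25 class balance, not the project's CSV, and the grid values are assumptions chosen to keep the search small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data: 5 features, roughly 75:25 class split (not the project's CSV).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.75, 0.25], random_state=42,
)

# Assumed grid; shallow trees and larger leaves are included because they
# limit overfitting on small datasets.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # F1 on the positive (default) class
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy keeps the search focused on the minority (default) class, which matters more here than overall hit rate.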
Streamlit Deployment Experience
The complete app was deployed on Streamlit Cloud, allowing for real-time interaction with the model through a modern, browser-accessible interface.
Deployment involved:
-> Structuring the codebase for cloud readiness (requirements.txt, fixed paths)
-> Testing compatibility across Python versions and external libraries
-> Streamlining model size, folder structure, and app performance for smooth hosting.
This project was developed with a structured approach involving:
Target variable: Default (0 = No Default, 1 = Default)
loan-default-analysis/
├── app/
│   └── streamlit_app.py        # Streamlit UI
├── data/
│   └── loan_data.csv           # Input dataset
├── model/
│   └── loan_default_model.pkl  # Trained model
├── reports/
│   └── loan_risk_report_*.pdf  # Auto-generated PDF reports
├── visuals/
│   └── *.png                   # Plots (optional)
├── main.py                     # Model training script
├── generate_dummy_data.py      # Script to generate synthetic data
├── generate_report.py          # PDF report generator
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
└── .gitignore
git clone https://github.com/zufran123/loan-default-analysis.git
cd loan-default-analysis
python -m venv venv
venv\Scripts\activate # Windows
pip install -r requirements.txt
streamlit run app/streamlit_app.py
python main.py
streamlit run app/streamlit_app.py
Features: Age, AnnualIncome, LoanAmount, CreditScore, LoanTerm
Target: Default (0 = No, 1 = Yes)
Dataset is synthetically generated for demonstration purposes.
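To illustrate how the columns above separate into model inputs and a label, the sketch below parses CSV rows into a feature matrix and a target list using only the standard library. The sample rows are hypothetical, and it is an assumption that the header row of `data/loan_data.csv` uses exactly these column names.

```python
import csv
import io

FEATURES = ["Age", "AnnualIncome", "LoanAmount", "CreditScore", "LoanTerm"]
TARGET = "Default"

# Hypothetical sample rows standing in for data/loan_data.csv
sample_csv = """Age,AnnualIncome,LoanAmount,CreditScore,LoanTerm,Default
34,52000,12000,710,36,0
45,31000,20000,560,60,1
29,78000,8000,745,24,0
"""


def load_features_and_target(handle):
    """Read CSV rows and split them into a feature matrix X and labels y."""
    X, y = [], []
    for row in csv.DictReader(handle):
        X.append([float(row[name]) for name in FEATURES])
        y.append(int(row[TARGET]))
    return X, y


X, y = load_features_and_target(io.StringIO(sample_csv))
print(X[0], y)
```

To read the real file instead, pass `open("data/loan_data.csv", newline="")` as the handle.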
Explore the fully interactive Streamlit application here:
This open-source software is licensed under the MIT License.
Please give proper credit by including the license and attributing the original author.