Insurance Premium Prediction Application

Overview

This is a machine learning application designed for predicting insurance premiums. The project leverages a variety of tools and frameworks to streamline data management, experiment tracking, and model deployment.

πŸ› οΈ Tools Utilized


  • DVC (Data Version Control): Used for managing and versioning the data pipeline.
  • Git: Version control system for tracking code changes.
  • MLflow: Used for tracking model training and evaluation experiments.
  • GitHub Actions: Used for continuous integration and deployment.
  • Dagshub: Hosts the MLflow experiment tracking server and the DVC data pipeline remote.

πŸ›’οΈ Machine Learning Pipeline


Data Ingestion 📥

The application ingests the insurance premium data from data/insurance.csv and saves the ingested data into artifacts/DataIngestionArtifacts.
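
A minimal sketch of what this step might look like (the paths mirror the description above; the function name and the train/test split are illustrative assumptions, not necessarily how src/components/data_ingestion.py is written):

    import os

    import pandas as pd
    from sklearn.model_selection import train_test_split

    RAW_DATA_PATH = "data/insurance.csv"                # source data path
    INGESTION_DIR = "artifacts/DataIngestionArtifacts"  # ingestion output directory

    def ingest_data():
        """Read the raw CSV and persist train/test splits as ingestion artifacts."""
        os.makedirs(INGESTION_DIR, exist_ok=True)
        df = pd.read_csv(RAW_DATA_PATH)
        # The 80/20 split and random_state are illustrative assumptions.
        train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
        train_df.to_csv(os.path.join(INGESTION_DIR, "train.csv"), index=False)
        test_df.to_csv(os.path.join(INGESTION_DIR, "test.csv"), index=False)
        return train_df, test_df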

Data Transformation 🔧

Data undergoes transformation to prepare it for model training. Transformed data and preprocessing artifacts are saved into artifacts/DataTransformationArtifacts. Preprocessors are also stored in models/.
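
As a rough illustration, the transformation step could build and persist a scikit-learn preprocessor along these lines (the column names assume the standard insurance.csv schema with a charges target, and the file names are assumptions; the actual data_transformation.py may differ):

    import os

    import joblib
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    TRANSFORMATION_DIR = "artifacts/DataTransformationArtifacts"
    MODEL_DIR = "models"

    def transform_data(train_df: pd.DataFrame):
        """Fit a preprocessor on the training data and save it to both artifact locations."""
        # Assumed schema of insurance.csv: age, sex, bmi, children, smoker, region, charges.
        numeric_cols = ["age", "bmi", "children"]
        categorical_cols = ["sex", "smoker", "region"]

        preprocessor = ColumnTransformer([
            ("num", StandardScaler(), numeric_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ])

        X_train = preprocessor.fit_transform(train_df.drop(columns=["charges"]))
        y_train = train_df["charges"].to_numpy()

        os.makedirs(TRANSFORMATION_DIR, exist_ok=True)
        os.makedirs(MODEL_DIR, exist_ok=True)
        joblib.dump(preprocessor, os.path.join(TRANSFORMATION_DIR, "preprocessor.pkl"))
        joblib.dump(preprocessor, os.path.join(MODEL_DIR, "preprocessor.pkl"))
        return X_train, y_train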

Model Training 🤖

Multiple machine learning models are trained: Linear Regression, Ridge Regression, Lasso Regression, Polynomial Regression, Random Forest, Gradient Boosting, XGBoost, LightGBM, and CatBoost. The top 4 models based on training metrics are selected, and both the models and their associated metrics are saved into artifacts/ModelTrainerArtifacts. MLflow is used to track model parameters and metrics throughout this process.
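
The training loop might look roughly like the sketch below: each candidate is fitted, logged to MLflow, and saved, and the four best training scores decide which models move on to evaluation. The model subset, the R² metric, and the file names are assumptions for illustration (Polynomial Regression and the boosting libraries would be added the same way):

    import os

    import joblib
    import mlflow
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.linear_model import Lasso, LinearRegression, Ridge
    from sklearn.metrics import r2_score

    TRAINER_DIR = "artifacts/ModelTrainerArtifacts"

    def train_models(X_train, y_train):
        """Fit candidate regressors, log them to MLflow, and return the top 4 by training R^2."""
        candidates = {
            "linear_regression": LinearRegression(),
            "ridge": Ridge(),
            "lasso": Lasso(),
            "random_forest": RandomForestRegressor(),
            "gradient_boosting": GradientBoostingRegressor(),
        }

        os.makedirs(TRAINER_DIR, exist_ok=True)
        scores = {}
        for name, model in candidates.items():
            with mlflow.start_run(run_name=name):
                model.fit(X_train, y_train)
                scores[name] = r2_score(y_train, model.predict(X_train))
                mlflow.log_params(model.get_params())
                mlflow.log_metric("train_r2", scores[name])
            joblib.dump(model, os.path.join(TRAINER_DIR, f"{name}.pkl"))

        # Keep the four best-scoring models for the evaluation stage.
        return sorted(scores, key=scores.get, reverse=True)[:4]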

Model Evaluation 📊

The best-performing model on test data is selected and saved into artifacts/ModelEvaluationArtifacts and models/. Model evaluation metrics are tracked using MLflow.
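
A sketch of how the evaluation stage could pick the winner on the test set (the file names follow the training sketch above, and the R² criterion is an illustrative assumption):

    import os

    import joblib
    import mlflow
    from sklearn.metrics import mean_squared_error, r2_score

    TRAINER_DIR = "artifacts/ModelTrainerArtifacts"
    EVALUATION_DIR = "artifacts/ModelEvaluationArtifacts"
    MODEL_DIR = "models"

    def evaluate_models(top_models, X_test, y_test):
        """Score the shortlisted models on test data and persist the best one."""
        os.makedirs(EVALUATION_DIR, exist_ok=True)
        best_name, best_r2 = None, float("-inf")
        for name in top_models:
            model = joblib.load(os.path.join(TRAINER_DIR, f"{name}.pkl"))
            preds = model.predict(X_test)
            r2 = r2_score(y_test, preds)
            with mlflow.start_run(run_name=f"eval_{name}"):
                mlflow.log_metric("test_r2", r2)
                mlflow.log_metric("test_mse", mean_squared_error(y_test, preds))
            if r2 > best_r2:
                best_name, best_r2 = name, r2

        best_model = joblib.load(os.path.join(TRAINER_DIR, f"{best_name}.pkl"))
        joblib.dump(best_model, os.path.join(EVALUATION_DIR, "best_model.pkl"))
        joblib.dump(best_model, os.path.join(MODEL_DIR, "best_model.pkl"))
        return best_name, best_r2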

Streamlit App Development 💻

A Streamlit application is developed to allow users to input data and receive predictions from the trained model.

[Screenshot: Streamlit app interface]
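
In outline, app.py could look something like this: it loads the persisted preprocessor and best model and turns the form inputs into a single-row DataFrame for prediction. The widget labels, artifact file names, and feature set are assumptions based on the standard insurance dataset, not the exact contents of app.py:

    import joblib
    import pandas as pd
    import streamlit as st

    # Assumed artifact names; the real app loads whatever the evaluation stage saved.
    preprocessor = joblib.load("models/preprocessor.pkl")
    model = joblib.load("models/best_model.pkl")

    st.title("Insurance Premium Prediction")
    age = st.number_input("Age", min_value=18, max_value=100, value=30)
    sex = st.selectbox("Sex", ["male", "female"])
    bmi = st.number_input("BMI", min_value=10.0, max_value=60.0, value=25.0)
    children = st.number_input("Children", min_value=0, max_value=10, value=0)
    smoker = st.selectbox("Smoker", ["yes", "no"])
    region = st.selectbox("Region", ["northeast", "northwest", "southeast", "southwest"])

    if st.button("Predict premium"):
        row = pd.DataFrame([{"age": age, "sex": sex, "bmi": bmi,
                             "children": children, "smoker": smoker, "region": region}])
        premium = model.predict(preprocessor.transform(row))[0]
        st.success(f"Estimated premium: {premium:,.2f}")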

Model Deployment 🚀

The model is deployed on AWS EC2 using Docker and GitHub Actions.

📋 Model tracking with MLflow


[Screenshot: MLflow experiment tracking]
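
Runs are logged to the Dagshub-hosted MLflow tracking server. One common way to wire this up is sketched below with placeholder repository details (dagshub.init comes from the dagshub package in the requirements; the project may instead set MLFLOW_TRACKING_URI directly):

    import dagshub
    import mlflow

    # Placeholder owner/repo; this points MLflow at the Dagshub tracking server.
    dagshub.init(repo_owner="<dagshub_user>", repo_name="<repo_name>", mlflow=True)

    with mlflow.start_run(run_name="example_run"):
        mlflow.log_param("model", "ridge")    # illustrative parameter
        mlflow.log_metric("train_r2", 0.87)   # illustrative metric value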

πŸ–‡οΈ Data pipeline tracking with DVC


[Screenshots: DVC data pipeline]

πŸ“ Directory Structure


📂.github/
└── 📂workflows/
      └── main.yaml
📂docs/
├── 📂docs/
│     ├── index.md
│     └── getting-started.md
├── mkdocs.yml
└── README.md
📂src/
├── __init__.py
├── 📂components/
│     ├── __init__.py
│     ├── data_ingestion.py
│     ├── data_transformation.py
│     ├── model_trainer.py
│     └── model_evaluation.py
├── 📂constants/
│     └── __init__.py
├── 📂entity/
│     ├── __init__.py
│     ├── config_entity.py
│     └── artifact_entity.py
├── 📂pipeline/
│     ├── __init__.py
│     ├── training_pipeline.py
│     └── prediction_pipeline.py
├── 📂utils/
│     ├── __init__.py
│     └── utils.py
├── 📂logger/
│     └── __init__.py
└── 📂exception/
      └── __init__.py
📂data/
  └── insurance.csv
📂experiment/
  └── experiments.ipynb
requirements.txt
requirements_app.txt
setup.py
app.py
main.py
README.md
implement.md
.gitignore
template.py
prediction.py
init_setup.ps1
dvc.yaml
Dockerfile
demo.py
config.json
.dockerignore
.dvcignore

📈 Models


  • Linear Regression
  • Ridge Regression
  • Lasso Regression
  • Polynomial Regression
  • Random Forest
  • Gradient Boosting
  • XGBoost
  • LightGBM
  • CatBoost

🖥️ Installation


πŸ› οΈ Requirements:

  • Python 3.10
  • mkdocs
  • dvc
  • numpy
  • pandas
  • colorama
  • mlflow==2.2.2
  • dagshub
  • scikit-learn
  • xgboost
  • lightgbm
  • catboost
  • streamlit

βš™οΈ Setup


To reproduce the model and run the application:

  1. Clone the repository:

    git clone <repository_url>
    cd <repository_name>

  2. Set up the virtual environment and install the requirements:

    ./init_setup.ps1

  3. Execute the whole pipeline:

    python main.py

Now run the Streamlit app (see the inference demo below).

🎯 Inference demo


  1. Run the Streamlit app:

    streamlit run app.py

  2. Enter the input values and get the prediction.

Contributors 👨🏼‍💻


  • Ravi Kumar