HERITAGE HEALTH PRIZE REPORT

1. Author

This project was done entirely by me as part of the VEF Academy Machine Learning Course, Winter 2018.

2. Project Description

Heritage Health Prize (HHP) is a Kaggle challenge held in 2012 whose best score was 0.466 Root Mean Squared Logarithmic Error (RMSLE). The HHP dataset comprises 3 years of insurance payment claim records from US hospitals, together with several categories of patient health information. The goal of the HHP challenge is to predict the number of days a patient will spend in the hospital in the next year, based on claims data from the year before.
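
For reference, the RMSLE over n members is sqrt((1/n) * Σ (log(p_i + 1) − log(a_i + 1))²), where p_i is the predicted and a_i the actual number of days in hospital.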

In this project, the dataset used is HHP release 3, which includes the 3 years of claims, Days in Hospital for years 2 and 3, patient personal information, and the Drug Count and Lab Count of claims.

This project has 2 problems:

  • The dataset is at the claims level, while predictions are made at the patient (MemberID) level.
  • The prediction target is a continuous numeric value.

Given these problems, the solution follows these steps:

  1. Data extraction: Extract and convert features from the claims level to the member level for prediction.
  2. Feature selection: Select the features to use in the prediction model.
  3. Tune & train model: With the features selected in the previous step, choose the model hyper-parameters, train, and get the model result.
  4. Compare results: Compare the result with other models' results, then repeat from step 1, 2 or 3 to improve it.

3. Model Selection

As described in Section 2, the prediction target is a continuous numeric value, so the problem can be solved with Linear Regression and/or gradient-boosted Decision Trees.

A widely used state-of-the-art solution for this kind of problem is XGBoost, which has repeatedly been shown to outperform alternative approaches.

4. Data Extraction

4.1. Detail Data Description

The dataset used includes:

  1. Members.csv
  2. Claims.csv
  3. DrugCount.csv
  4. LabCount.csv
  5. DaysInHospital_Y2.csv
  6. DaysInHospital_Y3.csv

4.1.1. Members Information

Members information has 2 feature columns:

  • Sex, with Male, Female and null values. All will be binary one-hot encoded for each MemberID.
  • Age at first claim, with categorical values from the '0-9' age range up to '70-79' and '80+'. These values are converted to the integer midpoint of the range; for example, '10-19' is replaced with 15 (see the sketch below).
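
A minimal pandas sketch of this conversion (the column names MemberID, AgeAtFirstClaim and Sex follow the HHP data release; the exact notebook code may differ):

```python
import pandas as pd

members = pd.read_csv("Members.csv")  # MemberID, AgeAtFirstClaim, Sex

# One-hot encode Sex; dummy_na=True keeps a column for null values
sex = pd.get_dummies(members["Sex"], prefix="Sex", dummy_na=True)

# Map an age bucket such as '10-19' or '80+' to an integer midpoint
def age_midpoint(bucket):
    if pd.isna(bucket):
        return None
    if bucket.endswith("+"):  # '80+' has no upper bound; 85 is an assumption
        return int(bucket[:-1]) + 5
    low, high = bucket.split("-")
    return (int(low) + int(high) + 1) // 2  # '10-19' -> 15

members["Age"] = members["AgeAtFirstClaim"].map(age_midpoint)
member_info = pd.concat([members[["MemberID", "Age"]], sex], axis=1)
member_info.to_csv("MemberInfo_df.csv", index=False)
```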

4.1.2. Claims Information

The Claims table holds claims-level information with the following features:

  • MemberID: Patient ID
  • ProviderID: Claim provider ID
  • Vendor: Claim vendor ID
  • PCP: Primary care physician ID
  • Year: Y1, Y2 or Y3
  • Specialty: Categorical specialty data
  • PlaceSvc: Categorical service-place data
  • PayDelay: Number of days in the range 0-161, with the 95th percentile and above top-coded as the string "162+"
  • LengthOfStay: Categorical length of stay, in days up to 6 days and in week ranges up to 8 weeks; above that, suppression is applied and the SupLOS column is 1, else 0
  • DSFS: Days since first service, as categorical month buckets from 1 to 12
  • PrimaryConditionGroup: Categorical diagnosis data
  • CharlsonIndex: Categorical Charlson comorbidity index data
  • ProcedureGroup: Categorical procedure data
  • SupLOS: 0 or 1 for unsuppressed and suppressed LengthOfStay values, respectively

This processing step uses only Y1 data for training and Y2 for testing. For each year, features are converted from the claims level to the member level following these rules:

  • ID data such as ProviderID, Vendor and PCP: count the unique ID values for each MemberID.
  • Numeric data such as PayDelay: convert to integer and sum for each MemberID.
  • Other categorical data: one-hot encode with value counts per MemberID.

Each feature is converted and saved to a separate CSV file (a conversion sketch follows the lists below).

Count of unique IDs:

  • Provider
  • Vendor
  • PCP

One-hot with value_counts for categorical features:

  • Specialty
  • PlaceSvc
  • PrimaryConditionGroup
  • CharlsonIndex
  • ProcedureGroup
  • DSFS
  • LengthOfStay with SupLOS

Summed values for:

  • PayDelay
  • Number of member's claims
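
A minimal sketch of the three conversion rules for one year of claims (column names follow Claims.csv; the actual notebook applies the same idea to every feature listed above):

```python
import pandas as pd

claims = pd.read_csv("Claims.csv")
y1 = claims[claims["Year"] == "Y1"]

# Rule 1: count unique IDs per member (shown here for ProviderID)
providers = y1.groupby("MemberID")["ProviderID"].nunique().rename("Providers")

# Rule 2: convert numeric strings ("162+" -> 162) and sum per member
pay_delay = (
    y1["PayDelay"].astype(str).str.rstrip("+").astype(int)
    .groupby(y1["MemberID"]).sum().rename("PayDelay")
)

# Rule 3: one-hot a categorical feature with per-member value counts
charlson = pd.crosstab(y1["MemberID"], y1["CharlsonIndex"])

providers.to_csv("Providers_Y1.csv")
pay_delay.to_csv("PayDelay_Y1.csv")
charlson.to_csv("Charlson_Y1.csv")
```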

4.1.3. DrugCount and LabCount Information

  • DrugCount is the count of unique prescription drugs filled, top-coded at 7
  • LabCount is the count of unique laboratory and pathology tests, top-coded at 10

Both are numeric and are converted to integers and summed for each MemberID in each year, as sketched below.
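
A sketch of that conversion for DrugCount (LabCount is handled identically; column names follow DrugCount.csv):

```python
import pandas as pd

drugs = pd.read_csv("DrugCount.csv")  # MemberID, Year, DSFS, DrugCount

# Strip the '+' from the top-coded value '7+' before converting to int
drugs["DrugCount"] = drugs["DrugCount"].astype(str).str.rstrip("+").astype(int)

# Sum per member within each year and write one file per year
for year, group in drugs.groupby("Year"):
    group.groupby("MemberID")["DrugCount"].sum().to_csv(f"DrugCount_{year}.csv")
```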

4.1.4. Days in Hospital

This is the label for model training. The ClaimsTruncated feature is dropped, and the data is sorted by MemberID and saved to CSV files.

4.2. Result

The data is converted from the claims level (categorical and numeric types) to the member level (one-hot counts, and sums for columns with large numeric values). Each column in Claims is converted and saved to its own CSV file for later use, since the processing takes a long time.

The output dataset includes:

  • MemberInfo_df.csv
  • DaysInHos_Y2.csv
  • DaysInHos_Y3.csv
  • Claims_count_Y1.csv
  • Claims_count_Y2.csv
  • Charlson_Y1.csv
  • Charlson_Y2.csv
  • DrugCount_Y1.csv
  • DrugCount_Y2.csv
  • DSFS_Y1.csv
  • DSFS_Y2.csv
  • LabCount_Y1.csv
  • LabCount_Y2.csv
  • LOS_Y1.csv
  • LOS_Y2.csv
  • PayDelay_Y1.csv
  • PayDelay_Y2.csv
  • PCPs_Y1.csv
  • PCPs_Y2.csv
  • PlaceSvc_Y1.csv
  • PlaceSvc_Y2.csv
  • PrimCondition_Y1.csv
  • PrimCondition_Y2.csv
  • Procedures_Y1.csv
  • Procedures_Y2.csv
  • Providers_Y1.csv
  • Providers_Y2.csv
  • Specialty_Y1.csv
  • Specialty_Y2.csv
  • Vendors_Y1.csv
  • Vendors_Y2.csv

The extraction uses the code in the Data_Prep.ipynb notebook, which imports the input CSV files and exports the converted data to CSV files for the later feature selection and model tuning steps.

5. Feature Selection

The HHP dataset underwent strict de-identification to meet the requirements of the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, which can introduce noise into the data and affect the prediction result. In the data extraction step there are 6 features that were top-coded or suppressed around the 95th percentile and can be considered for dropping:

Truncated feature                               Number of members
PayDelay top-coded at "162+"                    23,447
Claims count truncated to 43 claims per year       715
Drug Count top-coded at "7+"                     5,720
Lab Count top-coded at "10+"                     6,842
Length of Stay null and SupLOS == 1              1,161
ClaimsTruncated == 1 in label                    3,971
Total unique members                            30,351

The idea is that if any MemberID has a claim with a feature that was truncated or suppressed, all of that member's claim records in Year 1 (the training set) are dropped, as sketched below.
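
A hedged sketch of this filter (how the notebook actually collects the affected members may differ; only two of the criteria are shown):

```python
import pandas as pd

claims = pd.read_csv("Claims.csv")
y1 = claims[claims["Year"] == "Y1"]

# Collect members hit by any truncation / suppression criterion
truncated = set(y1.loc[y1["PayDelay"] == "162+", "MemberID"])
truncated |= set(y1.loc[y1["SupLOS"] == 1, "MemberID"])
# ... drug-count, lab-count, claims-count and label criteria are added the same way

# Drop every Year-1 claim record of an affected member
clean_y1 = y1[~y1["MemberID"].isin(truncated)]
```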

Below are the results of 8 feature-selection scenarios, each trained with base hyper-parameters in both a linear-based and a tree-based booster.

  1. No drop (Linear Base / GBTree Base)
  2. Drop PayDelay (Linear PayDelay / GBTree PayDelay)
  3. Drop Claims Count (Linear ClaimsCount / GBTree ClaimsCount)
  4. Drop Drug Count (Linear Drug / GBTree Drug)
  5. Drop Lab Count (Linear Lab / GBTree Lab)
  6. Drop SupLOS (Linear SupLOS / GBTree SupLOS)
  7. Drop Claims (Linear Claims / GBTree Claims)
  8. Drop all scenarios (Linear All / GBTree All)

From the 16 results above, it is easy to conclude that tree-based XGBoost is a better solution than the linear-based one for this project's problem.

6. Model Parameter Tuning

Tree-based XGBoost has some important parameters:

  • learning_rate: Step size shrinkage used in updates to prevent overfitting. After each boosting step, learning_rate shrinks the feature weights to make the boosting process more conservative. For the same number of iterations, a small learning rate tends toward under-fitting (high bias) and a large one toward over-fitting (high variance). Range: [0,1]
  • subsample: Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees. Smaller subsample tends to prevent over-fitting. Subsampling will occur once in every boosting iteration. Range: (0,1]
  • colsample_by*: The family of parameters for subsampling columns. Smaller values help prevent over-fitting. All are in range (0,1].
    • colsample_bytree is the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.
    • colsample_bylevel is the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.
    • colsample_bynode is the subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.
  • lambda: L2 regularization term on weights. Increasing this value makes the model more conservative.
  • max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit.

7. Result

Final Result

The best result achieved is an RMSLE score of 0.469818 (rank 155 on the Kaggle private leaderboard), obtained with the combination of the 6 feature-selection drop scenarios and the following hyper-parameters after several rounds of tuning and experimentation:

  • 'subsample': 0.3
  • 'colsample_by*': 0.3
  • 'learning_rate': 0.01
  • 'max_depth': 10
  • 'lambda': 40
  • 'num_boost_round': 300

subsample and colsample_by* are set to fairly small values to prevent over-fitting. max_depth of 10 is suitable so that the available hardware can train in an acceptable time.

learning_rate and lambda were chosen by comparing several results.
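
A minimal training sketch with the chosen parameters. X_train, y_train, X_valid and y_valid are assumptions standing in for the merged member-level features and Days-in-Hospital labels; training on log1p of the label makes XGBoost's RMSE equivalent to the competition's RMSLE:

```python
import numpy as np
import xgboost as xgb

# X_*: merged member-level feature frames (assumed), y_*: DaysInHospital labels
dtrain = xgb.DMatrix(X_train, label=np.log1p(y_train))
dvalid = xgb.DMatrix(X_valid, label=np.log1p(y_valid))

params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",   # RMSE on log1p(label) is the RMSLE
    "learning_rate": 0.01,
    "subsample": 0.3,
    "colsample_bytree": 0.3,
    "colsample_bylevel": 0.3,
    "colsample_bynode": 0.3,
    "max_depth": 10,
    "lambda": 40,
}

model = xgb.train(
    params,
    dtrain,
    num_boost_round=300,
    evals=[(dtrain, "train"), (dvalid, "valid")],
)

# Invert the log1p transform to get predictions back on the days scale
preds = np.expm1(model.predict(dvalid))
```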

8. Conclusion

8.1. Lessons Learnt

  • Garbage in, garbage out: I spent too much time tuning parameters while forgetting the effect of truncated data on the prediction result, which could be fixed by cleaning the training dataset.

  • Bias vs. variance: You can't achieve the best bias and the best variance at the same time. Building a good model involves a trade-off between bias and variance, and we can only find a balance point that minimizes the model's total error.

8.2. Potential Improvement

  • Data extraction and feature selection should be explored in more different ways. In this project I only applied common knowledge to data extraction and feature selection. Biomedical domain knowledge is really necessary to gain better data insight for feature engineering on features such as Specialty, PrimaryConditionGroup and ProcedureGroup.
  • The machine learning model used in this project is merely XGBoost, trained on the training dataset in at most 4 minutes. Ensembling with more models, such as a fully connected neural network, might improve the result.
