Chronic KIdney Disease dataset
Context
First, I am new to ML, and just in case I slip up, apologies in advance!! So, I am doing an online ML course and this is an assignment where we are supposed to practice scikit-learn's PCA routine. Since the course has been ARCHIVED - which means the discussion posts are not answered!! - hence my posting of the problem here.
What better way to learn than to get so many experts giving me feedback … right?
Content
The data was taken over a 2-month period in India with 25 features ( eg, red blood cell count, white blood cell count, etc). The target is the 'classification', which is either 'ckd' or 'notckd' - ckd=chronic kidney disease. There are 400 rows
The data needs cleaning: in that it has NaNs and the numeric features need to be forced to floats. Basically, we were instructed to get rid of ALL ROWS with Nans, with no threshold - meaning, any row that has even one NaN, gets deleted.
Part 1: We are asked to choose 3 features (bgr, rc, wc), visualize them, then run the PCA with n_components=2. the PCA is to be run twice: one with no scaling and the second run WITH scaling. And this is where my issue starts … in that after scaling I can hardly see any difference!
I will stop here for now till I get feedback and then move to Part 2.
Acknowledgements The dataset is available at: https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease
Inspiration I would like to get an intuitive and a practical understanding of PCA.
Data Dictionary
Additional Information
Field | Value |
---|---|
Data last updated | September 21, 2024 |
Metadata last updated | September 21, 2024 |
Created | September 21, 2024 |
Format | CSV |
License | Creative Commons Attribution |
Datastore active | True |
Has views | True |
Id | 6bd22f52-9b61-4c36-9984-168a4845ce31 |
Mimetype | text/csv |
Package id | 8ab5865a-ac21-4b6b-9031-bd203fe64331 |
Position | 0 |
Size | 47.4 KiB |
State | active |
Url type | upload |