Prediction of Diabetes Using Data Mining Techniques
V. Mareeswari* , Saranya R, Mahalakshmi R, Preethi E
Department of Software Engineering, School of Information Technology and Engineering, VIT University, Vellore.
*Corresponding Author E-mail: vmareeswari@vit.ac.in
ABSTRACT:
Diabetes mellitus is one of the world's major diseases, affecting millions of people. The risk of diabetes is increasing day by day, and the disease is reported more often in women than in men. Diagnosis of diabetes is a tedious process, but improvements in science and technology have made it easier to predict the disease. The purpose of this work is to determine whether a person is affected by diabetes using the K Nearest Neighbor (KNN) classification technique. A diabetes dataset is taken as the training data and the details of the patient are taken as the testing data. First, the training data are classified using the KNN classifier; then the target data are predicted. The KNN algorithm is used here because it serves efficiently for both classification and prediction. The results are analysed for different values of the parameter k.
KEYWORDS: Data mining techniques, k nearest neighbor, prediction of diabetes, classification, UCI repository.
INTRODUCTION:
Diabetes mellitus, also known simply as diabetes, is a metabolic disease that is becoming increasingly deadly and is responsible for a growing number of deaths. The disease is caused by defects in insulin secretion, that is, when the pancreas does not produce a sufficient amount of insulin or when the cells of the body stop responding to the insulin produced. The disease is of three types: insulin-dependent, insulin-independent and gestational diabetes. Insulin-dependent diabetes, now known as Type 1 DM, is due to a lack of insulin production: the failure of the pancreas to produce insulin results in type 1 diabetes. It is mostly seen in children, hence its traditional name, juvenile diabetes.
Insulin-independent diabetes, also known as Type 2 DM, is due to the failure of cells to respond to the insulin produced. Lifestyle and genetics play a major role here: people who are obese (body mass index above 30), have crossed the age of 40, or are under stress are more likely to develop type 2 diabetes. It is also hereditary and can follow a family for generations. The third type, gestational diabetes, occurs in women during the gestation period: pregnant women develop high blood sugar levels without any previous history of diabetes. This may disappear after delivery and is curable under proper medical supervision, but it may progress to type 2 diabetes.

Diabetes occurs in almost all age groups. Blood pressure and plasma glucose levels may also contribute to diabetes. Diabetic patients experience symptoms such as frequent urination, constant thirst, weight loss, hunger and irritation, and the disease may lead to further complications such as visual impairment, cardiovascular disease and kidney disease. The disease is mostly detected at later stages, so detecting it early could reduce the risk of diabetes and of the other diseases it causes. Early detection is possible through data mining techniques.

Data mining tools and techniques are used in many fields nowadays. The practice of examining pre-existing datasets to gather new information is known as data mining; it is also referred to as knowledge discovery or machine learning. There are various data mining techniques such as classification, clustering, association, prediction and regression. Vast amounts of data are available, and these datasets are analysed to discover patterns from which future events are predicted. Data mining in health care systems improves care and reduces cost. In this work, a predictive analysis is made to detect whether a person is affected by diabetes.

Classification is a supervised learning approach: it predicts categorical class label values by constructing a classifier. Training data are analysed by the classification algorithm to build the classifier, and the accuracy of the algorithm is estimated using test data. Two types of classification are possible, binary and multiclass: binary classification has only two target class labels, while multiclass classification has more than two. Among the various classification techniques, the KNN algorithm is chosen here to predict the disease.
Literature survey:
The objective of VelidePhani Kumar and Lakshmi Valide [5] was to analyse the performance of different classifier algorithms in data mining. Classifiers such as Naïve Bayes, J48, JRip, neural networks, decision trees and fuzzy logic were analysed. The accuracy of the classifiers and the time taken by each classifier to predict the disease vary. The classifier outputs were analysed using the Weka 3.6.6 tool. The objective of Krati Saxena, Zubair Khan and Shefali Singh [6] was to report the accuracy and error rates of the K Nearest Neighbor algorithm. One training dataset and two test datasets were selected; when KNN was applied to these datasets, the result was that both the accuracy rate and the error rate increase as k increases. The results were evaluated in Matlab.
Dr. M. Renuka Devi and J. Maria Shyla [2] analysed different data mining techniques to predict diabetes mellitus. When these techniques were applied to the diabetes dataset, the analysis showed that the J48 classifier provides higher accuracy than the other classifiers; this was implemented using Weka and Matlab. Asha Gowda Karegowda, M. A. Jayaram and A. S. Manjunath [8] presented a cascade of k-means clustering and a k nearest neighbor classifier for categorization of diabetic patients. In the first stage, k-means clustering is used to identify and eliminate incorrectly classified instances. In the second stage, a Genetic Algorithm (GA) and Correlation-based Feature Selection (CFS) are used for relevant feature extraction, where GA performs a global search of attributes with fitness evaluation effected by CFS. In the third stage, classification is done using K nearest neighbor (KNN), taking the correctly clustered instances of the first stage as one input and the feature subset identified in the second stage as another. Experimental results show that cascading k-means clustering and KNN with the feature subset identified by GA_CFS enhances the classification accuracy of KNN; the proposed model obtained a classification accuracy of 96.68% on the diabetic dataset. In the existing methods, classification techniques are applied to the dataset taken and the accuracy and error rates of the different classifiers are reported; the accuracy of each classifier differs when analysed. In the proposed method, the same data mining classification algorithm is used both to classify the dataset taken and to make predictions for the data collected from the user.
MATERIALS AND METHODS:
Dataset:
Here we make use of the Pima Indians Diabetes dataset. The dataset [10] is taken from the UCI machine learning repository. It consists of 9 attributes: number of times pregnant, plasma glucose concentration, diastolic blood pressure, triceps skin fold thickness, serum insulin, body mass index, diabetes pedigree function, age and class. Fig 1 below summarises the data in the dataset; a small parsing sketch of this attribute layout is given after Fig 2.
Fig1 Bar graph of the dataset taken
Here, the class label is binary. It has two values:
· Tested positive (1) which means diabetic and
· Tested negative (0) which means non diabetic.
Fig 2 below shows the number of persons tested positive and the number of persons tested negative in the given dataset.
Fig2 Bar graph showing class labels of dataset
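To make the attribute layout concrete, the following is a minimal sketch, not the paper's implementation, that reads the dataset from a local comma-separated file (the file name pima-indians-diabetes.csv and the assumption that the file is plain CSV in the column order above are illustrative assumptions). Each row is parsed into an 8-value feature vector plus the binary class label.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DatasetLoader {
    // One record: 8 numeric attributes plus the binary class label
    // (1 = tested positive, 0 = tested negative).
    public static class Record {
        double[] features = new double[8];
        int label;
    }

    // Reads the comma-separated Pima Indians Diabetes file (assumed path) into memory.
    public static List<Record> load(String path) throws IOException {
        List<Record> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) continue;            // skip blank lines
                String[] parts = line.split(",");
                Record r = new Record();
                for (int i = 0; i < 8; i++) {
                    r.features[i] = Double.parseDouble(parts[i].trim());
                }
                r.label = Integer.parseInt(parts[8].trim());    // last column is the class label
                records.add(r);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        List<Record> data = load("pima-indians-diabetes.csv"); // assumed local file name
        System.out.println("Loaded " + data.size() + " records");
    }
}
```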
K-NEAREST NEIGHBOR ALGORITHM:
KNN is a simple, lazy learning algorithm. It is one of the classification algorithms used in health care. It can be used for both classification and regression; however, it is more widely used for classification problems. The algorithm is preferred mostly for its ease of interpretation. KNN classifies an unknown data point using the existing dataset: the general principle is to determine the k nearest neighbors of the point using a distance measure over the dataset, and the majority class among these neighbors decides the category of the given instance. KNN is therefore a distance-based, majority-voting algorithm.
Steps to compute the KNN algorithm (a code sketch is given after the list):
1. Determine the parameter k, the number of nearest neighbors.
2. Calculate the distance between the test data and all the training samples using the Euclidean distance.
3. Sort the calculated distances from minimum to maximum using any sorting technique and select the k nearest neighbors.
4. Identify the target class label values of the k nearest neighbors.
5. The majority of these class labels is assigned as the prediction value of the unknown data.
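The following is a minimal plain-Java sketch of the five steps above; the tiny training arrays and the value of k in main are purely illustrative placeholders, and the distance is the standard Euclidean distance over the numeric attributes.

```java
import java.util.Arrays;
import java.util.Comparator;

public class SimpleKnn {
    // Step 2: Euclidean distance between the test point and one training sample.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Steps 1-5: predict the class label (0 or 1) of 'test' from the training features and labels.
    static int predict(double[][] trainX, int[] trainY, double[] test, int k) {
        Integer[] idx = new Integer[trainX.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Step 3: sort training indices by distance to the test point.
        Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(trainX[i], test)));
        // Steps 4-5: count the class labels of the k nearest neighbors and take the majority.
        int positive = 0;
        for (int n = 0; n < k; n++) {
            if (trainY[idx[n]] == 1) positive++;
        }
        return (positive > k / 2) ? 1 : 0;
    }

    public static void main(String[] args) {
        // Tiny illustrative training set (two attributes only, values are made up).
        double[][] trainX = {{1, 1}, {1, 2}, {6, 6}, {7, 7}, {6, 7}};
        int[] trainY = {0, 0, 1, 1, 1};
        double[] test = {6.5, 6.5};
        System.out.println("Predicted class: " + predict(trainX, trainY, test, 3)); // prints 1
    }
}
```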
Though this algorithm is easy to implement, it has some drawbacks. For large datasets it must compute the distance to every data point, so the cost of computation is high and the storage space required to hold the data is also large. It is also not always clear which attributes should be used to obtain better results. The classification output can be summarised in a confusion matrix with true positives, true negatives, false positives and false negatives, from which the true positive rate (recall), false positive rate, precision, accuracy and error rate are computed; a small computation sketch follows the list below.
· True positive (TP) – the person actually has the disease and the prediction is also positive.
· True negative (TN) – the person does not have the disease and the prediction is also negative.
· False positive (FP) – the person does not have the disease but the prediction is positive.
· False negative (FN) – the person actually has the disease but the prediction is negative.
· TP and TN are used to calculate the accuracy rate, while the error rate is calculated from the FP and FN values.
· True positive rate (recall) is TP divided by the total number of people who actually have the disease.
· False positive rate is FP divided by the total number of people who actually do not have the disease.
· Precision is TP divided by the total number of people whose prediction result is positive.
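As a worked illustration of these definitions, the sketch below computes the rates from raw counts. The counts in main are taken from the k = 3 confusion matrix in the Results section, reading class a (tested negative) as the positive class, which appears to be the convention behind the reported TP rate, FP rate and precision; with this reading the printed values reproduce Table 2.

```java
public class ConfusionMatrixMetrics {
    public static void main(String[] args) {
        // Counts from the k = 3 confusion matrix in the Results section,
        // with class a (tested negative) read as the positive class.
        double tp = 410, fn = 90, fp = 120, tn = 148;

        double tpRate    = tp / (tp + fn);                   // 410/500 ≈ 0.820
        double fpRate    = fp / (fp + tn);                   // 120/268 ≈ 0.448
        double precision = tp / (tp + fp);                   // 410/530 ≈ 0.774
        double accuracy  = (tp + tn) / (tp + tn + fp + fn);  // 558/768 ≈ 0.7266
        double errorRate = 1.0 - accuracy;

        System.out.printf("TP rate    = %.3f%n", tpRate);
        System.out.printf("FP rate    = %.3f%n", fpRate);
        System.out.printf("Precision  = %.3f%n", precision);
        System.out.printf("Accuracy   = %.4f%%%n", accuracy * 100);
        System.out.printf("Error rate = %.4f%%%n", errorRate * 100);
    }
}
```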
PROPOSED METHOD:
Architecture diagram:
The system is designed to determine the health condition of a person. The existing diabetes dataset is taken from the repository as training data and is preprocessed first. The K Nearest Neighbor (KNN) algorithm is applied to classify the dataset. Then the health care details of the person are collected and analysed against the classified dataset to predict the person's status; the same KNN algorithm is used for this prediction. Using this classification technique it is easy both to classify and to predict the data. Fig 3 below shows the architecture of the system.
Fig3 Architecture specification of the system
Pre-processing:
As a first step, the target data must be collected before applying the data mining concepts. The dataset is preprocessed to analyse the class labels. Data cleaning removes noise and replaces missing data in the target dataset.
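The paper does not state exactly how missing values are replaced. One possible sketch using Weka (the file name diabetes.arff is an assumption) replaces missing attribute values with the attribute mean or mode via the ReplaceMissingValues filter.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class Preprocess {
    public static void main(String[] args) throws Exception {
        // Load the raw diabetes dataset (ARFF or CSV); the file name is an assumption.
        Instances raw = DataSource.read("diabetes.arff");
        raw.setClassIndex(raw.numAttributes() - 1);   // last attribute is the class label

        // Replace missing values with the mean (numeric) or mode (nominal) of the data.
        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(raw);
        Instances cleaned = Filter.useFilter(raw, filter);

        System.out.println("Instances after cleaning: " + cleaned.numInstances());
    }
}
```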
Training Phase:
The classification algorithm is now applied to the cleaned dataset. The algorithm reports the numbers of correctly and incorrectly classified instances. The classifier accuracy is tested for different values of the parameter k.
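A minimal sketch of this phase with Weka's IBk classifier (its k nearest neighbor implementation) is shown below; the file name diabetes.arff, the random seed and the choice k = 3 are assumptions, and the evaluation mirrors the 10-fold cross-validation reported in the Results section.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainKnn {
    public static void main(String[] args) throws Exception {
        // Load the (already cleaned) diabetes dataset; the file name is an assumption.
        Instances data = DataSource.read("diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // IBk is Weka's k nearest neighbor classifier; k = 3 here, changed to 7 for the second run.
        IBk knn = new IBk(3);

        // 10-fold cross-validation, as used for the confusion matrices in the Results section.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());   // accuracy, error rate, etc.
        System.out.println(eval.toMatrixString());    // confusion matrix
    }
}
```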
Testing Phase:
The distance between the unknown target data and each known instance of the diabetes dataset is found using the Euclidean distance measure. The computed distances are sorted and the closest instances are selected as per the parameter k. The majority of their class labels is assigned to the target variable, which predicts whether the person has diabetes or not. If the predicted value is 1 (tested positive) the person is diabetic; if the value is 0 (tested negative) the person is non-diabetic.
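As a sketch of this prediction step (assuming the Weka 3.7+ API, where DenseInstance is available, and an arbitrary example record whose attribute values are purely illustrative), a classifier built on the full dataset can label a single user-entered record as follows.

```java
import weka.classifiers.lazy.IBk;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictTarget {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("diabetes.arff");   // assumed file name
        train.setClassIndex(train.numAttributes() - 1);

        IBk knn = new IBk(3);
        knn.buildClassifier(train);

        // One target record entered by the user (illustrative numbers only):
        // pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, pedigree, age.
        double[] values = {2, 120, 70, 30, 80, 32.0, 0.45, 35};
        Instance target = new DenseInstance(train.numAttributes());
        target.setDataset(train);                              // attach the attribute information
        for (int i = 0; i < values.length; i++) {
            target.setValue(i, values[i]);
        }

        double predicted = knn.classifyInstance(target);       // index of the predicted class
        System.out.println("Prediction: " + train.classAttribute().value((int) predicted));
    }
}
```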
RESULTS:
The code is implemented in Java and Weka. The outputs obtained when the KNN classification algorithm is applied are as follows.
When k=3, Cross-validation Folds=10
Table 1: Confusion matrix when k=3
   a      b      <-- classified as
  410     90     a = tested negative
  120    148     b = tested positive
Table 2: Classifier output when k=3
  Parameter                                Value
  TP rate                                  0.820
  FP rate                                  0.448
  Precision                                0.774
  Accuracy (%)                             72.6563
  Time taken to build model (seconds)      0
When k=7, Cross-validation Folds=10
Table 3: Confusion matrix when k=7
   a      b      <-- classified as
  428     72     a = tested negative
  122    146     b = tested positive
Table 4: Classifier output when k=7
  Parameter                                Value
  TP rate                                  0.856
  FP rate                                  0.455
  Precision                                0.778
  Accuracy (%)                             74.7396
  Time taken to build model (seconds)      0
The classified dataset is now used as the training dataset, and the values collected from the patient form the target data. The KNN algorithm is then applied to these data.
Fig4. Target data entered by the user
The distance between the training data and the target data is found. This happens in the background, hidden from the user.
Fig5. Distance between the target and the training data set.
The prediction result presented to the user is shown in Fig 6 below.
Fig6: The result predicted using KNN classifier.
The figures below show the prediction results for different values of k. When k=3, the three nearest neighbors of the target data are selected, and their class label values are analysed to find the majority class label.
Fig7 prediction result when k=3
When k=7, the seven nearest neighbors are found and the majority class label of those neighbors is used to predict the result.
Fig8 prediction result when k=7
DISCUSSION:
The results in Tables 2 and 4 show that when the value of k is increased, the accuracy of the classifier also increases. Figures 7 and 8 show that in both cases the predicted results are the same, with the same execution time: the K nearest neighbor classifier takes the majority of the class labels as the target data value. The results predicted for k=3 and k=7 show that the majority class label is the same in both cases, which demonstrates the consistency of the K nearest neighbor classifier.
CONFLICT OF INTEREST:
The authors declare no conflict of interest.
REFERENCES:
1. American Diabetes Association. Classification and Diagnosis of Diabetes. Diabetes Care, January 2016; Volume 39, Supplement 1.
2. M. Renuka Devi, J. Maria Shyla. Analysis of Various Data Mining Techniques to Predict Diabetes Mellitus. International Journal of Applied Engineering Research, 2016; ISSN 0973-4562, pp. 727-730.
3. Aishwarya Iyer, S. Jeyalatha, Ronak Sumbaly. Diagnosis of Diabetes Using Classification Mining Techniques. International Journal of Data Mining & Knowledge Management Process, Jan 2015; Vol. 5, No. 1.
4. K Nearest Neighbors Algorithm. Wikipedia, Apr 2015.
5. VelidePhani Kumar, Lakshmi Valide. A Data Mining Approach for Prediction and Treatment of Diabetes Disease. International Journal of Science Inventions Today, 2014; ISSN 2319-5436.
6. Krati Saxena, Zubair Khan, Shefali Singh. Diagnosis of Diabetes Mellitus Using K Nearest Neighbor Algorithm. International Journal of Computer Science Trends and Technology, July-Aug 2014.
7. Divya Tomar, Sonali Agarwal. A Survey on Data Mining Approaches for Healthcare. International Journal of Bio-Science and Bio-Technology, 2013.
8. Asha Gowda Karegowda, M. A. Jayaram, A. S. Manjunath. Cascading K-Means Clustering and K Nearest Neighbor Classifier for Categorization of Diabetic Patients. International Journal of Engineering and Advanced Technology, Feb 2012; ISSN 2249-8958, Volume 1, Issue 3.
9. Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques, Third Edition. Morgan Kaufmann Publishers, July 2011.
10. https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Received on 17.02.2017 Modified on 30.03.2017
Accepted on 10.04.2017 © RJPT All rights reserved
Research J. Pharm. and Tech. 2017; 10(4): 1098-1104.
DOI: 10.5958/0974-360X.2017.00199.8