Heart Disease Classification Using Deep Neural Network with SMOTE Technique for Balancing Data

Heart disease is the leading cause of premature death worldwide. According to the WHO, heart disease causes about 30% of the total 58 million deaths and mainly occurs in individuals of productive age. Several studies have been conducted to anticipate heart disease. Various algorithms, methods


Introduction
The heart is an organ that plays an essential role in human life: it pumps blood to transport oxygen and nutrients throughout the body. When the heart is damaged, the function of other organs is affected. Damage to the heart can take the form of abnormalities in the heart valves, the coronary arteries, or the heart muscle [1]. This is called heart disease. Heart disease is a non-contagious disease and the leading cause of disability and premature death worldwide. It occurs due to disturbances in the performance of the heart and blood vessels. About 44% of cases are coronary heart disease, and the rest come from various other types of disease [2].
According to the WHO [3], in 2005 heart disease caused about 30% of a total of about 58 million deaths, and this figure was expected to increase by about 17% between 2006 and 2015. Most deaths caused by heart disease occur in individuals under the age of 70 who are still in their productive years. Many people do not realize that they have heart disease because they lack knowledge about the condition of their heart. This can happen to anyone, including individuals who show no symptoms of heart disease [1]. Various studies have identified several risk factors associated with heart disease, including age, gender, high blood pressure, obesity, peripheral arterial disease, socioeconomic status, and diabetes mellitus. In Indonesia, the main risk factors for heart disease are high blood pressure, emotional and mental health problems, and diabetes mellitus [4]. Socioeconomic impacts such as increased treatment costs, long treatment duration, and the additional examinations required during treatment make prevention through early detection and control indispensable [2].
One way to perform early detection of heart disease is to apply deep learning. Deep learning has developed rapidly in recent years because its algorithms can learn complex features, recognize more complicated and abstract patterns, and overcome problems that are difficult to solve with traditional methods [5]. Classification, computer vision, and pattern recognition often use deep learning algorithms [6]. Pooja Rani, Rajneesh Kumar, and Anurag Jain built an intelligent system for diagnosing heart disease using the Deep Neural Network algorithm. Their research focuses on a Regularized Deep Neural Network (Reg-DNN) that uses dropout and L2 regularization. It achieved a best accuracy of 94.79% [7].
Wiharto, Esti Suryani, Sigit Setyawan, and Bintang Pe Putra conducted research using the Z-Alizadeh Sani dataset, which was accessed online. Their research uses a Deep Neural Network with a feature selection model that takes the examination cost into account. Using five features, they obtained an AUC of 93.7%, an accuracy of 87.7%, and a sensitivity of 87.7% [8]. Furthermore, in 2021, research was conducted on predicting heart disease by combining embedded feature selection, LinearSVC, and Deep Neural Network methods, using a dataset sourced from Kaggle containing 14 attributes and 1025 records. The designed model produced an accuracy of 98.56%, a recall of 99.35%, a precision of 97.84%, an F1-score of 0.983, and an AUC of 0.983 [9].
Based on these studies, early detection of heart disease using a deep neural network is proposed. This study compares the performance of the Deep Neural Network algorithm before and after applying SMOTE in classifying heart disease. The SMOTE technique is expected to overcome data imbalance and produce the best accuracy for early detection of heart disease.

Research Flow
This research is carried out in several stages, which serve as the research steps so that the work proceeds according to its initial objectives. It compares the performance of the Deep Neural Network algorithm before and after optimization using SMOTE. The research flow design can be seen in Figure 1.
In Figure 1, there are eight stages carried out by this research, starting from data preparation, data preprocessing, EDA (Exploratory Data Analysis), splitting data into training data and test data, applying SMOTE to balance data, normalization, data classification using Deep Neural Network algorithm, and performance evaluation.

Data Preparation
At this stage, the Cleveland dataset was retrieved from the machine learning repository at the University of California, Irvine (UCI) [10]. It comprises 76 attributes, but related studies commonly use 14 of them to classify heart disease [11]. This study used 14 attributes: age, gender, chest pain type, patient's blood pressure, cholesterol level, maximum heart rate, fasting blood sugar, exercise-induced angina, resting ECG, ST depression, ST slope, thalassemia, number of blood vessels stained by fluoroscopy, and target [12]. Detailed information about the dataset characteristics is given in Table 1.

Data Preprocessing
The data preprocessing stage cleans the data of missing values, separates features and targets, and maps the values of the target variable y. Data cleaning is carried out because some rows have missing values. Removing missing values can improve model accuracy and prevent bias in analysis [13]. Therefore, rows containing at least one NaN value are removed from the DataFrame, and the clean data is stored in a new DataFrame.
The next stage separates features and targets to prepare the data that acts as input (features) and output (targets). The features are the input used to train the model, while the target is the value the model is meant to predict. This separation produces variable X with 13 features as input and variable y with one target as output. The target then goes through value mapping to facilitate the heart disease classification process.
The purpose of value mapping is to reduce the target from 5 classes, namely 0, 1, 2, 3, and 4, to 2 classes, namely 0 for no heart disease and 1 for heart disease. Classes 1, 2, 3, and 4 are merged into class 1 (heart disease), and class 0 remains class 0. This stage is carried out to facilitate the classification of heart disease [14].
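The three preprocessing steps described above (dropping rows with missing values, separating features from the target, and mapping the target to two classes) can be sketched with pandas as follows. This is only an illustration under stated assumptions: the tiny DataFrame and its column names are made up for the example and are not the Cleveland data, and the exact calls used in the study are not given in the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset: column names and values are illustrative only.
df = pd.DataFrame({
    "age":    [63, 67, 41, 56, 57],
    "chol":   [233, 286, 204, np.nan, 354],
    "target": [0, 2, 0, 1, 4],
})

# 1. Drop every row that contains at least one NaN value.
df_clean = df.dropna().reset_index(drop=True)

# 2. Separate the input features (X) from the target (y).
X = df_clean.drop(columns=["target"])
y = df_clean["target"]

# 3. Map classes 1, 2, 3, and 4 to 1 (heart disease); class 0 stays 0.
y_binary = (y > 0).astype(int)
```

With the toy data, the row with the missing cholesterol value is removed, and the remaining targets 0, 2, 0, 4 become 0, 1, 0, 1.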

Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the process of analyzing and understanding data before building a heart disease classification model. It is important in heart disease classification because it provides insight into the relationships between features, data characteristics, distributions, and patterns [15]. The EDA used in this stage includes a univariate analysis of the target variable, which visualizes the proportion of each class in the target variable. This visualization shows how balanced or unbalanced the distribution between the healthy and sick classes is in the dataset. In addition, a correlation matrix analysis between features is used to give a more detailed visual picture of the relationships between features in the dataset.
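The two analyses can be illustrated with a short sketch. The class counts below are the ones reported in the Results section after value mapping; the correlation part uses randomly generated toy variables (an assumption made purely for illustration), since the actual feature values are not reproduced in the text.

```python
import numpy as np

# Univariate analysis: share of each target class, using the reported counts.
class_counts = {0: 160, 1: 137}
total = sum(class_counts.values())
class_share = {label: round(100 * n / total, 1) for label, n in class_counts.items()}

# Multivariate analysis: Pearson correlation matrix on toy variables.
rng = np.random.default_rng(0)
feature_a = rng.normal(size=50)
feature_b = rng.normal(size=50)
toy_target = (feature_a > 0).astype(float)  # toy target driven by feature_a

# np.corrcoef treats each row as one variable.
corr = np.corrcoef(np.vstack([feature_a, feature_b, toy_target]))
```

Here `class_share` evaluates to 53.9% for class 0 and 46.1% for class 1, and each entry of `corr` lies between -1 and 1, with values near 0 indicating no linear relationship.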

Split Data
This stage divides the dataset into subsets. Generally, a dataset is divided into three subsets: training, validation, and test. This research, however, divides the dataset into two subsets: training and test. The data is divided using the train_test_split function from the Scikit-learn (sklearn) library. The ratio is 80:20, where 80% is used to train the model and 20% to test the model's performance. Data splitting is essential in heart disease classification because it allows overfitting to be detected and model performance to be evaluated objectively.
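A minimal sketch of the 80:20 split with train_test_split is shown below. The arrays are random placeholders with the same shape as the cleaned dataset reported in the Results (297 samples, 13 features), and `random_state=42` is an assumption added only to make the example reproducible; the text does not state which seed, if any, was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the shape of the cleaned dataset (not the real values).
rng = np.random.default_rng(0)
X = rng.normal(size=(297, 13))
y = rng.integers(0, 2, size=297)

# 80% of the rows go to training and 20% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # random_state is an assumption
)
```

With 297 samples, this call yields 237 training rows and 60 test rows.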

Synthetic Minority Over-sampling Technique
SMOTE is an oversampling technique used to handle class imbalance in datasets. Class imbalance occurs when the number of samples in one class is much smaller than in the other classes. SMOTE works by selecting instances from the minority class and finding the k-nearest neighbors of each instance. It then interpolates between a selected instance and one of its neighbors to generate a synthetic sample. The application of SMOTE can overcome the problem of overfitting and help improve model accuracy [16]. In this research, SMOTE is part of data preprocessing and is carried out after the division into training and test data. The following is the formula of SMOTE:
$$x_{syn} = x_i + (x_{knn} - x_i) \times \delta$$
Notes:
$x_{syn}$ = Synthetic data to be created
$x_i$ = Data to be replicated
$x_{knn}$ = Neighboring data point of $x_i$ (one of its k nearest neighbors)
$\delta$ = Random number between 0 and 1
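The interpolation step that SMOTE performs can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the implementation used in this study (which the text does not specify); the function name `smote_sketch`, the toy minority samples, `k=3`, and the seed are all assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority
    samples and their k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                 # sample to be replicated
        nn = neighbors[i, rng.integers(k)]  # one of its k nearest neighbors
        delta = rng.random()                # random number in [0, 1)
        synthetic[j] = X_min[i] + (X_min[nn] - X_min[i]) * delta
    return synthetic

# Toy minority-class samples (illustrative values only).
X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.9]])
X_synthetic = smote_sketch(X_minority, n_new=4, k=3)
```

Because every synthetic point lies on the line segment between two existing minority samples, the generated data stays inside the region already occupied by the minority class.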

Normalization
Data normalization is an essential step in data preprocessing before training the model. One commonly used normalization method is StandardScaler. StandardScaler comes from the Scikit-learn library and rescales each feature in a dataset by subtracting its mean and then scaling it to unit variance. In this research, StandardScaler is used to reduce the influence of outliers and improve the consistency of the feature scale after applying the SMOTE technique. Here is the formula for StandardScaler normalization:
$$z = \frac{x - \mu}{\sigma}$$
Notes:
$z$ = Standardized value
$x$ = Original feature value
$\mu$ = Mean of the feature
$\sigma$ = Standard deviation of the feature
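The standardization performed by StandardScaler can be reproduced by hand to show what happens to each feature. The sketch below uses plain NumPy rather than the Scikit-learn class; the helper names and the toy values are assumptions. The mean and standard deviation are computed on the training data only and then reused for the test data.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return the per-feature mean and (population) standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # ddof=0, the same convention StandardScaler uses
    std[std == 0] = 1.0        # guard against constant features
    return mean, std

def standardize(X, mean, std):
    """Subtract the mean and scale to unit variance."""
    return (X - mean) / std

# Toy values (illustrative only): two features on different scales.
X_train = np.array([[120.0, 200.0], [140.0, 240.0], [160.0, 280.0]])
X_test = np.array([[150.0, 260.0]])

mean, std = fit_standardizer(X_train)
X_train_z = standardize(X_train, mean, std)
X_test_z = standardize(X_test, mean, std)
```

After the transformation, each training feature has mean 0 and standard deviation 1, so features measured in different units contribute on a comparable scale.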

Classification using Deep Neural Network
A Deep Neural Network (DNN) is a type of neural network architecture that consists of many interconnected layers. DNN training involves feedforward and backpropagation processes. The DNN structure can be divided into three kinds of layers: input, hidden, and output. Each hidden layer has interconnected neurons [9]. The DNN structure diagram in this study is shown in Figure 2. Experiments were conducted to find the best model for heart disease classification, and the best model was then used to perform classification. The model consists of an input layer with 64 neurons, five hidden layers with different numbers of neurons, a dropout layer with a rate of 0.5, and an output layer with one neuron and a sigmoid activation function for binary classification. Each layer before the output layer uses a ReLU activation function and a kernel regularizer with L2 regularization to prevent overfitting.
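The feedforward pass through such a network can be sketched in NumPy as follows. This is a simplified illustration, not the model used in this study: the hidden layer sizes, the random weights, and the L2 coefficient are hypothetical (the text only states that the hidden layers have different numbers of neurons), and the "input layer with 64 neurons" is interpreted here as a first dense layer of 64 units receiving the 13 features. Dropout is only active during training, so it does not appear in this inference-time sketch.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 13 features -> 64 units -> five hidden layers -> 1 sigmoid output.
# The five hidden sizes (128, 64, 32, 16, 8) are hypothetical.
layer_sizes = [13, 64, 128, 64, 32, 16, 8, 1]

rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(X):
    """Feedforward pass: ReLU on every layer except the sigmoid output."""
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)
    return sigmoid(a @ weights[-1] + biases[-1])

def l2_penalty(lam=0.01):
    """L2 term added to the loss; lam is a hypothetical coefficient."""
    return lam * sum((W ** 2).sum() for W in weights)

probs = forward(rng.normal(size=(4, 13)))  # probability of heart disease
preds = (probs >= 0.5).astype(int)         # threshold at 0.5
```

The sigmoid output lies strictly between 0 and 1, which is why a single output neuron is sufficient for the two-class problem.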
The Adam optimizer (Adaptive Moment Estimation) is an optimization method that combines the Momentum and RMSprop algorithms. Adam extends the SGD method with adaptive estimation of the first-order and second-order moments of the gradient [17] [18]. This study implemented a callback function to monitor the val_loss metric during training. If val_loss does not decrease for ten consecutive epochs, training is stopped and the model weights are restored to the best weights. Details of the parameters used for the Deep Neural Network model are shown in Table 2.
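The stopping rule applied by the callback can be illustrated with a small pure-Python simulation. The validation loss curve below is hypothetical and was chosen only so that the loss improves for ten epochs and then stops improving; the function name is likewise an assumption.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (epoch at which training stops, epoch whose weights are restored)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical curve: val_loss falls for 10 epochs, then plateaus at a higher value.
val_losses = [1.0 - 0.02 * e for e in range(1, 11)] + [0.85] * 40

stopped_epoch, restored_epoch = early_stopping_epoch(val_losses, patience=10)
```

With this curve and a patience of ten, training stops at epoch 20 of 50 and the weights from epoch 10 are restored.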

Evaluation
Model evaluation is the process of measuring the performance of a trained model. This is important for understanding how accurately the model can perform classification. Meanwhile, the confusion matrix is a table commonly used to describe model performance by

Results
This research uses the Cleveland dataset, which consists of 14 attributes and 303 samples. Analysis of the dataset during preprocessing found six rows containing missing values. After these rows were removed, the clean data amounted to 297 samples and 14 attributes. The feature and target separation stage resulted in an X variable of 13 features that acts as input to train the model and a y variable of 1 target. The details of the y variable are shown in Table 3. Variable y contains five target classes, 0, 1, 2, 3, and 4, and then goes through the value mapping process, which reduces the target from 5 classes to 2 classes, namely 0 and 1. The change in variable y is shown in Figure 3. The bar chart shows that target 0 has 160 samples and target 1 has 137 samples.
The EDA stage consists of two types of analysis: univariate and multivariate. The univariate diagram shows that the "no heart disease" class makes up 53.9% of the target data, while the "heart disease" class makes up 46.1%. It can be concluded that the dataset is not balanced.
Multivariate analysis using the Pearson correlation matrix aims to find the correlation between variables. The Pearson correlation matrix is shown in Figure 5. A value close to 0 indicates no linear relationship between two variables. The feature with a low correlation with the target feature is "fbs" (fasting blood sugar). Two features are close to the value of 1, namely "ca" (number of major vessels) and "thal" (thalassemia). These two features show a strong positive linear relationship with the target feature.
The data was divided with a ratio of 80% as the training subset and 20% as the testing subset. This stage produces four variables: X_train is the training subset of the feature matrix, X_test is the testing subset of the feature matrix, y_train is the training subset of the target, and y_test is the testing subset of the target.
This research applies the SMOTE technique to overcome the imbalance in the dataset. Applying SMOTE increased the number of samples with target value "1" from 137 to match the number with target value "0", which is 160. This change is shown in Table 4. After applying the SMOTE technique, the data becomes balanced, and the distribution of feature values may change. Therefore, a normalization stage using StandardScaler was performed to ensure that the scale of the features in the synthetic and original data is consistent.
Table 4. Target data after SMOTE
Target | Number of samples
0 | 160
1 | 160
The heart disease data that has gone through the SMOTE stage is then used in the heart disease classification process using the Deep Neural Network algorithm, with the best model determined from several previous experiments. The model was chosen based on the accuracy it produced when classifying heart disease. This is compared with the classification of heart disease using a Deep Neural Network without SMOTE. The comparison before and after SMOTE can be seen in Table 5.
Based on the comparison in Table 5, val_accuracy increased by 3.33 percentage points after SMOTE, from 86.67% to 90%. However, val_loss also increased by 0.1298, from 0.7710 to 0.9008. This may occur because applying SMOTE increases the number of samples in the minority class by generating synthetic samples. As a result, the model can become more complex: val_accuracy increases because the model learns the minority class better, while val_loss increases as the model becomes more complex.
In the heart disease classification stage using the Deep Neural Network algorithm without SMOTE, training stops at epoch 20 of 50. This is because the callback function monitoring val_loss detected no decrease for ten consecutive epochs, so the weights from the 10th epoch were restored as the best result.
The final stage of this research is to evaluate the model using a confusion matrix and three evaluation metrics: precision, recall, and f1-score. The purpose of the confusion matrix is to provide insight into the model's performance in correctly or incorrectly classifying data within each class. The confusion matrices of the two models are shown in Figures 8 and 9. Based on Figure 8, the DNN model with SMOTE correctly classifies 32 patients as positive and 22 as negative. In Figure 9, the model without SMOTE correctly classifies 33 patients as positive and 19 as negative.
From the confusion matrix, three evaluation metrics can be calculated, namely precision, recall, and f1-score, to better understand the performance of the Deep Neural Network model in classifying heart disease. The results of the evaluation metrics of both models are shown in Table 6. Based on the evaluation metrics of the two models, the Deep Neural Network model with SMOTE produces a recall of 0.92, which is suitable for detecting positive cases of heart disease. Recall increased by 0.13 when the SMOTE technique was applied. Recall, or sensitivity, measures the ability of the model to correctly detect actual positive cases of heart disease. Recall is therefore essential in heart disease classification, because a high recall minimizes the possibility of missed positive cases.
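The way these metrics follow from confusion matrix counts can be shown with a short worked example. The text reports only the correctly classified counts (32 and 22 for the SMOTE model), so the misclassified counts below are assumptions, chosen so that the results agree with the reported accuracy, precision, recall, and f1-score; the example also assumes that the class with 22 correct predictions is the heart disease class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and f1-score from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# tp and tn come from the text; fp=4 and fn=2 are assumed values
# that are consistent with the reported metrics.
accuracy, precision, recall, f1 = classification_metrics(tp=22, fp=4, fn=2, tn=32)
```

Under these assumptions, the function gives an accuracy of 0.90, a precision of about 0.85, a recall of about 0.92, and an f1-score of 0.88.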

Discussion
Based on the results obtained, the validation accuracy of the proposed method reaches 90%, which can be compared with several previous studies. Although studies [7] and [19] achieved an accuracy about 4% higher than this study, this study improved its accuracy by 3.33% compared with the model before SMOTE was applied to overcome data imbalance. On the other hand, compared to [8], this study achieved higher accuracy and recall, by about 2.3%, using a different dataset. With these promising results, the proposed method can be further developed and used in heart disease classification. In addition, this research focuses on heart disease classification with data imbalance handled using SMOTE. Further development could use other data balancing techniques such as ADASYN, undersampling, or a combination of techniques to improve model performance in classifying minority classes.

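The following is the Adam optimization calculation formula:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$
Notes:
$\theta_{t+1}$ = Parameter update result
$\theta_t$ = Previous parameter update result
$\eta$ = Learning rate
$\hat{v}_t$ = Bias-corrected estimate of the second-order moment (moving average of squared gradients)
$\epsilon$ = Small scalar to prevent division by zero
$\hat{m}_t$ = Bias-corrected estimate of the first-order moment (moving average of gradients)
The Early Stopping technique aims to prevent the model from learning the training data too closely, which can lead to poor performance, and to stop the model training process early if there are signs of overfitting in the validation data.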
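The Adam update can be illustrated with a single step on a toy objective. The sketch below is a minimal NumPy version under stated assumptions: the decay rates of 0.9 and 0.999 and the small constant 1e-8 are commonly used defaults rather than values taken from this study, and the learning rate of 0.1 and the objective f(θ) = θ² are chosen only for the example.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameters and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment: average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

grad = 2 * theta
theta_1, m, v = adam_step(theta, grad, m, v, t=1, lr=0.1)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so the parameter moves by almost exactly the learning rate, from 1.0 to about 0.9, regardless of the gradient's magnitude.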

Figure 2. Diagram Structure of the Deep Neural Network

Figure 3. Visualization of y variable after value mapping

Table 2. Deep Neural Network model parameters

Table 3. Target Number of Each Class in Variable y

Table 5. Comparison of Accuracy Results

Table 6. Comparison of Evaluation Metrics

Conclusion
This research classified heart disease by applying the SMOTE technique to the Deep Neural Network algorithm. Before SMOTE was applied, the Deep Neural Network model produced an accuracy of 86.67% with a precision of 0.86, a recall of 0.79, and an f1-score of 0.83. Applying the SMOTE technique balanced the data and improved the accuracy of the Deep Neural Network model. Seven trials were conducted, and the best results were obtained in the 4th trial, with 90% accuracy, a precision of 0.85, a recall of 0.92, and an f1-score of 0.88. Based on these results, the Deep Neural Network algorithm with SMOTE is superior to the algorithm without SMOTE. It is hoped that this research will help future research better understand the advantages of SMOTE and deep neural networks in heart disease classification tasks.