Stroke Classification Comparison with KNN through Standardization and Normalization Techniques

This study explores the impact of z-score standardization and min-max normalization on K-Nearest Neighbors (KNN) classification of stroke. Focusing on the diverse scales of the health attributes in the stroke dataset, the research aims to improve the accuracy and reliability of the classification model. Preprocessing is carried out under three conditions: z-score standardization, min-max normalization, and no data scaling. The KNN model is trained and evaluated under each condition. The results reveal comparable performance between z-score standardization and min-max normalization, with slight variations across data split ratios. Both z-score and min-max achieve a peak accuracy of 95.07%, while normalization attains a slightly higher average accuracy (94.25%) than standardization (94.21%). These findings highlight the critical role of data scaling in robust machine learning performance and informed health decision-making.


Introduction
The implementation of AI and ML in the medical field holds revolutionary potential, enhancing the accuracy of diagnosis, treatment planning, and patient monitoring [1]. Their ability to process vast amounts of medical data enables the development of innovative diagnostic tools and treatment plans that can improve patient outcomes, identify individual risks, and personalize care [2]. In the context of classifying stroke datasets, data standardization and normalization are particularly important. Both are crucial data preprocessing steps that play a major role in improving the performance of classification models. Standardization transforms the attributes in the dataset onto a uniform scale, ensuring that each variable has an equal impact on the classification process [3]. Normalization, meanwhile, adjusts attribute values into a more controlled range, minimizing the impact of outliers and improving the distribution of the data [4].
In this study, the focus on data standardization and normalization aims to enhance the accuracy and reliability of classification models, particularly machine learning algorithms such as KNN, in identifying stroke risk patterns. Both processes are crucial because the health attributes in the stroke dataset exhibit diverse scales. By preventing large-scale attributes from dominating, standardization and normalization ensure a balanced contribution from each attribute, enabling the model to provide more accurate and consistent predictions. Through the analysis of their positive impact, this research highlights the vital role of standardization and normalization in improving prediction accuracy, providing a reliable foundation for result interpretation, and supporting more precise health decision-making.
Previous research related to the topic of this study, the evaluation of stroke classification with KNN through standardization and normalization techniques, offers several findings. The study titled 'Analysis of the Influence of Data Scaling on the Performance of Machine Learning Algorithms for Plant Identification' [5] indicates that the difference in accuracy and recall between standardization and normalization is not significant for the KNN algorithm: standardization achieves an accuracy of 76%, while normalization is slightly higher at 77.33%.
A similar comparison of data scaling, specifically standardization and normalization, appears in a journal on the comparison of data normalization methods for wine classification using the K-NN algorithm [6]. The accuracy with Min-Max normalization is 57.41%, while with Z-score standardization it is 56.40%.
Another related journal compares Min-Max normalization with Z-Score standardization to test classification accuracy for breast cancer types using the KNN algorithm [7]. The accuracy obtained is 97% with standardization and 98% with Min-Max normalization. The present study focuses on comparing the performance of the K-Nearest Neighbors (KNN) model in stroke classification. The process begins with data acquisition, followed by data preprocessing. Subsequently, the dataset is divided into training and testing data. The training data undergoes three different preprocessing conditions: Z-score standardization, Min-Max normalization, and no preprocessing (raw data). The KNN model is then trained on the preprocessed training data to identify patterns in stroke detection. Its performance is evaluated on the testing data using standard metrics such as accuracy, precision, and recall. The research aims to provide a comprehensive understanding of the impact of data preprocessing on model performance in the context of stroke classification using KNN.

Data Acquisition
The dataset used in this study, sourced from Kaggle.com ('stroke prediction'), contains 5110 rows and eleven columns. The dataset for stroke classification encompasses several attributes. 'Id' serves as a unique identifier for each record. Gender information is provided in the 'Gender' column, categorized as 'Male,' 'Female,' or 'Other.' The 'Age' column denotes the patients' ages, while 'Hypertension' and 'Heart_Disease' are binary attributes indicating specific health conditions. Marital status is recorded in the 'Ever_married' column, and employment type is specified in 'Work_type.' The 'Residence_type' column distinguishes between rural and urban residences. Health metrics include 'Avg_glucose' for blood sugar levels and 'Bmi' for Body Mass Index. Smoking habits are described in the 'Smoking_status' column, and the 'Stroke' column indicates the stroke status (label).

Data Pre-processing
In the data preprocessing for the stroke dataset, crucial steps are undertaken to ensure the quality of the data used in stroke classification. First, irrelevant attributes, such as identification numbers, are removed to simplify the dataset and focus on attributes that have a significant impact on stroke risk, such as age, blood pressure, and smoking history. Next, handling missing values becomes a primary focus: records with missing values are deleted from the dataset, since their presence could affect the quality and integrity of the stroke classification analysis. Finally, label encoding is performed to transform categorical data into numeric form. This allows machine learning algorithms to more effectively understand and analyze categorical variables in the dataset, ensuring accurate representation in the model. With label encoding, data analysis and modeling become more efficient, preparing the dataset optimally for model training and prediction.
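The three preprocessing steps described above (dropping the identifier, deleting records with missing values, and label encoding) can be sketched as follows. This is an illustrative pure-Python sketch on a toy record list, not the authors' code; the helper functions and toy values are assumptions.

```python
# Illustrative sketch of the preprocessing steps: drop the 'id' column,
# delete records with missing values, and label-encode a categorical
# column. Toy data; column names follow the stroke dataset.

def drop_missing(rows, key):
    """Remove records whose value for `key` is missing (None)."""
    return [r for r in rows if r.get(key) is not None]

def label_encode(rows, key):
    """Map each distinct categorical value of `key` to an integer."""
    mapping = {v: i for i, v in enumerate(sorted({r[key] for r in rows}))}
    for r in rows:
        r[key] = mapping[r[key]]
    return mapping

records = [
    {"id": 1, "gender": "Male",   "bmi": 28.1, "stroke": 0},
    {"id": 2, "gender": "Female", "bmi": None, "stroke": 1},
    {"id": 3, "gender": "Female", "bmi": 24.3, "stroke": 0},
]

for r in records:                          # drop the irrelevant identifier
    r.pop("id")
records = drop_missing(records, "bmi")     # delete rows with a missing BMI
label_encode(records, "gender")            # categorical -> numeric
print(records)
```

In practice the same steps are commonly done with pandas and scikit-learn's `LabelEncoder`, but the logic is the same.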

Outlier Handling
Identifying and addressing outliers is crucial in machine learning and predictive modeling. Doing so helps reduce noise, detect erroneous records, and prevent overfitting, providing clearer insight into patterns and trends in the data. Handling outliers is necessary in model development to enhance reliability and accuracy. This study uses the Interquartile Range (IQR) method to identify outliers in the relevant attributes. The first quartile (q1) and third quartile (q3) are computed, and the IQR is taken as the difference q3 − q1 [8]. Outliers are then identified using a lower bound of q1 − 1.5 × IQR and an upper bound of q3 + 1.5 × IQR [9].
Outlier handling is performed on the numerical attributes 'bmi' and 'avg_glucose_level' using the IQR method. Boxplot visualization is used for outlier identification, and the outliers are removed from the dataset to ensure data integrity and quality. This process enhances the reliability of analysis and modeling by eliminating potentially disruptive data. After outlier handling, the dataset is reduced from 4909 to 4260 rows.
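The IQR rule above can be sketched in a few lines using the standard library's `statistics` module; the sample values are illustrative, not the paper's data.

```python
# Sketch of the IQR outlier rule: a point is an outlier if it falls
# outside [q1 - 1.5*IQR, q3 + 1.5*IQR].
import statistics

def iqr_filter(values):
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lower <= v <= upper]

bmi = [22.0, 24.5, 26.1, 27.3, 29.8, 31.0, 75.0]    # 75.0 is an outlier
print(iqr_filter(bmi))
```

With these values, q1 = 24.5 and q3 = 31.0, so the bounds are 14.75 and 40.75 and the extreme BMI of 75.0 is removed.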

Data Splitting
In the preprocessing stage, the dataset is divided into training and testing subsets using three ratio variations: 90%:10%, 80%:20%, and 70%:30%. This tests the stroke classification model objectively on data not used during training, preventing overfitting and ensuring the model's generalizability. The split is performed randomly so that the data are represented fairly in both training and model validation.
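A random split at the three ratios used in the study can be sketched as below; the index list stands in for dataset rows, and the fixed seed is an assumption for reproducibility.

```python
# Sketch of a random train/test split at the 90:10, 80:20, and 70:30
# ratios; integers stand in for dataset rows.
import random

def split(data, test_fraction, seed=42):
    rng = random.Random(seed)
    shuffled = data[:]                    # copy, keep the original order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
for frac in (0.10, 0.20, 0.30):
    train, test = split(rows, frac)
    print(len(train), len(test))
```

In practice, `train_test_split` from scikit-learn performs the same job with stratification options.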

Imbalanced Data Handling
The research employs the SMOTE technique to address imbalanced data, a method effective in classification tasks. SMOTE creates synthetic samples of the minority class by selecting reference points and generating new samples through a formula incorporating a parameter δ, allowing the synthetic results to be adjusted to the dataset's characteristics [10]. Applying SMOTE increases the number of stroke samples from 3408 to 6610 in the training data.
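The interpolation step at the heart of SMOTE can be sketched as below: a synthetic point is placed between a minority sample and one of its nearest minority neighbors, x_new = x + δ·(neighbor − x) with δ drawn from [0, 1]. This is a simplified illustration of the idea, not the imbalanced-learn implementation the study may have used; the toy points are assumptions.

```python
# Minimal sketch of SMOTE's synthetic-sample generation: interpolate
# between a minority point and a nearby minority neighbor.
import math
import random

def nearest_neighbor(point, others):
    """Closest minority point by Euclidean distance."""
    return min(others, key=lambda o: math.dist(point, o))

def smote_sample(minority, rng):
    x = rng.choice(minority)
    neighbor = nearest_neighbor(x, [m for m in minority if m is not x])
    delta = rng.random()                   # the parameter delta in [0, 1]
    return tuple(xi + delta * (ni - xi) for xi, ni in zip(x, neighbor))

rng = random.Random(0)
minority = [(1.0, 1.0), (1.2, 0.9), (2.0, 1.5)]
synthetic = [smote_sample(minority, rng) for _ in range(3)]
print(synthetic)
```

Each synthetic point lies on the segment between two existing minority points, so the new samples stay inside the minority region rather than duplicating records.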

Min Max Normalization
Min-Max normalization transforms the range of data values to between 0 and 1 [6]. Its primary objective is to ensure that all attributes share a uniform scale, preventing attributes with large scales from dominating the others. By rescaling the data value range, the interpretation of analysis results becomes more consistent and the performance of classification models can be enhanced [11], especially in the task of stroke risk classification. Min-Max normalization is applied to each data value (xi) using the formula:

x′ = (xi − min(x)) / (max(x) − min(x))

where x′ is the normalized value of the original observation (xi), and min(x) and max(x) are the minimum and maximum values across all the data.
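The formula above translates directly into code; the glucose values below are illustrative, not the dataset's.

```python
# Sketch of Min-Max normalization:
# x' = (x_i - min(x)) / (max(x) - min(x)), mapping values into [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

glucose = [55.0, 90.0, 125.0, 160.0]
print(min_max(glucose))   # the minimum maps to 0.0, the maximum to 1.0
```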

Z-Score Standardization
Z-Score standardization uses the mean and standard deviation of each feature attribute to transform the scale of the data values. This procedure is applied to reduce the impact of outliers and ensure that each attribute has a consistent scale. The main goal of Z-Score standardization is to enhance the stability of analysis results and improve the consistency of data interpretation [12].
The Z-Score standardization process is applied to each data value (xi) using the formula:

z = (xi − μ) / σ

The z-score formula standardizes data by transforming its distribution to a mean of 0 and a standard deviation of 1, facilitating comparisons. It subtracts the population mean (μ) from the observation value and divides by the population standard deviation (σ). The resulting Z score indicates the value's distance from the mean in standard deviation units, which is crucial for statistical analysis and machine learning.
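The z-score transformation can be sketched with the standard library's `statistics` module; the age values are illustrative.

```python
# Sketch of Z-score standardization: z = (x_i - mu) / sigma,
# giving the transformed data mean 0 and standard deviation 1.
import statistics

def z_score(values):
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)     # population standard deviation
    return [(v - mu) / sigma for v in values]

ages = [30.0, 45.0, 60.0, 75.0]
z = z_score(ages)
print(z)
print(round(statistics.mean(z), 10), round(statistics.pstdev(z), 10))
```

After the transformation, the mean of the values is 0 and the standard deviation is 1, as the paragraph above states.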

KNN
The K-Nearest Neighbors (KNN) algorithm falls into the category of supervised learning, classifying data based on their proximity or distance to other data points.In the implementation of KNN, the Euclidean distance formula is commonly used to measure the proximity between training and test data points [13].
d = √( ∑_{i=1}^{n} (xi − yi)² )    (5)

The K-Nearest Neighbors (KNN) algorithm employs this Euclidean formula, where d represents the distance between a training and a test data point, xi is the training data, yi is the test data, n is the data dimension, and i indexes the variables. KNN's operation involves initializing K, calculating distances, sorting them, selecting the nearest K neighbors, applying a majority vote, and predicting the category. This method focuses on the relationships among data points in feature space, with the value of K determining the number of nearest neighbors considered, a critical factor in stroke classification [14]. For model development, cross-validation is used to find the optimal K: the model is trained and evaluated on subsets, with accuracy recorded as the evaluation metric, and the optimal K is selected based on the average metric across all cross-validation iterations, ensuring the best choice for stroke classification.
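The distance-sort-vote procedure described above can be sketched as a minimal KNN classifier; the toy points and labels are assumptions, not the study's data.

```python
# Minimal KNN sketch: compute Euclidean distances, sort them, take the
# K nearest training points, and apply a majority vote.
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.5, 1.2), (6.0, 6.0), (6.5, 5.8)]
train_y = [0, 0, 1, 1]
print(knn_predict(train_X, train_y, (1.2, 1.1), k=2))   # -> 0
```

Selecting k by cross-validation, as the study does, amounts to repeating this prediction over held-out folds for each candidate k and keeping the k with the best average accuracy.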

Evaluate metrics
In this study, the evaluation of the classification model relies on the confusion matrix, a tool that summarizes model performance. Its four main components, True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN), facilitate the measurement of accuracy, precision, recall, and F1-score. These metrics offer comprehensive insight into the model's ability to distinguish between classes. The confusion matrix forms the foundation for calculating accuracy, precision, recall, and F1-score, which are essential in evaluating the model's effectiveness and accuracy in class prediction [15].
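The four metrics follow directly from the confusion matrix components; the counts below are illustrative, not the study's results.

```python
# Sketch of the metrics derived from the confusion matrix:
# accuracy, precision, recall, and F1-score.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(metrics(tp=40, fp=5, tn=50, fn=5))
```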

Preprocessing
Several steps are taken in preprocessing the stroke dataset. First, irrelevant attributes such as identification numbers are removed. Second, missing values are addressed by removing the affected records from the dataset. Finally, label encoding is applied to convert categorical data into numerical form. This process enhances the efficiency of data analysis and modeling, preparing the dataset optimally for model training and prediction. In the search for the optimal K value using cross-validation and performance curves, based on the table above and the three split ratios (90%:10%, 80%:20%, and 70%:30%), the best value is found to be k = 2.

KNN using Standardization k = 2
Here is the KNN model evaluation table for Z-Score standardization in each data splitting ratio.

Conclusion
In the data preprocessing stage of this study, two different conditions were applied: the first involved Z-Score standardization of the stroke classification dataset, and the second involved Min-Max normalization. Training was then conducted using the KNN machine learning algorithm. Based on the results and discussion, several conclusions can be drawn: (a) the highest accuracy for both Z-Score standardization and Min-Max normalization is 95.07%, obtained at the 90:10 data splitting ratio; (b) the average accuracy of Z-Score standardization, 94.21%, is lower than the average accuracy of normalization, 94.25%; (c) the optimal K value after cross-validation, compared across each ratio, is k = 2; (d) the comparison between data with Z-Score standardization or Min-Max normalization and data without either reveals clear differences. This underscores the importance of performing data scaling on the dataset to achieve high machine learning performance.

Figure 4. Comparison of the amount of data before and after SMOTE oversampling

Table 1. Split ratios of train data and test data

Table 2. Best k value in each ratio

Table 3. Classification report for KNN using Standardization

Here is the KNN model evaluation table for Min-Max normalization in each data splitting ratio.

Table 4. Classification report for KNN using Normalization

Here is the KNN model evaluation table for data without Z-Score standardization or Min-Max normalization in each data splitting ratio.

Table 5. Classification report for KNN without data scaling