Implementation of K-Means and K-Medians Clustering in Several Countries Based on Global Innovation Index (GII) 2018

The Global Innovation Index (GII) is an instrument for ranking the innovation capabilities of countries. The GII is built on seven pillars: Institutions, Human Capital and Research, Infrastructure, Market Sophistication, Business Sophistication, Knowledge and Technology Outputs, and Creative Outputs. The k-means and k-medians methods are used here to cluster countries based on the GII. With the k-means method, Cluster 1 consists of 48 countries, Cluster 2 consists of 45 countries, and Cluster 3 consists of 33 countries and has the highest average values of the seven variables. With the k-medians method, Cluster 1 consists of 33 countries and has the highest average values of the seven variables, Cluster 2 consists of 53 countries, and Cluster 3 consists of 40 countries. Comparing the two clusterings shows that k-medians performs better than k-means because the variance value of k-medians is smaller than that of k-means.


Methods
Research Variables
This study used secondary data from the World Intellectual Property Organization (WIPO), in coordination with INSEAD and Cornell University. The three institutions measure a country's level of innovation based on seven variables [1]:
• Institutions Variable (1) The Institutions pillar captures the institutional framework of a country. Nurturing an institutional framework that attracts business and fosters growth by providing good governance and the correct levels of protection and incentives is essential to innovation.
• Human Capital and Research Variable (2) The level and standard of education and research activities in a country are prime determinants of the innovation capacity of a nation. This pillar tries to gauge the human capital of countries.
• Infrastructure Variable (3) The Infrastructure pillar includes three sub-pillars: Information and communication technologies (ICTs), General infrastructure, and Ecological sustainability.
• Market Sophistication Variable (4) The availability of credit and an environment that supports investment, access to the international market, competition, and market scale are all critical for businesses to prosper and for innovation to occur.
• Business Sophistication Variable (5) The Business Sophistication pillar tries to capture the level of business sophistication to assess how conducive firms are to innovation activity.
• Knowledge and Technology Outputs Variable (6) This pillar covers all those variables that are traditionally thought to be the fruits of inventions and/or innovations. The first sub-pillar refers to the creation of knowledge. The second sub-pillar, on knowledge impact, includes statistics representing the impact of innovation activities at the micro- and macroeconomic level. The third sub-pillar covers knowledge diffusion.
• Creative Outputs Variable (7) The last pillar, on creative outputs, has three sub-pillars. The first sub-pillar, on intangible assets, includes statistics on trademark applications by residents at the national office. The second sub-pillar covers creative goods and services, and the third sub-pillar covers online creativity.

Stage of Research
a. Cluster Analysis
Clustering can be considered the most important unsupervised learning problem. Like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A cluster is therefore a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters [2].
Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
There are two assumptions that must be fulfilled in cluster analysis: the sample must be representative of the population, and there must be no multicollinearity between variables [3]. A sample is considered representative when the Kaiser-Meyer-Olkin (KMO) value is greater than 0.5 [4]. Multicollinearity between variables is indicated by a Variance Inflation Factor (VIF) greater than 10 [5]:
$$\mathrm{VIF}_k = \frac{1}{1 - R_k^2}$$
where $R_k^2$ is the $R^2$ obtained by regressing the $k$th predictor on the remaining predictors.
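As a minimal sketch of how the two assumption checks could be computed, the following Python code derives both statistics from the correlation matrix of the variables. It relies on two standard identities that are not stated in the text above and are therefore assumptions here: the VIF of variable k equals the k-th diagonal entry of the inverse correlation matrix, and the overall KMO is the sum of squared correlations divided by the sum of squared correlations plus squared partial correlations. The function names are hypothetical, and this is not the R procedure used in the study.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(p)] for i in range(p)]

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    p = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]

def vif(columns):
    """VIF_k = 1 / (1 - R_k^2), read from the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[k][k] for k in range(len(columns))]

def kmo(columns):
    """Overall KMO from squared correlations and squared partial correlations."""
    r = correlation_matrix(columns)
    inv = invert(r)
    p = len(columns)
    sum_r2 = sum_a2 = 0.0
    for i in range(p):
        for j in range(p):
            if i != j:
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                sum_r2 += r[i][j] ** 2
                sum_a2 += partial ** 2
    return sum_r2 / (sum_r2 + sum_a2)
```

Each element of `columns` would hold the scores of one GII pillar across all countries. A VIF above 10 or a KMO below 0.5 would flag a violation of the corresponding assumption.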

b. K-Means Method
Clustering is the classification of similar objects into several different groups. It is usually applied in statistical data analysis and is used in various fields, for example machine learning, data mining, pattern recognition, image analysis, and bioinformatics [6]. In general, partitioning algorithms such as K-Means and EM are highly recommended for large data sets, in contrast to hierarchical clustering algorithms, which perform well on small data sets [7].
The K-means algorithm proceeds as follows [8]:
1) Determine the number of clusters K. The number of clusters is chosen based on theoretical and conceptual considerations that may suggest how many clusters are appropriate.
2) Generate K initial centroids (cluster center points) at random. The initial centroids are chosen at random from the available objects, one for each of the K clusters. The centroid of a cluster in subsequent iterations is calculated using the following formula:
$$v = \frac{1}{n} \sum_{i=1}^{n} x_i$$
where $v$ is the cluster centroid, $n$ is the number of objects that are members of the cluster, and $x_i$ is the $i$th object in the cluster.
3) Calculate the distance of each object to the centroid of each cluster. The distance between an object and a centroid is calculated using the Euclidean distance.
4) Allocate each object to the nearest centroid. During the iterations, allocation can generally be done in two ways: with hard K-means, where every object is explicitly declared a member of the cluster whose center point is nearest, or with fuzzy K-means.
5) Iterate, then determine the new centroid positions using the equation in step 2.
6) Repeat from step 3 if the new centroid positions differ from the previous ones.
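The six steps above can be sketched in a few lines of Python. This is an illustrative hard K-means under stated assumptions, not the R code used in the study: the function names, the `seed` and `max_iter` parameters, and the rule of keeping the old centroid when a cluster becomes empty are all choices made for this sketch.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points (step 2 formula)."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: pick K initial centroids at random from the objects.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Steps 3-4: assign each object to its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Step 5: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[c])
        # Step 6: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

For the GII data, each point would be the tuple of a country's seven pillar scores and `k` would be 3.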

c. K-Medians Method
The k-medians method is a development of the k-means method. Both produce k clusters by measuring the distance between each center point and each object and then grouping each object with its nearest center point. The two methods differ in several respects, one of which is the cluster center: as the names imply, k-means uses the mean and k-medians uses the median. The median is a descriptive statistic that tends to be more resistant to outliers, so the K-Medians method is expected to reduce clustering errors when outliers are present [9].
The K-medians algorithm proceeds as follows:
1) Determine the number of clusters
In the K-medians method the number k must be determined in advance. There is no specific rule for determining the number of clusters k, and the choice is sometimes based on the subjectivity of the researcher. In this study, the number of clusters k was determined using the Silhouette Coefficient. The stages of the silhouette coefficient calculation are [10]:
• Calculate the average distance between an object and the other objects in the same cluster with the equation:
$$a(i) = \frac{1}{|A| - 1} \sum_{j \in A,\; j \neq i} d(i,j)$$
where
a(i) : average distance between object i and the other objects in its cluster
i : an object in cluster A
j : other objects in cluster A
|A| : number of objects in cluster A
d(i,j) : distance between objects i and j
• Calculate the average distance between the object and all objects in each other cluster C, then take the minimum value with the equation:
$$b(i) = \min_{C \neq A} \frac{1}{|C|} \sum_{j \in C} d(i,j)$$
2) Determine the center point (centroid)
Some opinions on choosing centroids for the k-medians method are as follows:
• Based on Hartigan (1975), the selection of centroids can be determined based on the interval of the number of each observation [11].
• Based on Rencher (2002), the selection of centroids can be determined through one of the hierarchical methods [12].
• Based on Teknomo (2007), the centroids can be selected at random from all observations [13].
In this study, the centroids were chosen following Teknomo, that is, randomly from all observation units.

3) Determine the distance of each observation unit to each centroid
In this step, distance measurements are used to place observations into clusters based on the nearest centroid. The distance measure used in the k-medians method is the Manhattan distance [14]. Manhattan distance is a measurement based on a grid system in which the points in question are placed. To move from the start point to the end point, one of four directions must be chosen at each step: up, down, left, or right. Each decision moves the point one unit in the chosen direction, and the Manhattan distance is the number of decisions required for the start point to reach the end point [15]. The Manhattan distance can be written as follows:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|; \quad i = 1, 2, \dots, n$$
where $x$ and $y$ are two points with $n$ coordinates each.
To find out which method gives the better result, we can use the average standard deviation within clusters ($S_w$) and the standard deviation between clusters ($S_b$) [16]. The average standard deviation within clusters ($S_w$) is:
$$S_w = K^{-1} \sum_{k=1}^{K} S_k \qquad (10)$$
where $K$ is the number of clusters formed and $S_k$ is the standard deviation of the $k$th cluster. The standard deviation between clusters ($S_b$) is:
$$S_b = \left[ (K-1)^{-1} \sum_{k=1}^{K} \left( \bar{X}_k - \bar{X} \right)^2 \right]^{1/2}$$
where $\bar{X}_k$ is the average of the $k$th cluster and $\bar{X}$ is the overall average of the clusters.
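The k-medians procedure described in this section (random initial centroids, Manhattan distance, nearest-centroid assignment) can be sketched in Python as follows. The text does not spell out the update and stopping steps, so the sketch assumes they mirror k-means with the coordinate-wise median in place of the mean. The function names and the `seed` and `max_iter` parameters are hypothetical, and this is not the R code used in the study.

```python
import random
import statistics

def manhattan(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def median_point(points):
    """Coordinate-wise median of a non-empty list of points."""
    return tuple(statistics.median(coord) for coord in zip(*points))

def k_medians(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: choose the initial centroids at random from all observations.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 3: assign each observation to the nearest centroid.
        labels = [min(range(k), key=lambda c: manhattan(p, centers[c]))
                  for p in points]
        # Assumed update step: the new centroid is the coordinate-wise median.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centers.append(median_point(members) if members else centers[c])
        # Assumed stopping rule: stop once the centroids no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

The resistance to outliers comes from `median_point`: for the points (0, 0), (1, 1), and (100, 100), the coordinate-wise median is (1, 1), whereas the mean would be pulled toward the extreme point.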

Results and Discussion
3.1. Variables Description
Before the countries are clustered using k-medians clustering, the average, median, and standard deviation of each variable are calculated. These statistics are used to calculate the confidence intervals that will be used in classifying clusters. The results can be seen in Table 1.

Silhouette Coefficient Value
The silhouette coefficient values were obtained using R and are shown in Table 2. These values show how good the grouping process is and the quality of the groups formed. Based on Table 2, the highest silhouette coefficient is obtained at K = 3. Therefore, the study uses 3 clusters.
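To illustrate how such a value could be computed, the following Python sketch evaluates the silhouette coefficient for a given clustering. It assumes the standard definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), averaged over all objects, where a(i) is the average distance to the other objects in the same cluster and b(i) is the smallest average distance to the objects of any other cluster. The function names, the convention of scoring singleton clusters as 0, and the requirement of at least two clusters and non-identical points are assumptions of this sketch, not details taken from the study.

```python
def manhattan(a, b):
    """Manhattan distance, used here only as an example distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def silhouette_values(points, labels, distance):
    """Per-object silhouette values s(i) for a clustering with >= 2 clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    values = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            # Assumed convention: a singleton cluster scores 0.
            values.append(0.0)
            continue
        # a(i): average distance to the other members of the same cluster.
        a = sum(distance(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest average distance to the members of another cluster.
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        values.append((b - a) / max(a, b))
    return values

def silhouette_coefficient(points, labels, distance):
    """Average silhouette value over all objects."""
    values = silhouette_values(points, labels, distance)
    return sum(values) / len(values)
```

Running the clustering for several candidate values of K and keeping the K with the largest `silhouette_coefficient` reproduces the selection rule described above.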

a. Outlier Detection and Sample Representing The Population
Outlier detection in R showed that the data contain an outlier. The Kaiser-Meyer-Olkin Measure of Sampling Adequacy is 0.919. Because this KMO value lies between 0.5 and 1, it can be concluded that the sample represents the population and that the variables can be used for further analysis [17].
Table 3 shows the results of grouping using the K-Means algorithm with Euclidean distance: Cluster 1 consists of 48 countries, Cluster 2 consists of 45 countries, and Cluster 3 consists of 33 countries. To differentiate the clusters that are formed, it is necessary to profile them by calculating the average value of each variable, as shown in Table 4. Based on Table 4, the characteristics of each cluster can be identified. Cluster 1 has fairly high average values of the seven variables, Cluster 2 has low average values, and Cluster 3 has the highest average values.

d. Cluster results using K-Medians Clustering
Using R to perform k-medians clustering, 3 clusters of countries were obtained based on the Global Innovation Index. Table 5 shows the results of grouping using the K-Medians algorithm with Manhattan distance: Cluster 1 consists of 33 countries, Cluster 2 consists of 53 countries, and Cluster 3 consists of 40 countries. To differentiate the clusters that are formed, it is necessary to profile them by calculating the average value of each variable, as shown in Table 6. Based on Table 6, Cluster 1 has the highest average values of the seven variables, Cluster 2 has fairly high average values, and Cluster 3 has low average values. ... 72897 So the ratio of the standard deviation within clusters to the standard deviation between clusters using the k-medians method is: From the results of all clusters using the K-means and K-Medians methods, cluster validation is sought for both methods using cluster variance values. The smaller the cluster variance value, the better the clustering.
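As a hedged illustration of this validation step, the Python sketch below computes the average within-cluster standard deviation, the between-cluster standard deviation, and their ratio, following the formulas for $S_w$ and $S_b$ given in the Methods section. Several details are assumptions of the sketch rather than facts from the study: it works on a single numeric variable, it uses the sample standard deviation for each cluster (so every cluster needs at least two members), and it takes the overall average as the mean of the cluster averages. The function name is hypothetical.

```python
import math
import statistics

def within_between_ratio(values, labels):
    """Return (S_w, S_b, S_w / S_b) for one variable and a cluster labeling."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    k = len(clusters)
    # S_w: average of the per-cluster standard deviations S_k.
    s_w = sum(statistics.stdev(members) for members in clusters.values()) / k
    # S_b: spread of the cluster averages around their overall average.
    means = [statistics.mean(members) for members in clusters.values()]
    overall = statistics.mean(means)
    s_b = math.sqrt(sum((m - overall) ** 2 for m in means) / (k - 1))
    return s_w, s_b, s_w / s_b
```

A smaller ratio indicates tighter clusters relative to their separation, which is the sense in which a smaller value is better when the two methods are compared.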

Conclusion
Based on the analysis carried out, 3 clusters were formed with each method. With the k-means method, Cluster 1 consists of 48 countries, Cluster 2 consists of 45 countries, and Cluster 3 consists of 33 countries. Based on the average values, Cluster 1 has fairly high average values of the seven variables, Cluster 2 has low average values, and Cluster 3 has the highest average values. With the k-medians method, Cluster 1 consists of 33 countries, Cluster 2 consists of 53 countries, and Cluster 3 consists of 40 countries. Cluster 1 has the highest average values of the seven variables, Cluster 2 has fairly high average values, and Cluster 3 has low average values. The comparison of the two clusterings showed that k-medians is better than k-means because the variance value of k-medians (0.297) is smaller than that of k-means (0.305).