ISSN 0798 1015

logo

Vol. 38 (Nº 59) Year 2017. Page 27

ClusCTA-MEWMAChart: A new clustering-based technique to detect Concept Drift in the presence of noise

ClusCTA-MEWMAChart: Una nueva técnica basada en clustering para detectar concept drift en presencia de ruido

Sonia JARAMILLO-VALBUENA 1; Jorge Mario LONDOÑO-PELÁEZ 2; Sergio Augusto CARDONA 3

Received: 15/08/2017 • Approved: 12/09/2017


Content

1. Introduction

2. Related word

3. ClusCTA-MEWMACHART

4. Implementation and method

5. Conclusion

Bibliographic references


ABSTRACT:

Many real-world applications generate data streams. Typically, data evolves over time and must be processed on-the-fly without the need for long term storage or reprocessing. In machine learning, the inherent variability or change over time of streams is referred to as concept drift. This phenomenon creates new challenges not present in classical machine learning techniques. In this paper, we present a new clustering-based technique to detect Concept Drift on data streams, named ClusCTA-MEWMAChart. We compare our algorithm experimentally with 4 different methods to detect concept drift from data streams and determine their robustness in the presence of noisy data. We conducted a set of experiments on synthetic datasets. The results show that the proposed approach has good performance.
Key words Concept drift; Classification; Data Stream Mining; adaptive learning

RESUMEN:

Muchas aplicaciones del mundo real generan grandes cantidades de datos que son continuos. A esto se le conoce como streams de datos. Típicamente, los streams de datos evolucionan con el tiempo. Este fenómeno, conocido como concept drift, crea nuevos desafíos que no están presentes en las técnicas clásicas de aprendizaje automático, dado que los datos deben ser procesados sobre la marcha sin la necesidad de almacenamiento a largo plazo o reprocesamiento. En este artículo, se presenta una nueva técnica basada en clusters para detectar Concept Drift en streams de datos, llamada ClusCTA-MEWMAChart. Este algoritmo es comparado experimentalmente con 4 métodos diferentes para detectar concept drift en streams de datos y se determina su solidez en presencia de ruido. Se realizan experimentos en datasets sintéticos. Los resultados muestran que el enfoque propuesto tiene un buen desempeño.
Palabras clave Concept drift; clasificación; minería sobre streams de datos, aprendizaje adaptativo

PDF version

1. Introduction

In recent years many applications generating huge volumes of data in streamed manner have become commonplace.  Network monitoring, phone records, social networks, ATMtransactions, sensor data, and biological applications are examples of sources of data streams. Streaming data pose challenging research problems in order to respond to tasks such as statistics maintenance, storage, and real time querying and pattern discovery (Gama J.). Most traditional data mining techniques have to be adapted to analyze streaming data on the fly, normally trying to avoid storing the data in the stream (Le, Stahl, Gomes, Medhat Gaber & Di Fatta, 2014). The processing when the underlying data model changes over time is difficult because it must consider the presence of noise (Fang Chu, 2005) and the evolving nature of the stream, and therefore, should be able to detect and adapt quickly to concept drifting.

The term “concept drift” means that the statistical properties of the target concept change arbitrarily over time (Widmer & Kubat, 1996) (Wang, Yu, & Han, 2010) (Minku & Yao, 2012), causing the previously constructed model to be inconsistent and requiring an update or to be replaced with a new one to prevent accuracy deterioration (Wang, Yu, & Han, 2010) (Dongre & Malik, 2014).

Kelly et al. (Kelly, Hand & Adams., 1999) indicate that depending on the nature of the concept drift, it can occur in three different ways:

The prior probabilities of classes P(c1),..., P(ck) might change over time.

Class-conditional probability distributions might change P(X|ci), i = 1,…, k over time. This kind of concept drift is referred to as virtual drift, because it is possible that P(X|ci) change without  affecting the previous classes (e.g symmetric movement to opposite directions).

Posterior probabilities P(ci|X), i = 1,…,k might change. This kind of change is often known as real drift.

The concept drift can also be classified based onits speed. In particular the speed can be: sudden or gradual. The sudden drift (also referred to as concept change) represents irreversible and abrupt changes of samples of respective class. The gradual drift corresponds to slow and gradual changes over time (Gama, Zliobaite, Bifet, Pechenizkiy, & Bouchachia, 2014) (Dongre & Malik, 2014).

An important challenge for Concept Drift Detection algorithms is to differentiate the occurrence of outliers or noise in the data from actual changes in the target concept. The outliers refer to random deviations or statistical anomalies (Chandola, Banerjee & Kumar, 2009). There are some approaches designed to handle concept drift but only very few studies in the literature have documented the performance of these techniques in the presence of noise. 

In this paper we propose a new clustering technique to detect Concept Drift on data streams, it is referred as ClusCTA-MEWMAChart. This approach uses multiple sliding windows and centroid tracking for maintaining enough knowledge about centroid behavior, and multivariate EWMA to report the occurrence of concept drift.

We compare our method with 4 different algorithms  for concept-drift detection: the Drift Detection Method (DDM) (Gama & Rodrigues, 2004),  the Early Drift Detection Method EDDM (Baena-Garcia, et al., 2006), a drift detection method based on Geometric Moving Average Test (Roberts, 2000) and the Exponentially Weighted Moving Average Chart Detection Method EWMAChartDM (Del Castillo, 2001) (Ross, Adams, Tasoulis & Hand, 2012) (Page, 1954) (Mouss & Sefouhi, 2004) .

The paper is organized as follows. Section 2 reviews related works on concept drift detection. Section 3 describes the technique we propose. Section 4 shows implementation details and model evaluation. The last section provides a summary of the findings in this work.

2. Related word

There are several approaches designed to detect concept drift.   In this section we describe 4 known algorithms in the literature: DDM, EDDM, GeometricMovingAverageDM and EWMAChartDM.

2.1. Drift Detection Method (DDM)

2.2. Early Drift Detection Method (EDDM)

2.3. GeometricMovingAverage (GMA) Detection Method

2.4. Exponentially Weighted Moving Average Chart Detection Method

3. ClusCTA-MEWMACHART

In this section, we introduce a new technique based on clustering and multivariate EWMA to detect Concept Drift on data streams. The method proposed is referred as ClusCTA-MEWMAChart. ClusCTA-MEWMAChart is a component of ClusCTA. ClusCTA is a new clustering method.

3.1. Clustering based on Centroid Tracking (ClusCTA)

Jaramillo et al. proposed ClusCTA. ClusCTA uses multiple sliding windows and centroid tracking for maintaining enough knowledge about centroid behavior.  At the beginning, ClusCTA constructs a clustering model by running 5 times the k-Means++ algorithm on the first arrival sample window. It keeps the best of the five models, and based on the estimated radius of the obtained clusters, configures MCOD (Tran, Fan & Shahabi, 2016) (Kontaki, Gounaris, Papadopoulos, Tsichlas & Manolopoulos, 2011) to perform an outlier or noise filtering process.  To conclude the initialization phase, ClusCTA invokes ClusTree on the filtered window and it returns the clustering model of the initial sample window. ClusTree (Kranen, Assent, Baldauf & Seidl, 2011) is a compact and self-adaptive index structure, which maintains stream summaries and reports novelty and outliers. The application of the ClusTree on the filtered data allows obtaining a set of good quality initial centroids. It is important for ClusCTA to start from a good quality set of centroids (even if there is noise in the data stream).

The first centroids calculated are stored at an array of current ccentroids. After this, ClusCTA maintain a sliding window of samples per clusterwith a maximum sizeofn/k. When a new instance arrives, it is processed individually, associated with a cluster (using aNearest Neighbor Criteria)andrecorded in the sliding window of the corresponding cluster. Additionally, ClusCTA have a motion window per cluster (of size m). In this ClusCTA record thearrival time of the new sample and the impact that it has on the movement of its centroid. The computation of the tracking model uses the samples stored in the centroid motion window.

3.2. ClusCTA-MEWMAChart

4. Implementation and method

We implement ClusCTA-MEWMAChart on top of the Massive On-line Analysis MOA framework (The University of Waikato, 2015), a widely known framework for data stream mining written in Java.

RandomRBFGeneratorEvents is a generator based on the random Radial Basis Function that adds driftto samples in a stream(Bifet A. , 2012).  The random radial basis function, described in (Bifet, Holmes, Pfahringer, & Gavalda, 2009)   generates a fixed number of random centroids. Each centroid has a random position, a class label, a standard deviation and a weight. The instances are generated by selecting a centroid at random. For this process, the weights of the centroids are taken into consideration, so centroids with larger weight are more likely to be chosen. The chosen centroid determines the class label of the instance. The random radial basis function gives rise to a normally distributed hyper sphere of instances enclosing each centroid.  Drift is added by moving the centroids at a constant rate.

For the evaluation with DDM, EDDM, GeometricMovingAverageDM and  EWMAChartDM, we use DriftDetectionMethodClassifier (Baena, 2012), a class for detect concept drift with a wrapper on a classifier.  The chosen classifier is a Naive Bayes classifier and the concept Drift Detection Technique can be any of the 4 methods (DDM, EDDM, GeometricMovingAverageDM and  EWMAChartDM. ClusCTA-MEWMAChart does not use DriftDetectionMethodClassifier.

We configure the RandomRBFGeneratorDrift, so that it creates five centroids of which 2 have no movement (speedOption=0) and 3 do have. In a same execution, a cluster with movement can change its speed (speedOption may take 3 different values: 0.01/500, 0.001/500 and  0.1/500).

In data streams, it is possible to have imbalanced classes, and the accuracy can be misleading   when this happens. In this case, it may be desirable to have a model with a lower accuracy but with greater predictive power (Brownlee, 2014). For this reason, to assess the accuracy of the methods to identify concept drift, in addition to  accuracy,  we  use Youden's J statistic.

Table 1
The Parameters used for evaluation

Concept drift  detection method

 

Parameters used for evaluation

ClusCTA-MEWMAChart

λ=0.10

h =23%

WMcovariance=200  (movements of each centroid to calculate the covariance matrix of the Zi  per cluster)

Wpolynomial=350 (sliding window of 350 samples  to calculate the polynomial regression model). 

EWMAChartDM

λ=0.10

minNumInstancesOption=30 (minimum number of stored instances before permitting the detection of a change) 

DDM

minNumInstancesOption=30

EDDM

minNumInstancesOption=30

Geometric Moving Average Test

minNumInstancesOption=30

h=1

α=0.99

To identify the best parameters for each concept drift detection method, we try several values. We use multiobjective optimization to obtain the best parametrization. For this purpose, we select the configuration by maximizing 2 classification quality indices, accuracy and Youden's J. The parameter MinNumInstancesOption did not show influence quality assessment of the methods.  The best configuration of each method was selected to compare them against each other method. The results obtained are summarized in Figures 1-5.

Figure 1
Performance comparison for concept drift detection, noise level=0%

Source: Own image

-----

Figure 2
Performance comparison for concept drift detection, noise level=5%

Source: Own image

-----

Figure 3
Performance comparison for concept drift detection, noise level=10%

Source: Own image

-----

Figure 4
Performance comparison for concept drift detection, noise level=15%

Source: Own image

-----

Figure 5
Performance comparison for concept drift detection, noise level=20%

Source: Own image

Figures 1 to 5 report the experimental result for noise level between  0 %  and 20%. From the observation, we can find that the classification accuracy and Youden’s J is always higher with ClusCTA- MEWMAChart, than with the 4 others methods of drift detection.

When ClusCTA- MEWMAChart detects drift, a new model is induced by applying the clustering algorithm ClusTree in the samples stored.

5. Conclusion

In this paper, we present ClusCTA-MEWMAChart  as  a concept drift detector on centroids, and compare this with other similar approximations with respect to how they performed in the presence of concept drifts, with different velocities of change. ClusCTA-MEWMAChart  shows good performance and the centroid tracking proves to be effective to react to concept drift. ClusCTA- MEWMAChart was the best concept drift detection method considering both the accuracy and Youden’s J.

ClusCTA-MEWMAChart is designed to detect gradual changes, that is, slow and gradual changes over time. Therefore, how to modify our method to detect abrupt changes is a subject of future research.

Future work includes to extend ClusCTA-MEWMAChart   to consider novelty detection (new emergent classes) in data streams scenarios and online class imbalance learning.  The process of concept drift detection in imbalanced data is a more complicated task than in the balanced classes, leading to a degradation in performance (Wang, Minku, Ghezzi, Caltabiano & Tino, 2012).

Bibliographic references

BAENA-GARCIA, M., CAMPO-AVILA, J. D., FIDALGO, R., BIFET, A., GAVALDA, R., & MORALES-BUENO, R. (2006). Early drift detection method. Fourth International Workshop on Knowledge Discovery from Data Streams.

BASSEVILLE, M. & NIKIFOROV, I. V. (2012). Detection of Abrupt Changes. Retrieved  march  2016, from ftp://ftp.irisa.fr/local/as/mb/k11.pdf

CHANDOLA, V., BANERJEE, A., & KUMAR, V. (2009). Anomaly Detection: A Survey. ACM Comput. Surv., 41(3), 15:1-15:58. From http://doi.acm.org/10.1145/1541880.1541882

DEL CASTILLO, E. (2001). Some properties of ewma feedback quality adjustment schemes for drifting disturbances. Journal of Quality Technology, 33(2).

DONGRE, P. B., & MALIK, L. G. (2014). Stream Data Classification and Adapting to Gradual Concept Drift. International Journal of Advance Research in Computer Science and Management Studies, 2 (3).

DONGRE, P., & MALIK, L. (2014, Feb). A review on real time data stream classification and adapting to various concept drift scenarios. Advance Computing Conference (IACC), IEEE International, (pp. 533-537).

FANG CHU, A. (2005). Mining Techniques for Data Streams and Sequences. University of California at Los Angeles.

GAMA, J. (2010). Knowledge Discovery from Data Streams. Chapman & Hall/CRC. Retrieved  march of 2016

GAMA, J. M., & RODRIGUES, P. (2004). Learning with drift detection. Lecture Notes in Computer Science 3171.

GAMA, J., ZLIOBAITE, I., BIFET, A., PECHENIZKIY, M., & BOUCHACHIA, A. (2014). A Survey on Concept Drift Adaptation. (ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2013).

HOTELLING, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2 (3): 360–378.

JARAMILLO, S., LONDOÑO, J.-M., & CARDONA, S.-A. (s.f.). Performance Evaluation of Concept Drift Detection Techniques in the presence of noise. Revista Espacios, Vol.38(38)2017.

KELLY, M., HAND, D. J., ADAMS & M., N. (1999). The impact of changing populations on classifier performance. KDD, 367–371.

KONTAKI, M., GOUNARIS, A., PAPADOPOULOS, A. N., TSICHLAS, K., & MANOLOPOULOS, Y. (2011). Continuous monitoring of distance-based outliers over data streams. Proceeding ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, (págs. 135-146).

KRANEN, P., ASSENT, I., BALDAUF, C., & SEIDL, T. (2011). The ClusTree: Indexing Micro-clusters for Anytime Stream Mining. Journal Knowledge and Information Systems, 29(2), 249-272. From http://dx.doi.org/10.1007/s10115-010-0342-8

LE, T., STAHL, F., GOMES, J. B., MEDHAT GABER, M., & DI FATTA, G. (2014). Computationally Efficient Rule-Based Classification for Continuous Streaming Data. Springer International Publishing Research and Development. Switzerland. doi:10.1007/978-3-319-12069-0_2

LOWRY, C. A., WOODALL, W. H., CHAMP, C. W., & RIGDON, S. E. (1992). A Multivariate Exponentially Weighted Moving Average Control Chart. Technometrics, 34(1), 46-53. From http://dx.doi.org/10.2307/1269551

MINKU, L., & YAO, X. (2012). DDD: A New Ensemble Approach for Dealing with Concept Drift. Knowledge and Data Engineering, IEEE Transactions on, 24(4), 619-633.

MOUSS, H., MOUSS, D., MOUSS, N., & SEFOUHI, L. Test of Page-Hinkley, an Approach for Fault Detection in an Agro-Alimentary Production System. Proceedings of the 5th Asian Control Conference. Vol.2, pages: 815-818, 2004.

NASON, G. P. (2006). Stationary and non-stationary time series. H. Mader; S.C Coles. The Geological Society.

NIST SEMATECH. (2014). Multivariate EWMA Charts. Retrieved  el 19 de oct de 2016, de http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc343.htm

PAGE, E. S. Continuous Inspection Schemes. Biometrika, 41:100-115, 1954.

PATEL, A. K., & DIVECHA, J. (2013). Modified MEWMA Control Scheme for an Analytical Process data. Global Journal of Computer Science and Technology Software and data Engeneering, 13(3).

POLLOCK, D. (2007). Regression Analysis. Retrieved  el 02 de december de 2015, de http://www.le.ac.uk/users/dsgp1/COURSES/THIRDMET/MYLECTURES/2MULTIREG.pdf

ROBERTS, S. W. (2000). Control chart tests based on geometric moving average. Technometrics, 42(1):97.

ROSS, G. J., ADAMS, N. M., TASOULIS, D. K. & HAND, D. J. (2012). Exponentially Weighted Moving Average Charts for Detecting Concept Drift. Pattern Recognition Letters, 33(2) 191-198, 2012.

ROSSI, P. E., ALLENBY, G. M., & MCCULLOCH, R. (2012). Bayesian Statistics and Marketing. John Wiley & Sons.

TRAN, L., FAN, L., & SHAHABI, C. (2016). Distance based Outlier Detection in Data Streams. Proceedings of the VLDB Endowment, Vol. 9, No. 12.

WANG, H., YU, P., & HAN, J. (2010). Mining Concept-Drifting Data Streams. En O. Maimon, & L. Rokach (Edits.), Data Mining and Knowledge Discovery Handbook (págs. 789-802). Springer US.

WANG, H., YU, P., & HAN, J. (2010). Mining Concept-Drifting Data Streams. In O. Maimon, & L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook (pp. 789-802). Springer US. Retrieved from http://dx.doi.org/10.1007/978-0-387-09823-4_40

WIDMER, G., & KUBAT, M. (1996). Learning in the Presence of Concept Drift and Hidden Contexts. In d. 10.1007/BF00116900 (Ed.), Journal Machine Learning archive, Volume 23 Issue 1, (pp. 69-101). USA.

YEH, A. M. (2008). EWMA control charts for monitoring high-yield processes based on                   non-transformed observations. International Journal of Production Research 46 (20), 5679-5699.


1. PhD in Engineering. Computer Engineering Dept., Universidad del Quindío, UQ. Armenia, (Colombia), sjaramillo@uniquindio.edu.co

2. PhD in Computer Science. Engineering Dept. Universidad Pontificia Bolivariana, UPB. Medellín, (Colombia), jorge.londono@upb.edu.co

3. PhD candidate in Engineering. Computer Engineering Dept., Universidad del Quindío, UQ, Armenia, (Colombia), sergio_cardona@uniquindio.edu.co


Revista ESPACIOS. ISSN 0798 1015
Vol. 38 (Nº 59) Year 2017

[Index]

[In case you find any errors on this site, please send e-mail to webmaster]

revistaespacios.com