CN112085053A

CN112085053A - Data drift discrimination method and device based on nearest neighbor method

Info

Publication number: CN112085053A
Application number: CN202010749770.1A
Authority: CN
Inventors: 李锐; 金长新
Original assignee: Jinan Inspur Hi Tech Investment and Development Co Ltd
Current assignee: Shandong Inspur Scientific Research Institute Co Ltd
Priority date: 2020-07-30
Filing date: 2020-07-30
Publication date: 2020-12-15
Anticipated expiration: 2040-07-30
Also published as: CN112085053B

Abstract

The application discloses a data drift discrimination method and device based on a nearest neighbor method, which address the problems that existing data drift discrimination algorithms consume a large amount of computing power, are complex, and are difficult to implement. The method comprises the following steps: a server acquires a standard reference data set; the server acquires a test data set; for each piece of data to be tested in the test data set, the server judges, based on a nearest neighbor algorithm, the similarity between the data to be tested and the standard reference data set and the similarity between the data to be tested and the test data set; and the server judges whether data drift has occurred in the test data set according to the similarity judgment result of each piece of data to be tested in the test data set.

Description

Data drift discrimination method and device based on nearest neighbor method

Technical Field

The present invention relates to the field of concept drift, and in particular, to a data drift determination method and apparatus based on a nearest neighbor method.

Background

With the popularization and development of network applications, data in various industries are continuously generated as data streams; such data are massive in volume and change rapidly. For example, in the industrial field, sensors need to constantly collect new data; in the e-commerce field, merchants need to continuously acquire behavior data of users.

For the same subject, data acquired at different times are referred to as time series data, which can be used to describe the time-varying condition of the subject. However, in many areas, the data distribution may change unpredictably over time, resulting in data drift that may render existing data models inapplicable to new data. Therefore, in order to select an appropriate data model, a data analyst needs to determine whether there is data drift in the data.

At present, one existing algorithm for judging whether data drift has occurred is the three-branch decision tree concept algorithm. In the detection process, the training data are classified using a decision tree and are then assigned to one of three decision domains, the L domain, the R domain and the M domain, according to the classification error rate of each subtree. The L domain, the R domain and the M domain respectively indicate that the data have not drifted, that the data have drifted, and that the data may have drifted.

However, the existing algorithms for judging data drift, including the three-branch decision tree concept algorithm, often consume a large amount of computing power, are complex, and are difficult to operate.

Disclosure of Invention

The embodiment of the application provides a data drift judging method and device based on a nearest neighbor method, and aims to solve the problems of large calculation amount, complexity and impracticality of the existing data drift judging method.

In one aspect, an embodiment of the present application provides a data drift discrimination method based on a nearest neighbor method, where the method includes:

the server acquires a standard reference data set;

the server acquires a test data set;

the server judges the similarity between the data to be tested and the standard reference data set and the similarity between the data to be tested and the test data set based on a nearest neighbor algorithm aiming at each data to be tested in the test data set;

and the server judges whether the test data group has data drift or not according to the similarity judgment result of each to-be-tested data in the test data group.

In one example, the standard reference data set is generated at an earlier time than the test data set.

In one example, before the server obtains the test data set, the method further comprises: the server determines a test data window for storing the test data set.

In one example, the server determining, for each piece of data to be tested in the test data set, the similarity between the data to be tested and the standard reference data set and the similarity between the data to be tested and the test data set based on a nearest neighbor algorithm includes: the server calculates the distance between the data to be tested and each piece of data in the standard reference data set and the distance between the data to be tested and each remaining piece of data in the test data set; based on these distances, the server selects the K pieces of data closest to the data to be tested, wherein K is a preset parameter; and the server judges the similarity of the data to be tested to the standard reference data set and to the test data set based on the K pieces of data.

In one example, the preset parameter K is an odd number.

In one example, the server judging the similarity of the data to be tested to the standard reference data set and to the test data set based on the K pieces of data includes: determining the number of data belonging to the standard reference data set among the K pieces of data as a first number; determining the number of data belonging to the test data set among the K pieces of data as a second number; if the first number is greater than the second number, the data to be tested is similar to the standard reference data set; and if the first number is smaller than the second number, the data to be tested is similar to the test data set.

In one example, the server judging whether data drift has occurred in the test data set according to the similarity judgment result of each piece of data to be tested in the test data set includes: determining the number of data to be tested in the test data set that are similar to the standard reference data set as a third number; determining the number of data to be tested in the test data set that are similar to the test data set as a fourth number; if the third number is greater than the fourth number, no data drift has occurred in the test data set; and if the third number is smaller than the fourth number, data drift has occurred in the test data set.

In one example, the server calculating the distance between the data to be tested and each piece of data in the standard reference data set and the distance between the data to be tested and each remaining piece of data in the test data set comprises: calculating these distances based on a Euclidean distance formula; the Euclidean distance formula is as follows:

$$D(x, y) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

wherein D(x, y) represents the distance between the data to be tested and the corresponding data, (x_1, y_1) represents the coordinates of the data to be tested, and (x_2, y_2) represents the coordinates of the corresponding data.

In one example, the method further comprises: and if the test data group drifts, sending a data drifting result to corresponding edge equipment so that the edge equipment performs corresponding data processing on the test data group.

On the other hand, an embodiment of the present application further provides a data drift determination device based on a nearest neighbor method, where the device includes:

the first acquisition module is used for acquiring a standard reference data set;

the second acquisition module is used for acquiring the test data set;

the first judgment module is used for judging the similarity between the data to be tested and the standard reference data set and the similarity between the data to be tested and the test data set on the basis of a nearest neighbor algorithm aiming at each data to be tested in the test data set;

and the second judging module is used for judging whether the test data group has data drift or not according to the similarity judging result of each to-be-tested data in the test data group.

The data drift discrimination method and device based on the nearest neighbor method provided by the embodiment of the application have at least the following beneficial effects: whether the test data set has drifted is judged by the KNN algorithm, so the method is simple, efficient and easy to understand, requires no parameter estimation, and consumes little computing power. The use of a standard reference data set increases the stability and robustness of the judgment of whether data drift has occurred in the test data set. In addition, the method can be used in edge devices in combination with sensors, so that changes in the data can be found as soon as they occur and the data can be processed accordingly in time.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a flowchart of a data drift determination method based on a nearest neighbor method according to an embodiment of the present application;

fig. 2 is a schematic diagram of the KNN algorithm provided in the embodiment of the present application;

fig. 3 is a schematic structural diagram of a data drift determination device based on a nearest neighbor method according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The technical solutions proposed in the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart of a data drift determination method based on a nearest neighbor method according to an embodiment of the present application, where the method includes the following steps:

S101: the server obtains a standard reference data set.

In the embodiment of the application, the server randomly selects a segment of data, as the standard reference data set, from the time series data collected by an acquisition device or from time series data pre-stored in a database. The acquisition device may be a sensor or another device.

The standard reference data set is a collection of several pieces of standard reference data. The standard reference data set may conform to any statistical distribution. It serves as the basis for judging whether the statistical distribution of the test data set is the same as that of the standard reference data set, and thereby whether data drift has occurred in the test data set.

The length of the standard reference data set may be set as required, which is not limited in the present application.

S102: the server obtains a test data set.

In the embodiment of the application, the server acquires the test data group from the time sequence data acquired by the acquisition device or the time sequence data stored in the database.

The test data set is a data set which needs to be judged whether data drifting occurs in the application. The test data group comprises a plurality of pieces of data to be tested. The dimensions of the data to be tested in the test data set can be set as required, which is not limited in the present application.

In one embodiment, because time series data may change over time, the server may obtain a test data set and a standard reference data set that are separated in time from the time series data collected by the acquisition device. The generation time of the standard reference data set should be earlier than that of the test data set, so that the standard reference data set, whose statistical distribution is known in advance, can be used to judge whether the test data set belongs to the same statistical distribution.
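The time ordering described above can be illustrated with a minimal Python sketch. It assumes the time series is already held in memory in chronological order and simply takes the oldest points as the standard reference data set and the newest points as the test data set; the function name `split_reference_and_test` and its parameters are hypothetical, and the application itself does not prescribe how the two segments are chosen beyond the reference being earlier.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def split_reference_and_test(
    series: Sequence[Point], ref_len: int, test_len: int
) -> Tuple[List[Point], List[Point]]:
    """Split a chronologically ordered series into two non-overlapping groups.

    Assumption of this sketch: the oldest `ref_len` points form the standard
    reference data set and the newest `test_len` points form the test data set.
    """
    if ref_len <= 0 or test_len <= 0:
        raise ValueError("group lengths must be positive")
    if ref_len + test_len > len(series):
        raise ValueError("series too short for the requested group lengths")
    reference = list(series[:ref_len])   # generated earlier
    test = list(series[-test_len:])      # generated later
    return reference, test
```

Both lengths are left as free parameters, consistent with the statement that the lengths may be set as required.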

In one embodiment, the server may determine a test data window before obtaining the test data set. The test data window is a storage unit used for storing the test data set. Thus, the length of the test data set (i.e., the number of pieces of data to be tested included in the test data set) is the same as the length of the test data window. The length of the test data window may be set according to the length requirement of the test data set, which is not limited in the present application.
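One possible way to realize such a fixed-length window is sketched below using Python's `collections.deque` with a `maxlen`. The class name `DataWindow` and its methods are hypothetical, and the sliding behaviour (the oldest datum is discarded when a new one arrives at a full window) is an assumption of this sketch; the application only states that the window stores the test data set and has the same length.

```python
from collections import deque
from typing import Any, List


class DataWindow:
    """Fixed-length buffer holding the current test data set."""

    def __init__(self, length: int) -> None:
        if length <= 0:
            raise ValueError("window length must be positive")
        # deque(maxlen=...) silently drops the oldest item once full.
        self._buffer: deque = deque(maxlen=length)

    def push(self, datum: Any) -> None:
        """Store one newly collected piece of data to be tested."""
        self._buffer.append(datum)

    def is_full(self) -> bool:
        """True once the window holds a complete test data set."""
        return len(self._buffer) == self._buffer.maxlen

    def snapshot(self) -> List[Any]:
        """Return the current test data set, oldest first."""
        return list(self._buffer)
```

A caller could push sensor readings as they arrive and run the drift judgment on `snapshot()` whenever `is_full()` returns true.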

S103: for each piece of data to be tested in the test data set, the server judges, based on a nearest neighbor algorithm, the similarity between the data to be tested and the standard reference data set and the similarity between the data to be tested and the test data set.

In the embodiment of the application, for each piece of data to be tested in the test data set held in the test data window, the server judges, based on the K-nearest neighbor (KNN) method, whether the data to be tested is more similar to the standard reference data set or to the test data set.

That is, the piece of data is compared with the remaining data in the test data set and with the data in the standard reference data set to judge its similarity to the standard reference data set and to the test data set.

In one embodiment, the step of determining the similarity of the data to be tested to the test data set and the standard reference data set comprises:

First, the distance between the data to be tested and each remaining piece of data in the test data set, and the distance between the data to be tested and each piece of data in the standard reference data set, are calculated.

The distance between the data to be tested and another piece of data represents the similarity between them: the closer the distance, the higher the similarity, and the farther the distance, the lower the similarity.

Second, the distances obtained in the first step are sorted.

Third, a preset parameter K is determined, and the K pieces of data closest to the data to be tested are selected.

Fourth, the similarity of the data to be tested to the standard reference data set and to the test data set is judged based on the K pieces of data.
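The first three steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the claimed implementation: the function name `k_nearest_labels` and the string labels `"reference"` and `"test"` are hypothetical, and the distance function defaults to `math.dist` (Euclidean distance, available from Python 3.8) but can be swapped for another metric.

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]


def k_nearest_labels(
    index: int,
    test_group: Sequence[Point],
    reference_group: Sequence[Point],
    k: int,
    distance: Callable[[Point, Point], float] = math.dist,
) -> List[str]:
    """Return the group labels of the K data closest to test_group[index]."""
    target = test_group[index]
    # Step 1: distances to every reference datum and to every *remaining*
    # test datum (the datum under test is excluded from its own neighbours).
    labelled = [(distance(target, p), "reference") for p in reference_group]
    labelled += [
        (distance(target, p), "test")
        for i, p in enumerate(test_group)
        if i != index
    ]
    # Step 2: sort by distance (the sort is stable, so on equal distances
    # reference data stay ahead of test data in this sketch).
    labelled.sort(key=lambda pair: pair[0])
    # Step 3: keep the K nearest.
    return [label for _, label in labelled[:k]]
```

The returned labels are the input to the fourth step, in which the two groups are counted against each other.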

In one embodiment, the server calculates the distance of the data to be tested from each data in the standard reference data set and the distance of the data to be tested from each remaining data in the test data set based on the Euclidean distance formula.

Taking two-dimensional data as an example, the Euclidean distance formula is as follows:

$$D(x, y) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

wherein D(x, y) represents the distance between the data to be tested and the corresponding data, (x_1, y_1) represents the coordinates of the data to be tested, and (x_2, y_2) represents the coordinates of the corresponding data.
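The two-dimensional Euclidean distance can be checked numerically with a short Python sketch. The function names `euclidean_2d` and `euclidean_nd` are hypothetical; the n-dimensional variant is an assumption added here for data with more than two dimensions, which the application allows but does not write out.

```python
import math
from typing import Sequence, Tuple


def euclidean_2d(p: Tuple[float, float], q: Tuple[float, float]) -> float:
    """Distance between (x1, y1) and (x2, y2) in the plane."""
    (x1, y1), (x2, y2) = p, q
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)


def euclidean_nd(p: Sequence[float], q: Sequence[float]) -> float:
    """Same formula summed over every dimension of the data."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean_2d((0.0, 0.0), (3.0, 4.0)))  # 5.0
```

The standard library function `math.dist` computes the same quantity for points of any dimension.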

In one embodiment, when the server determines similarity between the data to be tested and the standard reference data group and the test data group based on the K pieces of data, the server may determine the number of data belonging to the standard reference data group in the K pieces of data as the first number, and determine the number of data belonging to the test data group in the K pieces of data as the second number.

If the first number is larger than the second number, it indicates that the number of data similar to the data to be tested in the standard reference data group is larger in the K pieces of data, and it can be considered that the similarity degree of the data to be tested and the standard reference data group is higher, and the data to be tested is similar to the standard reference data group.

If the first number is smaller than the second number, it indicates that the number of data similar to the data to be tested in the test data group is greater in the K pieces of data, and it can be considered that the similarity between the data to be tested and the test data group is higher, and the data to be tested is similar to the test data group.

If the first number is equal to the second number, the K pieces of data contain as many data from the standard reference data set as from the test data set, the data to be tested is equally similar to both, and its similarity cannot be judged.

In one embodiment, the value of K is preferably odd. An even K allows the K nearest pieces of data to contain equal numbers of data from the standard reference data set and from the test data set, in which case the similarity of the data to be tested cannot be judged; an odd K avoids this uncertainty.
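The comparison of the first number and the second number can be sketched as a small Python function. The function name `judge_similarity` and the labels `"reference"` and `"test"` are hypothetical, and returning `None` for the undecidable tie is a design choice of this sketch.

```python
from typing import Optional, Sequence


def judge_similarity(neighbour_labels: Sequence[str]) -> Optional[str]:
    """Majority vote over the group labels of the K nearest data.

    Returns "reference", "test", or None when the vote is tied.
    """
    first = sum(1 for label in neighbour_labels if label == "reference")
    second = sum(1 for label in neighbour_labels if label == "test")
    if first > second:
        return "reference"   # similar to the standard reference data set
    if first < second:
        return "test"        # similar to the test data set
    return None              # equal counts: similarity cannot be judged
```

With an odd K and every neighbour carrying one of the two labels, the tie branch is never reached, which is the motivation for preferring an odd K.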

For convenience of explanation, the present application will be described taking two-dimensional data as an example.

Fig. 2 is a schematic diagram of the KNN algorithm principle provided in the embodiment of the present application. As shown in fig. 2, the x-axis and the y-axis represent different dimensions of the data, the two marked regions represent the standard reference data set and the test data set respectively, the circles represent data in the standard reference data set, the squares represent data in the test data set, and X_u represents the data to be tested.

The step of judging the similarity of the data to be tested, the standard reference data set and the test data set by the server comprises the following steps:

The first step: the server calculates the distances between X_u and all points in the standard reference data set and in the test data set.

The second step: the server sorts the distances obtained in the first step, which are calculated based on the Euclidean distance formula.

The third step: the server sets the preset parameter K to 5 and selects the 5 points nearest to X_u, as indicated by the arrows in the figure.

The fourth step: the server judges the similarity of X_u to the standard reference data set and to the test data set. As can be seen from fig. 2, of the 5 points nearest to X_u, 4 data points belong to the standard reference data set and 1 data point belongs to the test data set. Because more of the nearest points belong to the standard reference data set, it can be determined that the data to be tested is similar to the standard reference data set.
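The K = 5 example can be reproduced with made-up numbers. The coordinates below are hypothetical and are not read from fig. 2; they are chosen only so that 4 of the 5 nearest points belong to the standard reference data set and 1 belongs to the test data set.

```python
import math
from collections import Counter

K = 5
x_u = (2.0, 2.0)  # the data to be tested (hypothetical coordinates)

# Hypothetical points of the standard reference data set (circles).
reference_points = [(1.5, 2.0), (2.0, 2.6), (2.5, 1.6), (1.4, 1.5), (0.2, 0.3)]
# Hypothetical remaining points of the test data set (squares).
other_test_points = [(2.6, 2.4), (4.0, 4.2), (4.5, 3.8)]

# Steps 1 and 2: compute all distances to X_u and sort them.
candidates = [(math.dist(x_u, p), "reference") for p in reference_points]
candidates += [(math.dist(x_u, p), "test") for p in other_test_points]
nearest = sorted(candidates)[:K]  # step 3: the 5 nearest points

# Step 4: count how many of the 5 nearest points come from each group.
votes = Counter(label for _, label in nearest)
print(votes["reference"], votes["test"])  # 4 1
```

Since 4 is greater than 1, the data to be tested is judged similar to the standard reference data set in this example.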

S104: the server judges whether data drift has occurred in the test data set according to the similarity judgment result of each piece of data to be tested in the test data set.

In the embodiment of the application, the server judges whether the test data group has data drift or not according to the similarity between each data to be tested in the test data group and the standard reference data group and the test data group.

In one embodiment, the server determines, as the third quantity, a quantity of data to be tested in the test data set that is similar to the standard reference data set. The server determines the number of data to be tested in the test data set similar to the test data set as a fourth number.

If the third number is larger than the fourth number, more of the data to be tested are similar to the standard reference data set than to the test data set, the statistical distribution of most of the data in the test data set is consistent with that of the standard reference data set, and no data drift has occurred in the test data set.

If the third number is smaller than the fourth number, fewer of the data to be tested are similar to the standard reference data set than to the test data set, the statistical distribution of most of the data in the test data set is inconsistent with that of the standard reference data set, and data drift has occurred in the test data set.

If the third number is equal to the fourth number, as many of the data to be tested are similar to the standard reference data set as to the test data set, and it cannot be determined whether data drift has occurred in the test data set.

In one embodiment, the number of pieces of data to be tested in the test data set collected by the server is preferably odd. This avoids the case in which the third number equals the fourth number because the test data set contains an even number of data, which would leave it undetermined whether data drift has occurred.
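Steps S103 and S104 can be combined into one end-to-end Python sketch. The function name `detect_drift` is hypothetical, Euclidean distance via `math.dist` is assumed, and two choices are assumptions of this sketch rather than statements of the application: a datum whose first and second numbers are equal is counted in neither the third nor the fourth number, and an overall tie is reported as `None`.

```python
import math
from typing import Optional, Sequence

Point = Sequence[float]


def detect_drift(
    reference_group: Sequence[Point], test_group: Sequence[Point], k: int = 5
) -> Optional[bool]:
    """Return True if drift is judged, False if not, None if undecided."""
    third = fourth = 0
    for i, target in enumerate(test_group):
        # Flag 0 marks reference data, flag 1 marks remaining test data.
        candidates = [(math.dist(target, p), 0) for p in reference_group]
        candidates += [
            (math.dist(target, p), 1)
            for j, p in enumerate(test_group)
            if j != i
        ]
        nearest = sorted(candidates)[:k]
        second = sum(flag for _, flag in nearest)  # neighbours from test set
        first = len(nearest) - second              # neighbours from reference
        if first > second:
            third += 1    # this datum is similar to the reference data set
        elif first < second:
            fourth += 1   # this datum is similar to the test data set
    if third > fourth:
        return False      # no data drift
    if third < fourth:
        return True       # data drift
    return None           # cannot be determined


reference = [(float(i), 0.0) for i in range(10)]
print(detect_drift(reference, [(0.5, 0.0), (4.5, 0.0), (8.5, 0.0)], k=3))     # False
print(detect_drift(reference, [(20.0, 0.0), (20.5, 0.0), (21.0, 0.0)], k=3))  # True
```

In the first call every test point lies among the reference points, so its 3 nearest data are all reference data; in the second call the test points form their own distant cluster, so 2 of the 3 nearest data of each point are other test points.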

In one embodiment, if the test data group drifts, the server sends the data drift result to the corresponding edge device, so that the edge device can timely monitor the time series data with the data drift and timely perform corresponding data processing on the time series data. For example, the data model adapted to the time-series data is re-determined according to the change of the statistical distribution of the time-series data.

In the embodiment of the application, the server judges whether the test data set drifts through the KNN algorithm, and the implementation method is simple and efficient, easy to implement, easy to understand, free of parameter estimation and training and low in calculation power consumption.

The test data set is effectively supervised by designing the standard reference data set, the accuracy of judging whether the test data set has data drift is improved, and the stability and the robustness of judging whether the test data set has data drift can be improved.

In addition, the method can be used in an edge device in combination with a sensor, so that changes in the data can be found as soon as they occur.

Based on the same inventive concept as the data drift discrimination method based on the nearest neighbor method described above, the embodiment of the present application further provides a corresponding data drift discrimination device based on the nearest neighbor method, as shown in fig. 3.

Fig. 3 is a schematic structural diagram of a data drift determination device based on a nearest neighbor method according to an embodiment of the present application, which specifically includes:

a first obtaining module 301, configured to obtain a standard reference data set;

a second obtaining module 302, configured to obtain a test data set;

a first judging module 303, configured to judge, based on a nearest neighbor algorithm, a similarity between the data to be tested and the standard reference data set and a similarity between the data to be tested and the test data set for each data to be tested in the test data set;

the second determining module 304 is configured to determine whether data drifting occurs in the test data set according to a similarity determination result of each to-be-tested data in the test data set.

The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A data drift discrimination method based on a nearest neighbor method is characterized by comprising the following steps:

the server acquires a standard reference data set;

acquiring a test data set;

for each data to be tested in the test data group, judging the similarity between the data to be tested and the standard reference data group and the similarity between the data to be tested and the test data group based on a nearest neighbor algorithm;

and judging whether the test data group has data drift or not according to the similarity judgment result of each to-be-tested data in the test data group.

2. The data drift discrimination method based on the nearest neighbor method according to claim 1, wherein

the standard reference data set is generated at a time earlier than the test data set.

3. The method of claim 1, wherein before the obtaining the test data set, the method further comprises:

the server determines a test data window for storing the test data set.

4. The method for discriminating data drift based on nearest neighbor method according to claim 1, wherein for each data to be tested in the test data set, the similarity between the data to be tested and the standard reference data set and the similarity between the data to be tested and the test data set are determined based on nearest neighbor algorithm, comprising:

calculating the distance between the data to be tested and each data in the standard reference data set and the distance between the data to be tested and each remaining data in the test data set;

selecting the K pieces of data closest to the data to be tested based on the distance between the data to be tested and each piece of data in the standard reference data set and the distance between the data to be tested and each remaining piece of data in the test data set, wherein K is a preset parameter;

and judging the similarity of the data to be tested to the standard reference data set and to the test data set based on the K pieces of data.

5. The nearest neighbor method-based data drift discrimination method as claimed in claim 4, wherein the preset parameter K is an odd number.

6. The method for discriminating data drift based on the nearest neighbor method as claimed in claim 4, wherein judging the similarity of the data to be tested to the standard reference data set and to the test data set based on the K pieces of data comprises:

determining the number of data belonging to the standard reference data set among the K pieces of data as a first number;

determining the number of data belonging to the test data set among the K pieces of data as a second number;

if the first number is greater than the second number, the data to be tested is similar to the standard reference data set;

and if the first number is smaller than the second number, the data to be tested is similar to the test data set.

7. The method for judging data drift based on the nearest neighbor method as claimed in claim 1, wherein judging whether the test data set has data drift according to the result of judging the similarity of each data to be tested in the test data set comprises:

determining the number of data to be tested in the test data group similar to the standard reference data group as a third number;

determining the number of data to be tested in the test data group, which is similar to the test data group, as a fourth number;

if the third number is greater than the fourth number, the test data group has no data drift;

if the third number is less than the fourth number, data drift occurs in the test data set.

8. The method for discriminating data drift based on the nearest neighbor method as claimed in claim 4, wherein calculating the distance between the data to be tested and each data in the standard reference data set and the distance between the data to be tested and each remaining data in the test data set comprises:

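calculating the distance between the data to be tested and each data in the standard reference data set and the distance between the data to be tested and each remaining data in the test data set based on a Euclidean distance formula;

the Euclidean distance formula is as follows:

$$D(x, y) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

wherein D(x, y) represents the distance between the data to be tested and the corresponding data, (x_1, y_1) represents the coordinates of the data to be tested, and (x_2, y_2) represents the coordinates of the corresponding data.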

9. The method according to claim 1, wherein the method further comprises:

and if the test data group drifts, sending a data drifting result to corresponding edge equipment so that the edge equipment performs corresponding data processing on the test data group.

10. A data drift discrimination device based on a nearest neighbor method, characterized by comprising:

the second acquisition module is used for acquiring the test data set;