CN113360484A

CN113360484A - Data correction method and device and computer readable storage medium

Info

Publication number: CN113360484A
Application number: CN202010145446.9A
Authority: CN
Inventors: 安翔宇; 翟艳梅; 范晓旭; 周松桥
Original assignee: China Telecom Corp Ltd
Current assignee: Tianyi Cloud Technology Co Ltd
Priority date: 2020-03-05
Filing date: 2020-03-05
Publication date: 2021-09-07

Abstract

The invention discloses a data correction method, a data correction apparatus, and a computer-readable storage medium, and relates to the field of data processing. The data correction method comprises: inputting a plurality of pieces of data, which are acquired from Deep Packet Inspection (DPI) data and precede the time corresponding to the data to be corrected, into a feature extraction network to obtain core features of the plurality of pieces of data output by the feature extraction network; constructing a core feature sequence of the plurality of pieces of data according to their time order; inputting the core feature sequence into a pre-trained generator based on a long short-term memory (LSTM) network to generate predicted data; and replacing the data to be corrected with the predicted data. The invention thereby performs prediction using the time-series information of the DPI data. This prediction approach better matches the characteristics of DPI data, which improves the effectiveness of DPI data correction and the accuracy of the corrected data.

Description

Data correction method and device and computer readable storage medium

Technical Field

The present invention relates to the field of data processing, and in particular to a data correction method, a data correction apparatus, and a computer-readable storage medium.

Background

Owing to uncontrollable factors such as network fluctuation, resource load, and abnormal source data, abnormal data can be generated during DPI data transmission. As a result, the quality of the obtained DPI data is low, which greatly affects the development of data products. To address this problem, the related art corrects DPI data using manual rules or clustering models such as Canopy and K-means.

Disclosure of Invention

Through analysis, the inventors found that the correction results of the methods used in the related art are not ideal.

The technical problem to be solved by embodiments of the invention is how to improve the effectiveness of DPI data correction and the accuracy of the corrected data.

According to a first aspect of some embodiments of the present invention, there is provided a data correction method, including: inputting a plurality of pieces of data, which are acquired from Deep Packet Inspection (DPI) data and precede the time corresponding to the data to be corrected, into a feature extraction network to obtain core features of the plurality of pieces of data output by the feature extraction network; constructing a core feature sequence of the plurality of pieces of data according to their time order; inputting the core feature sequence into a pre-trained generator based on a long short-term memory (LSTM) network to generate predicted data; and replacing the data to be corrected with the predicted data.

In some embodiments, the feature extraction network comprises a convolutional neural network and a core feature extraction layer; the convolutional neural network extracts hidden features from the plurality of pieces of data, and the core feature extraction layer extracts core features from the hidden features.

In some embodiments, the convolutional neural network has a residual structure.

In some embodiments, the convolutional neural network is an Inception-ResNet network.

In some embodiments, the core feature extraction layer is an attention layer.

In some embodiments, the LSTM network-based generator is the generator of a generative adversarial network (GAN), and the GAN further comprises a discriminator; the data correction method further comprises: inputting target data acquired from DPI data used for training, together with a plurality of pieces of training data that precede the time corresponding to the target data, into the feature extraction network to obtain core features of the target data and of the plurality of pieces of training data output by the feature extraction network; constructing a core feature training sequence of the plurality of pieces of training data according to their time order; inputting the core feature training sequence into the generator based on an LSTM network trained in advance to generate predicted data; inputting the predicted data into the feature extraction network to obtain core features of the predicted data output by the feature extraction network; and inputting the core features of the predicted data and the core features of the target data into the discriminator, so as to train the feature extraction network and the GAN according to the determination result of the discriminator.

In some embodiments, the data correction method further comprises: determining data in the DPI data that has a field whose value is empty or abnormal as the data to be corrected; the pieces of data that precede the time corresponding to the data to be corrected have the same field as the data to be corrected, and their values are neither null nor abnormal.

According to a second aspect of some embodiments of the present invention, there is provided a data correction apparatus, including: a feature extraction module configured to input a plurality of pieces of data, which are acquired from Deep Packet Inspection (DPI) data and precede the time corresponding to the data to be corrected, into a feature extraction network to obtain core features of the plurality of pieces of data output by the feature extraction network; a sequence construction module configured to construct a core feature sequence of the plurality of pieces of data according to their time order; a data generation module configured to input the core feature sequence into a pre-trained generator based on a long short-term memory (LSTM) network to generate predicted data; and a correction module configured to replace the data to be corrected with the predicted data.

According to a third aspect of some embodiments of the present invention, there is provided a data correction apparatus, including: a memory; and a processor coupled to the memory, the processor configured to perform any of the foregoing data correction methods based on instructions stored in the memory.

According to a fourth aspect of some embodiments of the present invention, there is provided a computer-readable storage medium having a computer program stored thereon, where the program, when executed by a processor, implements any one of the foregoing data correction methods.

Some embodiments of the above invention have the following advantages or benefits: the invention can make a prediction from the core features of a plurality of pieces of data that precede the time corresponding to the data to be corrected, so as to obtain the correct data that should appear at that time, thereby performing prediction using the time-series information of the DPI data. This prediction approach better matches the characteristics of DPI data, which improves the effectiveness of DPI data correction and the accuracy of the corrected data.

Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 illustrates a flow diagram of a data correction method according to some embodiments of the present invention.

FIG. 2 illustrates a flow diagram of a feature extraction method according to some embodiments of the invention.

FIG. 3 illustrates a flow diagram of a training method according to some embodiments of the invention.

FIG. 4 illustrates a block diagram of a data correction apparatus according to some embodiments of the present invention.

FIG. 5 is a schematic diagram of a data correction apparatus according to other embodiments of the present invention.

FIG. 6 is a schematic diagram of a data correction apparatus according to further embodiments of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.

Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

After further analysis, the inventors found that one characteristic of Deep Packet Inspection (DPI) data is that it records the acquisition time of each field, and therefore has a strong time-series nature. DPI data can thus be corrected by mining its time-series information. An embodiment of the data correction method of the present invention is described below with reference to fig. 1.

FIG. 1 illustrates a flow diagram of a data correction method according to some embodiments of the present invention. As shown in fig. 1, the data correction method of this embodiment includes steps S102 to S108.

In step S102, a plurality of pieces of data, which are acquired from the DPI data and precede the time corresponding to the data to be corrected, are input into the feature extraction network to obtain core features of the plurality of pieces of data output by the feature extraction network.

In some embodiments, data in the DPI data that has a field whose value is empty or abnormal is determined as the data to be corrected. The pieces of data that precede the time corresponding to the data to be corrected have the same field as the data to be corrected, and their values are neither null nor abnormal. The data to be corrected can be identified by searching, matching, and the like, which is not described in detail here.

For example, suppose the DPI data records the amount of traffic a certain mobile phone user uses per hour. When the user's traffic at 10 am on a certain day is found to be 630G, a value that clearly exceeds any reasonable value, or the traffic field is empty, the user's traffic at 10 am on that day may be taken as the data to be corrected, and the user's traffic data from 0:00 two days earlier through 9 am on that day may be taken as the "plurality of pieces of data" in step S102. Depending on the actual situation, the data to be corrected may also include a set of fields.
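The selection described in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the threshold `MAX_REASONABLE_GB`, the helper names `needs_correction` and `select_history`, and the `(timestamp, value)` record layout are assumptions made for the example and do not come from the specification.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: hourly traffic above this value is treated as abnormal.
MAX_REASONABLE_GB = 100.0


def needs_correction(value, max_value=MAX_REASONABLE_GB):
    """Return True when the field is empty or holds an abnormal value."""
    return value is None or value < 0 or value > max_value


def select_history(records, target_time, lookback):
    """Pick valid records of the same field that precede target_time.

    records: list of (timestamp, value) pairs for one field of one user.
    lookback: how far back in time to look (a timedelta).
    """
    start = target_time - lookback
    history = [
        (ts, value)
        for ts, value in records
        if start <= ts < target_time and not needs_correction(value)
    ]
    # Keep the records in time order for the later sequence construction.
    return sorted(history, key=lambda item: item[0])


# Usage: hourly readings from 0:00 to 9:00, then an abnormal reading at 10:00.
day = datetime(2020, 3, 5)
records = [(day + timedelta(hours=h), 1.0 + 0.1 * h) for h in range(10)]
records.append((day + timedelta(hours=10), 630.0))
target_time, target_value = records[-1]
history = select_history(records, target_time, timedelta(days=2))
```

Here the 10:00 reading is flagged by `needs_correction`, and `history` holds the ten valid readings that precede it.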

In step S104, a core feature sequence of the plurality of pieces of data is constructed according to the time order of the plurality of pieces of data.

In step S106, the core feature sequence is input into a pre-trained generator based on a Long Short-Term Memory (LSTM) network to generate predicted data.

LSTM is a type of recurrent neural network. An LSTM can not only use information at the current time step to predict information at the next time step, but can also draw on information from earlier time steps through the cell structure in the network. In a user's network behavior, not only are adjacent time points related, but the behavior at the current time point is also related to behavior at earlier time points. For example, a user may browse videos on a video website during the daily commute, which occurs at a relatively fixed time. Therefore, when correcting the user's video browsing data for 8 o'clock on day t, the video browsing data for 8 o'clock on day t-1 is also a valuable reference. Using an LSTM network therefore allows predicted data to be generated more accurately.
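To make the role of the cell state concrete, the following is a minimal sketch of one step of a scalar LSTM cell in its commonly used formulation, written with only the Python standard library. The parameter names (`wf`, `uf`, `bf`, and so on) and the untrained parameter values are assumptions for illustration; the specification does not prescribe a particular LSTM implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    Each gate has an input weight (w*), a recurrent weight (u*) and a bias (b*).
    """
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    # The cell state mixes what is kept from earlier steps with new information.
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_lstm(sequence, p):
    """Feed a time-ordered sequence through the cell and return the final state."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, p)
    return h, c


GATE_KEYS = ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg")
# Arbitrary, untrained parameters used only to exercise the code.
params = {key: 0.5 for key in GATE_KEYS}
h, c = run_lstm([0.1, 0.2, 0.3], params)
```

The line `c = f * c_prev + i * g` is where information from earlier time steps is carried forward, which is the property the paragraph above relies on.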

In step S108, the data to be corrected is replaced with the predicted data.

With the method of this embodiment, a prediction can be made from the core features of a plurality of pieces of data that precede the time corresponding to the data to be corrected, so as to obtain the correct data that should appear at that time, thereby performing prediction using the time-series information of the DPI data. This prediction approach better matches the characteristics of DPI data, which improves the effectiveness of DPI data correction and the accuracy of the corrected data.
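Steps S102 to S108 can be tied together as in the sketch below. The function name `correct_record` and the toy stand-ins `identity_features` and `last_value_generator` are hypothetical: a real system would pass in the trained feature extraction network and the pre-trained LSTM-based generator instead.

```python
def correct_record(history, extract_features, generator):
    """Sketch of steps S102 to S106; the caller performs the replacement (S108).

    history: list of (timestamp, record) pairs that precede the record to correct.
    extract_features: callable standing in for the feature extraction network.
    generator: callable standing in for the pre-trained LSTM-based generator.
    """
    # S102: obtain a core feature for every piece of prior data.
    featured = [(ts, extract_features(rec)) for ts, rec in history]
    # S104: build the core feature sequence in time order.
    featured.sort(key=lambda item: item[0])
    sequence = [feature for _, feature in featured]
    # S106: generate the predicted data from the sequence.
    return generator(sequence)


def identity_features(record):
    """Toy stand-in: the record itself is used as its core feature."""
    return record


def last_value_generator(sequence):
    """Toy stand-in: predict that the most recent value repeats."""
    return sequence[-1]


# Prior data deliberately given out of time order.
history = [(2, 3.0), (0, 1.0), (1, 2.0)]
record = {"hour": 3, "traffic": None}  # empty field, so it needs correction
# S108: replace the data to be corrected with the predicted data.
record["traffic"] = correct_record(history, identity_features, last_value_generator)
```

Because the sequence is sorted by timestamp before it reaches the generator, the toy generator returns the value recorded at the latest prior time.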

In some embodiments, the feature extraction network may be made up of multiple sub-networks or layers. An embodiment of the feature extraction method of the present invention is described below with reference to fig. 2.

FIG. 2 illustrates a flow diagram of a feature extraction method according to some embodiments of the invention. As shown in fig. 2, the feature extraction method of this embodiment includes steps S202 to S204, and the feature extraction network includes a convolutional neural network and a core feature extraction layer.

In step S202, the convolutional neural network extracts hidden features from a plurality of pieces of data.

In some embodiments, the convolutional neural network has a residual structure. In most networks without a residual structure, the input to each layer is the output of the previous layer. In a network with a residual structure, the input of some layers includes not only the output of the immediately preceding layer but also the output of layers before it. This structure can improve training efficiency and accuracy, and in turn the processing efficiency and accuracy of data correction.
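A common way to realize such a structure is the additive skip connection used in ResNet-style networks, where a block outputs its layer's result plus its own input. The sketch below assumes that form and uses a toy fully connected layer rather than a convolution; `dense_relu` and `residual_block` are illustrative names, not part of the specification.

```python
def dense_relu(x, weights, bias):
    """Toy layer: y = relu(W x + b), with W given as a list of rows."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def residual_block(x, weights, bias):
    """Return layer(x) + x, so the block's input also reaches later layers."""
    fx = dense_relu(x, weights, bias)
    return [a + b for a, b in zip(fx, x)]


x = [1.0, -2.0]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
# With an all-zero layer the block reduces to the identity mapping.
out = residual_block(x, zero_w, zero_b)
```

The all-zero case shows why such blocks are easy to train: even when the layer contributes nothing, the input still passes through unchanged.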

In some embodiments, the convolutional neural network is an Inception-ResNet network. Inception-ResNet is a convolutional neural network proposed by Google that has a residual structure. When training based on this network, parameters may be adjusted for only the last layer of the network, which further improves training efficiency.

In step S204, the core feature extraction layer extracts core features from the hidden features. Thus, important features among the hidden features can be further extracted.

In some embodiments, the core feature extraction layer is an attention layer, which is implemented using an attention mechanism. In implementation, an existing attention layer module can be used: the hidden features extracted in the previous step are passed in through the module's API, and the core features output by the attention layer are obtained. The attention layer assigns a weight to each sub-feature of the input hidden features through its built-in algorithm, so as to highlight the core information in the hidden features.
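The weight assignment can be illustrated with a simple dot-product attention pooling, sketched below. This is one common form of attention and is assumed here for illustration only; the specification does not state which attention variant is used, and the `query` vector stands in for a parameter that would be learned during training.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_features, query):
    """Weight each hidden sub-feature and return (core_feature, weights).

    hidden_features: list of equally sized vectors.
    query: vector of the same size (hypothetical learned parameter).
    """
    scores = [sum(q * h for q, h in zip(query, feat)) for feat in hidden_features]
    weights = softmax(scores)
    dim = len(hidden_features[0])
    core = [
        sum(w * feat[d] for w, feat in zip(weights, hidden_features))
        for d in range(dim)
    ]
    return core, weights


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
core, weights = attention_pool(hidden, query=[2.0, 2.0])
```

In this toy input the third sub-feature scores highest against the query, so it receives the largest weight and dominates the resulting core feature.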

By the method of the embodiment, hidden and important features in the DPI data can be extracted, so that interference information in the original data can be removed, and prediction can be performed more accurately.

In some embodiments, the LSTM network-based generator is the generator of a generative adversarial network (GAN), and the GAN further comprises a discriminator. The generator can thus be trained by training the GAN. An embodiment of the training method of the present invention is described below with reference to fig. 3.

FIG. 3 illustrates a flow diagram of a training method according to some embodiments of the invention. As shown in fig. 3, the training method of this embodiment includes steps S302 to S310.

In step S302, target data acquired from DPI data used for training and a plurality of pieces of training data before a time corresponding to the target data are input to a feature extraction network, and core features of the target data and the plurality of pieces of training data output by the feature extraction network are obtained.

The structure of the feature extraction network may refer to the foregoing embodiments, and details are not repeated here.

The target data corresponds to the data to be corrected in the actual correction process. During training, however, the target data has a non-null, non-abnormal value, so that it can be compared with the predicted data in order to adjust the model.
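One simple way to obtain such pairs of prior data and target data from a clean historical series is a sliding window, sketched below. The helper name `make_training_samples` and the fixed window length are assumptions for illustration; the specification does not say how training samples are cut from the DPI data.

```python
def make_training_samples(series, window):
    """Cut a time-ordered series of valid values into (history, target) pairs.

    Each target is a real, non-abnormal value, and its history is the
    `window` values that immediately precede it.
    """
    samples = []
    for end in range(window, len(series)):
        samples.append((series[end - window:end], series[end]))
    return samples


samples = make_training_samples([1, 2, 3, 4, 5], window=3)
```

With the five-value series above, this yields two samples: the first three values with the fourth as target, and the second to fourth values with the fifth as target.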

In step S304, a core feature training sequence of the plurality of pieces of training data is constructed according to the time order of the plurality of pieces of training data.

In step S306, the core feature training sequence is input into the generator based on the LSTM network trained in advance, and predicted data is generated.

In step S308, the predicted data is input into the feature extraction network, and the core features of the predicted data output by the feature extraction network are obtained.

In step S310, the core features of the predicted data and the core features of the target data are input into the discriminator, so that the feature extraction network and the generative adversarial network are trained based on the determination result of the discriminator. The discriminator judges whether the generated data is real and outputs the probability that it is real. When the probability is about 0.5, the discriminator cannot tell whether the generated data is real; that is, the data generated by the generator is difficult to distinguish from actual data. Training may be ended at this point.
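The stopping condition can be sketched as follows. The binary cross-entropy losses are the formulation commonly used for generative adversarial networks and are an assumption here, since the specification does not give a loss function; the `tolerance` around 0.5 and the function names are likewise hypothetical.

```python
import math


def bce(prob, label):
    """Binary cross-entropy for a single predicted probability."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(p_real, p_fake):
    """The discriminator wants real data scored near 1 and generated data near 0."""
    return bce(p_real, 1) + bce(p_fake, 0)


def generator_loss(p_fake):
    """The generator wants its output to be scored as real."""
    return bce(p_fake, 1)


def should_stop(recent_fake_probs, tolerance=0.05):
    """True when the mean probability assigned to generated data is close to 0.5."""
    mean_p = sum(recent_fake_probs) / len(recent_fake_probs)
    return abs(mean_p - 0.5) <= tolerance


stop_now = should_stop([0.48, 0.52, 0.50])
keep_training = not should_stop([0.10, 0.20])
```

Averaging over several recent probabilities, rather than testing a single value, is a design choice made in this sketch to avoid stopping on one noisy batch.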

With the method of this embodiment, the model can be trained through the dynamic game between the generator and the discriminator, which further improves prediction accuracy. Moreover, during training, data at a future time is predicted from the time-series information and compared with the real data at the same time, so the trained model is suited to the actual correction scenario.

An embodiment of the data correction apparatus of the present invention is described below with reference to fig. 4.

FIG. 4 illustrates a block diagram of a data correction apparatus according to some embodiments of the present invention. As shown in fig. 4, the data correction apparatus 40 of this embodiment includes: a feature extraction module 410 configured to input a plurality of pieces of data, which are acquired from Deep Packet Inspection (DPI) data and precede the time corresponding to the data to be corrected, into a feature extraction network to obtain core features of the plurality of pieces of data output by the feature extraction network; a sequence construction module 420 configured to construct a core feature sequence of the plurality of pieces of data according to their time order; a data generation module 430 configured to input the core feature sequence into a pre-trained generator based on a long short-term memory (LSTM) network to generate predicted data; and a correction module 440 configured to replace the data to be corrected with the predicted data.

In some embodiments, the convolutional neural network has a residual structure.

In some embodiments, the core feature extraction layer is an attention layer.

In some embodiments, the LSTM network-based generator is the generator of a generative adversarial network (GAN), and the GAN further comprises a discriminator; the data correction apparatus further comprises a training module 450 configured to: input target data acquired from DPI data used for training, together with a plurality of pieces of training data that precede the time corresponding to the target data, into the feature extraction network to obtain core features of the target data and of the plurality of pieces of training data output by the feature extraction network; construct a core feature training sequence of the plurality of pieces of training data according to their time order; input the core feature training sequence into the generator based on an LSTM network trained in advance to generate predicted data; input the predicted data into the feature extraction network to obtain core features of the predicted data output by the feature extraction network; and input the core features of the predicted data and the core features of the target data into the discriminator, so as to train the feature extraction network and the GAN according to the determination result of the discriminator.

In some embodiments, the data correction apparatus further comprises: a determining module 460 configured to determine data in the DPI data that has a field whose value is empty or abnormal as the data to be corrected; the pieces of data that precede the time corresponding to the data to be corrected have the same field as the data to be corrected, and their values are neither null nor abnormal.

FIG. 5 is a schematic diagram of a data correction apparatus according to other embodiments of the present invention. As shown in fig. 5, the data correction apparatus 50 of this embodiment includes: a memory 510 and a processor 520 coupled to the memory 510, the processor 520 being configured to perform the data correction method of any of the embodiments described above based on instructions stored in the memory 510.

Memory 510 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.

FIG. 6 is a schematic diagram of a data correction apparatus according to further embodiments of the present invention. As shown in fig. 6, the data correction apparatus 60 of this embodiment includes a memory 610 and a processor 620, and may further include an input/output interface 630, a network interface 640, a storage interface 650, and the like. The interfaces 630, 640, and 650, the memory 610, and the processor 620 may be connected, for example, via a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as an SD card and a USB flash drive.

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements any one of the foregoing data correction methods.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A data correction method, comprising:

inputting a plurality of pieces of data, which are acquired from Deep Packet Inspection (DPI) data and precede the time corresponding to the data to be corrected, into a feature extraction network to obtain core features of the plurality of pieces of data output by the feature extraction network;

constructing a core feature sequence of the plurality of pieces of data according to the time order of the plurality of pieces of data;

inputting the core feature sequence into a pre-trained generator based on a long short-term memory (LSTM) network to generate predicted data;

and replacing the data to be corrected with the predicted data.

2. The data correction method of claim 1, wherein the feature extraction network comprises a convolutional neural network and a core feature extraction layer;

the convolutional neural network extracts hidden features from the plurality of pieces of data, and the core feature extraction layer extracts core features from the hidden features.

3. The data correction method of claim 2, wherein the convolutional neural network has a residual structure.

4. The data correction method of claim 3, wherein the convolutional neural network is an Inception-ResNet network.

5. The data correction method of claim 2, wherein the core feature extraction layer is an attention layer.

6. The data correction method of claim 1, wherein the LSTM network-based generator is the generator of a generative adversarial network, the generative adversarial network further comprising a discriminator;

the data correction method further comprises:

inputting target data acquired from DPI data used for training, together with a plurality of pieces of training data that precede the time corresponding to the target data, into the feature extraction network to obtain core features of the target data and of the plurality of pieces of training data output by the feature extraction network;

constructing a core feature training sequence of the plurality of pieces of training data according to the time order of the plurality of pieces of training data;

inputting the core feature training sequence into the generator based on an LSTM network trained in advance to generate predicted data;

inputting the predicted data into the feature extraction network to obtain core features of the predicted data output by the feature extraction network;

inputting the core features of the predicted data and the core features of the target data into the discriminator, so as to train the feature extraction network and the generative adversarial network according to the determination result of the discriminator.

7. The data correction method of claim 1, further comprising:

determining data in the DPI data that has a field whose value is empty or abnormal as the data to be corrected;

wherein the plurality of pieces of data that precede the time corresponding to the data to be corrected have the same field as the data to be corrected, and their values are neither null nor abnormal.

8. A data correction apparatus comprising:

a feature extraction module configured to input a plurality of pieces of data, which are acquired from Deep Packet Inspection (DPI) data and precede the time corresponding to the data to be corrected, into a feature extraction network to obtain core features of the plurality of pieces of data output by the feature extraction network;

a sequence construction module configured to construct a core feature sequence of the plurality of pieces of data according to the time order of the plurality of pieces of data;

a data generation module configured to input the core feature sequence into a pre-trained generator based on a long short-term memory (LSTM) network to generate predicted data;

a correction module configured to replace the data to be corrected with the predicted data.

9. A data correction apparatus comprising:

a memory; and

a processor coupled to the memory, the processor configured to perform the data correction method of any of claims 1-7 based on instructions stored in the memory.

10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the data correction method of any one of claims 1 to 7.