CN113360484B

CN113360484B - Data correction method, device and computer readable storage medium

Info

Publication number: CN113360484B
Application number: CN202010145446.9A
Authority: CN
Inventors: 安翔宇; 翟艳梅; 范晓旭; 周松桥
Original assignee: Tianyi Cloud Technology Co Ltd
Current assignee: Tianyi Cloud Technology Co Ltd
Priority date: 2020-03-05
Filing date: 2020-03-05
Publication date: 2024-07-09
Anticipated expiration: 2040-03-05
Also published as: CN113360484A

Abstract

The invention discloses a data correction method, a data correction device and a computer readable storage medium, and relates to the field of data processing. The data correction method comprises the following steps: inputting a plurality of pieces of data, obtained from deep packet inspection (DPI) data and preceding the time corresponding to the data to be rectified, into a feature extraction network to obtain core features of the plurality of pieces of data output by the feature extraction network; constructing a core feature sequence of the plurality of pieces of data according to their corresponding time order; inputting the core feature sequence into a pre-trained generator based on a long short-term memory (LSTM) network to generate predicted data; and replacing the data to be rectified with the predicted data. The invention thereby predicts from the time-sequence information of DPI data. Because this mode of prediction better matches the characteristics of DPI data, it improves both the effectiveness of DPI data correction and the accuracy of the corrected data.

Description

Data correction method, device and computer readable storage medium

Technical Field

The present invention relates to the field of data processing, and in particular, to a data correction method, apparatus, and computer readable storage medium.

Background

Because of uncontrollable factors such as network fluctuation, resource load and abnormal source data, abnormal data can be generated while DPI data is being transmitted. The acquired DPI data is therefore of low quality, which greatly hampers the development of data products. To address this problem, the related art corrects DPI data using manual rules or clustering models such as Canopy and K-means.

Disclosure of Invention

After analysis, the inventors found that the correction results of the methods used in the related art are not ideal.

One technical problem to be solved by the embodiment of the invention is as follows: how to improve the correction effectiveness of DPI data and the accuracy of corrected data.

According to a first aspect of some embodiments of the present invention, there is provided a data deskewing method, including: inputting a plurality of pieces of data, obtained from deep packet inspection (DPI) data and preceding the time corresponding to the data to be rectified, into a feature extraction network to obtain core features of the plurality of pieces of data output by the feature extraction network; constructing a core feature sequence of the plurality of pieces of data according to their corresponding time order; inputting the core feature sequence into a pre-trained generator based on a long short-term memory (LSTM) network to generate predicted data; and replacing the data to be rectified with the predicted data.

In some embodiments, the feature extraction network includes a convolutional neural network and a core feature extraction layer; the convolutional neural network extracts hidden features from a plurality of pieces of data, and the core feature extraction layer extracts core features from the hidden features.

In some embodiments, the convolutional neural network has a residual structure.

In some embodiments, the convolutional neural network is an Inception-ResNet network.

In some embodiments, the core feature extraction layer is an attention layer.

In some embodiments, the LSTM network-based generator is the generator in a generative adversarial network (GAN), and the generative adversarial network further comprises a discriminator; the data correction method further comprises the following steps: inputting target data acquired from DPI data for training and a plurality of pieces of training data before the time corresponding to the target data into the feature extraction network to obtain core features of the target data and of the plurality of pieces of training data output by the feature extraction network; constructing a core feature training sequence of the plurality of pieces of training data according to their corresponding time order; inputting the core feature sequence into a pre-trained LSTM network-based generator to generate predicted data; inputting the predicted data into the feature extraction network to obtain core features of the predicted data output by the feature extraction network; and inputting the core features of the predicted data and the core features of the target data into the discriminator so as to train the feature extraction network and the generative adversarial network according to the discrimination result of the discriminator.

In some embodiments, the data deskewing method further comprises: determining the data with null values or the fields with abnormal values in DPI data as data to be rectified; the data before the time corresponding to the data to be rectified and the data to be rectified have the same field, and the numerical values are not null and are not abnormal values.

According to a second aspect of some embodiments of the present invention, there is provided a data rectification apparatus comprising: a feature extraction module configured to input a plurality of pieces of data, acquired from deep packet inspection (DPI) data and preceding the time corresponding to the data to be rectified, into the feature extraction network, and obtain core features of the plurality of pieces of data output by the feature extraction network; a sequence construction module configured to construct a core feature sequence of the plurality of pieces of data according to their corresponding time order; a data generation module configured to input the core feature sequence into a pre-trained long short-term memory (LSTM) network-based generator to generate predicted data; and a rectification module configured to replace the data to be rectified with the predicted data.

According to a third aspect of some embodiments of the present invention, there is provided a data rectification apparatus, comprising: a memory; and a processor coupled to the memory, the processor configured to perform any of the foregoing data deskewing methods based on the instructions stored in the memory.

According to a fourth aspect of some embodiments of the present invention, there is provided a computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements any of the foregoing data deskewing methods.

Some of the embodiments of the above invention have the following advantages or benefits: the invention can predict the core characteristics of a plurality of pieces of data before the time corresponding to the data to be rectified so as to obtain the correct data which should appear at the time corresponding to the data to be rectified, thereby realizing the scheme of predicting by utilizing the time sequence information of DPI data. The prediction mode is more in line with the characteristics of DPI data, so that the correction effectiveness of the DPI data and the correction accuracy of the corrected data are improved.

Other features of the present invention and its advantages will become apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.

Drawings

In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 illustrates a flow diagram of a data deskewing method according to some embodiments of the invention.

Fig. 2 illustrates a flow diagram of a feature extraction method according to some embodiments of the invention.

Fig. 3 illustrates a flow diagram of a training method according to some embodiments of the invention.

Fig. 4 is a schematic diagram illustrating a structure of a data rectification apparatus according to some embodiments of the present invention.

Fig. 5 shows a schematic structural diagram of a data correction device according to other embodiments of the present invention.

Fig. 6 shows a schematic structural diagram of a data correction device according to further embodiments of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.

Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.

Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.

In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.

After further analysis, the inventors found that one characteristic of Deep Packet Inspection (DPI) data is that each field carries its acquisition time, so DPI data has a strong temporal ordering. DPI data can therefore be corrected by mining its timing information. An embodiment of the data deskewing method of the present invention is described below with reference to fig. 1.

Fig. 1 illustrates a flow diagram of a data deskewing method according to some embodiments of the invention. As shown in fig. 1, the data rectification method of this embodiment includes steps S102 to S108.

In step S102, a plurality of pieces of data before a time corresponding to the data to be rectified obtained from the DPI data are input into a feature extraction network, and core features of the plurality of pieces of data output by the feature extraction network are obtained.

In some embodiments, data in the DPI data whose value is null, or a field whose value is abnormal, is determined as the data to be rectified. The plurality of pieces of data before the time corresponding to the data to be rectified have the same field as the data to be rectified, and their values are neither null nor abnormal. The data to be rectified may be identified by searching, matching and similar means, which are not described in detail here.

For example, suppose the DPI data records the amount of traffic used per hour by a particular mobile phone user. If the user's traffic at 10 am on a certain day is found to be 630 GB, a value clearly beyond the reasonable range, or if the traffic field is empty, the traffic at 10 am on that day can be taken as the data to be rectified, and the user's hourly traffic data recorded before that time, from the two preceding days through 9 am on that day, can be taken as the "plurality of pieces of data" in step S102. Depending on the actual situation, the data to be rectified may also comprise a group of fields.
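The selection described in this example can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Record` type, the `traffic_gb` field name and the `MAX_REASONABLE_GB` threshold are hypothetical, and a real system would choose its own abnormality test.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional


class Record(NamedTuple):
    time: datetime
    traffic_gb: Optional[float]  # None models a null field


# Hypothetical plausibility bound, chosen only for this illustration.
MAX_REASONABLE_GB = 100.0


def needs_correction(rec: Record) -> bool:
    """True when the field is null or its value is abnormal."""
    if rec.traffic_gb is None:
        return True
    return not (0.0 <= rec.traffic_gb <= MAX_REASONABLE_GB)


def history_before(records: List[Record], target: Record, limit: int) -> List[Record]:
    """Valid records earlier than the target, oldest first, at most `limit` of them."""
    earlier = [r for r in records if r.time < target.time and not needs_correction(r)]
    earlier.sort(key=lambda r: r.time)
    return earlier[-limit:] if limit > 0 else []


# Hourly traffic from 0:00 to 10:00; hour 5 is null and hour 10 is abnormal.
start = datetime(2020, 3, 5, 0)
values = [0.2, 0.1, 0.1, 0.1, 0.2, None, 0.5, 0.9, 1.4, 1.1, 630.0]
records = [Record(start + timedelta(hours=h), v) for h, v in enumerate(values)]

targets = [r for r in records if needs_correction(r)]
history = history_before(records, targets[-1], limit=24)
```

Here `targets` holds the records to be rectified, and `history` holds the valid earlier records that would be passed to the feature extraction network in step S102.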

In step S104, a core feature sequence of the plurality of pieces of data is constructed in accordance with the corresponding time sequence of the plurality of pieces of data.
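A minimal sketch of step S104, assuming each core feature vector arrives paired with a sortable timestamp (here a plain hour number). The function name and the pair layout are illustrative assumptions, not taken from the patent.

```python
from typing import List, Sequence, Tuple


def build_feature_sequence(
    timed_features: Sequence[Tuple[int, Sequence[float]]]
) -> List[List[float]]:
    """Order core feature vectors by their timestamps, oldest first."""
    ordered = sorted(timed_features, key=lambda pair: pair[0])
    return [list(features) for _, features in ordered]


# Feature vectors for hours 9, 7 and 8 arrive out of order.
sequence = build_feature_sequence([(9, [0.3]), (7, [0.1]), (8, [0.2])])
```

The resulting list is what a sequence model would consume one time step at a time.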

In step S106, the core feature sequence is input into a pre-trained Long Short-Term Memory network (LSTM) based generator to generate predicted data.

LSTM is a type of recurrent neural network. An LSTM can not only use information from the current time step to predict information at the next time step, but can also draw on information from earlier time steps through the cell structure in the network. In a user's network behavior, not only are adjacent moments correlated, but the current moment is also correlated with earlier moments. For example, a user may browse videos on a video website during the daily commute, and this time is relatively fixed. Therefore, when correcting the user's video browsing data for 8 o'clock on day t, the video browsing data for 8 o'clock on day t-1 is also of great reference value. Using an LSTM network thus allows the predicted data to be generated more accurately.
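To make the role of the cell state concrete, the following is a sketch of one step of a standard LSTM cell in plain Python. It uses the textbook gate equations rather than anything specified in the patent, and the parameter layout (a dictionary keyed by the hypothetical gate names `forget`, `input`, `output` and `candidate`) is an assumption made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Params = Dict[str, Tuple[Matrix, Vector]]  # gate name -> (weights, bias)


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _affine(weights: Matrix, bias: Vector, z: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, bias)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector, params: Params) -> Tuple[Vector, Vector]:
    """One time step: returns the new hidden state and the new cell state."""
    z = list(x) + list(h_prev)  # concatenate input and previous hidden state
    f = [_sigmoid(a) for a in _affine(*params["forget"], z)]
    i = [_sigmoid(a) for a in _affine(*params["input"], z)]
    o = [_sigmoid(a) for a in _affine(*params["output"], z)]
    g = [math.tanh(a) for a in _affine(*params["candidate"], z)]
    # The cell state carries older information forward, scaled by the forget gate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_lstm(sequence: Sequence[Vector], params: Params, hidden_size: int) -> Vector:
    """Feed a time-ordered sequence through the cell; return the last hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h
```

Because the new cell state is a gated sum of the previous cell state and the current candidate, information from a much earlier step (such as the same hour on the previous day) can persist across many steps when the forget gate stays close to 1.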

In step S108, the predicted data is used to replace the data to be rectified.

By the method of the embodiment, prediction can be performed according to the core characteristics of a plurality of pieces of data before the time corresponding to the data to be rectified, so as to obtain correct data which should appear at the time corresponding to the data to be rectified, and therefore a scheme for predicting by using time sequence information of DPI data is realized. The prediction mode is more in line with the characteristics of DPI data, so that the correction effectiveness of the DPI data and the correction accuracy of the corrected data are improved.
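Steps S102 to S108 can be sketched end to end as below. The `extract` and `generate` callables are placeholders standing in for the feature extraction network and the LSTM-based generator; the stand-ins used in the example (an identity feature and a mean predictor) are assumptions for illustration only and are not the models described in this embodiment.

```python
from typing import Callable, List, Optional, Sequence

Feature = List[float]


def correct_series(
    values: Sequence[Optional[float]],
    index: int,
    extract: Callable[[float], Feature],
    generate: Callable[[List[Feature]], float],
    window: int = 24,
    is_valid: Callable[[Optional[float]], bool] = lambda v: v is not None,
) -> List[Optional[float]]:
    """Replace values[index] with a prediction made from earlier valid values."""
    # S102: take valid data before the target time and extract core features.
    history = [v for v in values[max(0, index - window):index] if is_valid(v)]
    features = [extract(v) for v in history]
    # S104: values are assumed to be stored in time order, so the order is kept.
    sequence = features
    # S106: generate predicted data from the core feature sequence.
    predicted = generate(sequence)
    # S108: replace the data to be rectified with the predicted data.
    corrected = list(values)
    corrected[index] = predicted
    return corrected


# Placeholder models: identity features and a mean predictor.
series = [1.0, 2.0, None, 3.0, 630.0]
result = correct_series(
    series,
    index=4,
    extract=lambda v: [v],
    generate=lambda seq: sum(f[0] for f in seq) / len(seq),
)
```

The function returns a new list and leaves the input untouched, which makes it easy to compare the series before and after correction.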

In some embodiments, the feature extraction network may be comprised of multiple sub-networks or layers. An embodiment of the feature extraction method of the present invention is described below with reference to fig. 2.

Fig. 2 illustrates a flow diagram of a feature extraction method according to some embodiments of the invention. As shown in fig. 2, the feature extraction method of this embodiment includes steps S202 to S204, and the feature extraction network includes a convolutional neural network and a core feature extraction layer.

In step S202, the convolutional neural network extracts hidden features from the pieces of data.

In some embodiments, the convolutional neural network has a residual structure. In most networks without a residual structure, the input of each layer is the output of the previous layer. In a network with a residual structure, by contrast, the inputs of some layers include not only the output of the adjacent previous layer but also the outputs of other layers before it. This structure can improve training efficiency and accuracy, and in turn the efficiency and accuracy of data correction.
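The idea of a residual connection can be sketched as follows, assuming a layer that maps a vector to a vector of the same length. The function name and the toy layer are illustrative and do not reproduce the convolutional layers of the embodiment.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, layer: Callable[[Vector], Vector]) -> Vector:
    """Add the block's input to the layer's output (a skip connection)."""
    y = layer(x)
    if len(y) != len(x):
        raise ValueError("layer must preserve the vector length")
    return [a + b for a, b in zip(y, x)]


# A toy layer that doubles every component.
out = residual_block([1.0, 2.0], lambda v: [2.0 * a for a in v])
```

The next layer therefore sees both the transformed features and the earlier, untransformed ones. If the inner layer outputs zeros, the block reduces to the identity, which is one reason such structures are generally easier to train.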

In some embodiments, the convolutional neural network is an Inception-ResNet network. Inception-ResNet is a convolutional neural network proposed by Google that has a residual structure. When training is based on this network, it is possible to adjust the parameters of only the last layer of the network, which further improves training efficiency.

In step S204, the core feature extraction layer extracts core features from the hidden features. Thus, important features among the hidden features can be further extracted.

In some embodiments, the core feature extraction layer is an attention layer, implemented using an attention mechanism. In implementation, an existing attention-layer module can be used: the hidden features extracted in the previous step are passed in through the module's API, and the core features output by the module are obtained. The attention layer assigns a weight to each sub-feature of the input hidden features through its built-in algorithm, so as to highlight the core information in the hidden features.
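One common way to realize such a weight distribution is dot-product scoring followed by a softmax. The sketch below assumes that scheme and a hypothetical `query` vector; the patent does not specify which attention algorithm the module uses, so this is only an illustration of the general mechanism.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden: Sequence[Vector], query: Vector) -> Tuple[Vector, List[float]]:
    """Weight each hidden sub-feature by its similarity to the query and sum them."""
    scores = [sum(h * q for h, q in zip(vec, query)) for vec in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    core = [sum(w * vec[d] for w, vec in zip(weights, hidden)) for d in range(dim)]
    return core, weights


hidden_features = [[1.0, 0.0], [0.0, 1.0]]
core, weights = attention_pool(hidden_features, query=[10.0, 0.0])
```

With this query the first sub-feature receives almost all of the weight, so it dominates the resulting core feature.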

By the method of the embodiment, hidden and important characteristics in DPI data can be extracted, so that interference information in original data can be removed, and prediction can be performed more accurately.

In some embodiments, the LSTM network-based generator is the generator in a generative adversarial network (GAN), and the generative adversarial network further comprises a discriminator. The generator can therefore be trained in the manner used to train a generative adversarial network. An embodiment of the training method of the present invention is described below with reference to fig. 3.

Fig. 3 illustrates a flow diagram of a training method according to some embodiments of the invention. As shown in fig. 3, the training method of this embodiment includes steps S302 to S310.

In step S302, target data acquired from DPI data for training and a plurality of pieces of training data before a time corresponding to the target data are input into a feature extraction network, and core features of the target data and the plurality of pieces of training data output by the feature extraction network are obtained.

The structure of the feature extraction network may refer to the foregoing embodiments, and will not be described herein.

The target data corresponds to the data to be rectified in an actual correction process. During training, however, the target data has a non-null, non-abnormal value, so that it can be compared with the generated prediction in order to adjust the model.

In step S304, a core feature training sequence of the plurality of pieces of data is constructed according to the corresponding time sequence of the plurality of pieces of training data.

In step S306, the core feature sequence is input into a pre-trained LSTM network-based generator, generating predicted data.

In step S308, the predicted data is input into the feature extraction network, and the core features of the predicted data output by the feature extraction network are obtained.

In step S310, the core features of the predicted data and the core features of the target data are input into the discriminator, so as to train the feature extraction network and the generative adversarial network according to the discrimination result. The discriminator judges whether the generated data is real and outputs the probability that it is real. When this probability is about 0.5, the discriminator cannot tell whether the generated data is real or not; that is, the data produced by the generator is difficult to distinguish from real data. Training may be ended at this point.
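The stopping condition and the usual adversarial objectives can be sketched as follows. The binary cross-entropy losses are the standard GAN formulation rather than something stated in this embodiment, and the `tolerance` parameter is a hypothetical way of expressing "about 0.5".

```python
import math
from typing import Sequence


def _bce(p: float, label: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one probability and one 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def discriminator_loss(p_real: Sequence[float], p_fake: Sequence[float]) -> float:
    """The discriminator should output 1 for target data and 0 for predicted data."""
    return _mean([_bce(p, 1.0) for p in p_real]) + _mean([_bce(p, 0.0) for p in p_fake])


def generator_loss(p_fake: Sequence[float]) -> float:
    """The generator is rewarded when predicted data is judged to be real."""
    return _mean([_bce(p, 1.0) for p in p_fake])


def training_converged(p_fake: Sequence[float], tolerance: float = 0.05) -> bool:
    """True when the discriminator's average output on generated data is about 0.5."""
    return abs(_mean(p_fake) - 0.5) <= tolerance
```

In a training loop, `p_real` and `p_fake` would be the discriminator's outputs for the core features of the target data and of the predicted data respectively, and the loop would stop once `training_converged` returns true.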

By the method of this embodiment, the model can be trained using the dynamic game between the generator and the discriminator, which further improves prediction accuracy. In addition, during training the data at a later moment is predicted from the time-sequence information and compared with the real data at that same moment, so the trained model is suited to actual correction scenarios.

An embodiment of the data rectification apparatus of the present invention is described below with reference to fig. 4.

Fig. 4 is a schematic diagram illustrating a structure of a data rectification apparatus according to some embodiments of the present invention. As shown in fig. 4, the data correction device 40 of this embodiment includes: a feature extraction module 410 configured to input a plurality of pieces of data, acquired from deep packet inspection (DPI) data and preceding the time corresponding to the data to be rectified, into the feature extraction network, and obtain core features of the plurality of pieces of data output by the feature extraction network; a sequence construction module 420 configured to construct a core feature sequence of the plurality of pieces of data according to their corresponding time order; a data generation module 430 configured to input the core feature sequence into a pre-trained long short-term memory (LSTM) network-based generator to generate predicted data; and a deskew module 440 configured to replace the data to be rectified with the predicted data.

In some embodiments, the convolutional neural network has a residual structure.

In some embodiments, the LSTM network-based generator is the generator in a generative adversarial network (GAN), and the generative adversarial network further comprises a discriminator; the data rectification apparatus further comprises: a training module 450 configured to: input target data acquired from DPI data for training and a plurality of pieces of training data before the time corresponding to the target data into the feature extraction network, and obtain core features of the target data and of the plurality of pieces of training data output by the feature extraction network; construct a core feature training sequence of the plurality of pieces of training data according to their corresponding time order; input the core feature sequence into a pre-trained LSTM network-based generator to generate predicted data; input the predicted data into the feature extraction network to obtain core features of the predicted data output by the feature extraction network; and input the core features of the predicted data and the core features of the target data into the discriminator so as to train the feature extraction network and the generative adversarial network according to the discrimination result of the discriminator.

In some embodiments, the data rectification apparatus further comprises: a determining module 460 configured to determine, as the data to be rectified, data with a null value or a field with an abnormal value in the DPI data; the plurality of pieces of data before the time corresponding to the data to be rectified have the same field as the data to be rectified, and their values are neither null nor abnormal.

Fig. 5 shows a schematic structural diagram of a data correction device according to other embodiments of the present invention. As shown in fig. 5, the data rectifying device 50 of this embodiment includes: a memory 510 and a processor 520 coupled to the memory 510, the processor 520 being configured to perform the data deskewing method of any of the preceding embodiments based on instructions stored in the memory 510.

The memory 510 may include, for example, system memory, fixed nonvolatile storage media, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.

Fig. 6 shows a schematic structural diagram of a data correction device according to further embodiments of the present invention. As shown in fig. 6, the data correction device 60 of this embodiment includes a memory 610 and a processor 620, and may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610 and the processor 620 may be connected by, for example, a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB flash drives.

An embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements any one of the foregoing data deskewing methods.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing description of the preferred embodiments is not intended to limit the invention to the precise form disclosed; any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. A method of data deskewing, comprising:

Inputting a plurality of pieces of data before the time corresponding to the data to be rectified, which are obtained from deep packet inspection DPI data, into a feature extraction network to obtain core features of the plurality of pieces of data output by the feature extraction network, wherein the feature extraction network comprises a convolutional neural network and a core feature extraction layer, the convolutional neural network extracts hidden features from the plurality of pieces of data, and the core feature extraction layer extracts core features from the hidden features;

constructing a core feature sequence of the plurality of data according to the corresponding time sequence of the plurality of data;

inputting the core feature sequence into a pre-trained generator based on a long short-term memory (LSTM) network to generate predicted data, wherein the LSTM network-based generator is the generator in a generative adversarial network, and the generative adversarial network further comprises a discriminator;

replacing the data to be rectified with the predicted data;

wherein the data correction method further comprises the following steps:

inputting target data acquired from DPI data for training and a plurality of pieces of training data before the time corresponding to the target data into a feature extraction network, and obtaining core features of the target data and the plurality of pieces of training data output by the feature extraction network;

constructing a core feature training sequence of the plurality of pieces of training data according to the corresponding time order of the plurality of pieces of training data;

inputting the core feature sequence into a pre-trained LSTM network-based generator to generate predicted data;

inputting the predicted data into the feature extraction network to obtain core features of the predicted data output by the feature extraction network;

and inputting the core features of the predicted data and the core features of the target data into the discriminator so as to train the feature extraction network and the generative adversarial network according to the discrimination result of the discriminator.

2. The data deskewing method of claim 1, wherein the convolutional neural network has a residual structure.

3. The data deskewing method of claim 2, wherein the convolutional neural network is an Inception-ResNet network.

4. The data deskewing method of claim 1, wherein the core feature extraction layer is an attention layer.

5. The data deskewing method of claim 1, further comprising:

Determining the data with null values or the fields with abnormal values in DPI data as data to be rectified;

the plurality of pieces of data before the time corresponding to the data to be rectified and the data to be rectified have the same field, and the numerical values are not null and are not abnormal values.

6. A data rectification apparatus comprising:

a feature extraction module configured to input a plurality of pieces of data, acquired from deep packet inspection (DPI) data and preceding the time corresponding to the data to be rectified, into a feature extraction network to obtain core features of the plurality of pieces of data output by the feature extraction network, wherein the feature extraction network comprises a convolutional neural network and a core feature extraction layer, the convolutional neural network extracts hidden features from the plurality of pieces of data, and the core feature extraction layer extracts core features from the hidden features;

a sequence construction module configured to construct a core feature sequence of the plurality of pieces of data according to the corresponding time order of the plurality of pieces of data;

a data generation module configured to input the core feature sequence into a pre-trained generator based on a long short-term memory (LSTM) network to generate predicted data, the LSTM network-based generator being the generator in a generative adversarial network, and the generative adversarial network further comprising a discriminator;

a rectification module configured to replace the data to be rectified with the predicted data; and

a training module configured to: input target data acquired from DPI data for training and a plurality of pieces of training data before the time corresponding to the target data into the feature extraction network, and obtain core features of the target data and of the plurality of pieces of training data output by the feature extraction network; construct a core feature training sequence of the plurality of pieces of training data according to the corresponding time order of the plurality of pieces of training data; input the core feature sequence into a pre-trained LSTM network-based generator to generate predicted data; input the predicted data into the feature extraction network to obtain core features of the predicted data output by the feature extraction network; and input the core features of the predicted data and the core features of the target data into the discriminator so as to train the feature extraction network and the generative adversarial network according to the discrimination result of the discriminator.

7. A data rectification apparatus comprising:

a memory; and

a processor coupled to the memory, the processor configured to perform the data deskewing method according to any one of claims 1-5, based on instructions stored in the memory.

8. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the data deskewing method according to any one of claims 1-5.