CN112529025A - Data processing method and device - Google Patents

Data processing method and device Download PDF

Info

Publication number
CN112529025A
CN112529025A (application CN201910877235.1A)
Authority
CN
China
Prior art keywords
data
data set
index
original
original data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910877235.1A
Other languages
Chinese (zh)
Inventor
陈雷
应江勇
高聪立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201910877235.1A priority Critical patent/CN112529025A/en
Publication of CN112529025A publication Critical patent/CN112529025A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the application relates to a data processing method comprising the following steps: acquiring an original data set; evaluating it with the mean Inception Jensen-Shannon divergence to obtain a first index; inputting data in the original data set to a generative adversarial network and generating first augmented data; combining the first augmented data with the data in the original data set and evaluating again with the mean Inception Jensen-Shannon divergence to obtain a second index; when the second index is larger than the first index, adding the first augmented data to the original data set to obtain a first data set; and replacing the original data set with the first data set. By this method, diverse data can be acquired continuously, yielding a complete and diverse data set. At the same time, the complete training data set so obtained can effectively improve the generalization capability of a network model.

Description

Data processing method and device
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a data processing method and apparatus for continuously acquiring diverse data based on a generative adversarial network.
Background
Current image recognition technology based on machine learning trains on a large number of sample images to obtain a network model with broad recognition ability. In the actual training process, however, some categories may have relatively few collected images, too few to fully reflect the information of those categories. When a network model is trained with sample images from such rare categories, the accuracy of the trained model is low and its transfer capability is weak.
The accuracy and robustness of a model are closely related to its training data, but continuously acquiring data through manual labeling is costly and inefficient. Traditional data enhancement includes simply scaling an image, cropping a region of a specified size around the image center, randomly flipping the image horizontally or vertically, and the like. Such methods can expand the number of samples, but the expanded data is likely to contain considerable redundancy. Current image enhancement is mainly aimed at specific task scenarios: it strengthens useful information in images in order to improve their visual quality and their interpretation and recognition effect, meeting particular analysis requirements for a given application. With current image enhancement methods, the continuously generated data still suffers from data redundancy.
Disclosure of Invention
The embodiment of the application provides a method and an apparatus for acquiring diverse data, which can continuously acquire diverse data and continuously expand an original data set, thereby obtaining a complete and diverse data set.
In a first aspect, a data processing method is provided, the method including: acquiring an original data set; evaluating the original data set with the mean Inception Jensen-Shannon divergence to obtain a first index, the first index expressing the diversity of the original data set; inputting data in the original data set to a generative adversarial network (GAN) and generating first augmented data; combining the first augmented data with the data in the original data set and evaluating with the mean Inception Jensen-Shannon divergence to obtain a second index, the second index representing the diversity of the original data set after the first augmented data is merged in; when the second index is larger than the first index, adding the first augmented data to the original data set to obtain a first data set; and replacing the original data set with the first data set. The mean Inception Jensen-Shannon divergence (Mean_Inception_JS) is the Inception Jensen-Shannon divergence averaged over each datum in a data set, that is, the average distance between the data in the set, and so indicates whether the diversity of the data set is rich. The JS divergence, also called the JS distance, is a symmetric measure of the difference between two probability distributions. A GAN is a deep learning model generally used for data generation; the generated data is very similar to real data and is used for data enhancement.
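The diversity evaluation described above can be sketched in a few lines. The sketch below assumes each datum has already been mapped to a class-probability vector (e.g., by an Inception-style classifier); the function names are illustrative, not the application's actual implementation. It computes the Jensen-Shannon divergence between two distributions, then the average pairwise JS divergence over a data set, which plays the role of the diversity index.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors.

    Uses base-2 logarithms, so the result lies in [0, 1].
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_js_index(prob_vectors):
    """Average pairwise JS divergence over a data set: higher means more diverse."""
    n = len(prob_vectors)
    if n < 2:
        return 0.0
    total = sum(js_divergence(prob_vectors[i], prob_vectors[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)
```

With this convention, a data set whose samples all map to the same class distribution scores 0, while a set of mutually disjoint one-hot distributions scores 1.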
In one possible embodiment, after obtaining the first index, the method further comprises: extracting features from the data of the original data set; detecting the feature-extracted data of the original data set with the Local Outlier Factor (LOF) algorithm to obtain an outlier set; and inputting the data in the outlier set to the GAN to generate the first augmented data. In feature space, the proximity between an outlier and its nearest neighbors significantly deviates from the proximity between the other data in the data set and their own nearest neighbors; the outlier set is the set of all such outliers.
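As a rough illustration of the LOF idea (a minimal pure-Python sketch, not the full algorithm or the application's implementation), the following scores each point by comparing its local reachability density with that of its k nearest neighbors; scores well above 1 mark outliers.

```python
import math

def lof_scores(points, k=2):
    """Minimal Local Outlier Factor sketch: scores well above 1 mark outliers."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

    def knn(i):
        # indices of the k nearest neighbours of point i (excluding itself)
        return sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]

    # distance from each point to its k-th nearest neighbour
    k_dist = [dist[i][knn(i)[-1]] for i in range(n)]

    def reach(i, j):
        # reachability distance of i with respect to j
        return max(k_dist[j], dist[i][j])

    def lrd(i):
        # local reachability density: inverse of the mean reachability distance
        neigh = knn(i)
        return len(neigh) / sum(reach(i, j) for j in neigh)

    dens = [lrd(i) for i in range(n)]
    # LOF: ratio of neighbours' density to the point's own density
    return [sum(dens[j] for j in knn(i)) / (k * dens[i]) for i in range(n)]
```

Points inside a tight cluster score close to 1, while an isolated point far from the cluster scores much higher and would land in the outlier set.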
In one possible embodiment, the loss function of the generative adversarial network GAN has a plurality of parameters, the parameters including the Inception Jensen-Shannon divergence and the kernel maximum mean discrepancy. The Inception Jensen-Shannon divergence (Inception_JS) maps the data to a category space with an Inception network and uses the Jensen-Shannon divergence to measure the distance between data in that category space. The kernel maximum mean discrepancy (Kernel_MMD) uses a kernel function to compute the maximum mean discrepancy, which judges the degree of similarity between two distributions. The data set has high dimensionality in feature space, but its solution uses only inner products; to simplify the computation, if there exists a function in the low-dimensional space that equals the inner product in the high-dimensional space, that function, the kernel function, can be computed directly.
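The core quantity behind the Kernel_MMD term can be sketched as follows, here with a Gaussian (RBF) kernel; the kernel choice and bandwidth are illustrative assumptions, not the application's stated settings. An MMD near zero suggests the two samples come from similar distributions.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of the squared kernel MMD between samples xs and ys."""
    def mean_k(a, b):
        # mean kernel value over all cross pairs of the two samples
        return sum(gaussian_kernel(u, v, sigma) for u in a for v in b) / (len(a) * len(b))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)
```

Identical samples give an MMD of exactly zero, while well-separated samples approach the kernel's upper bound, which is how such a term can penalize generated data that drifts away from the real distribution.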
In one possible embodiment, the method further comprises: for the labels in the original data set, introducing data with the same labels from a public data set; performing feature extraction on the data in the public data set having the same labels; combining that data with the data in the original data set and evaluating with the mean Inception Jensen-Shannon divergence to obtain a third index, the third index representing the diversity of the original data set after the same-label public data is merged in; and when the third index is larger than the first index, taking the data in the public data set with the same labels as second augmented data.
In one possible embodiment, having the same label includes: the labels in the public data set are the same as the labels in the original data set; or the labels in the public data set belong to a subset of the labels in the original data set.
In one possible embodiment, the method further comprises: adding the second augmented data to the original data set to obtain a second data set; and replacing the original data set with the second data set.
In one possible embodiment, the method further comprises: generating third augmented data from the second augmented data with the GAN; combining the third augmented data with the data in the original data set and evaluating with the mean Inception Jensen-Shannon divergence to obtain a fourth index, the fourth index representing the diversity of the original data set after the third augmented data is merged in; when the fourth index is larger than the first index, adding the third augmented data to the original data set to obtain a third data set; and replacing the original data set with the third data set.
In one possible embodiment, the data type of the data in the original data set, the first data set, the second data set and/or the third data set is an image type.
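Taken together, the first-aspect steps amount to an accept-if-diversity-improves loop. The sketch below is a schematic rendering with placeholder callables — `generate` stands in for the GAN and `diversity_index` for the mean Jensen-Shannon evaluation — and is not the application's implementation.

```python
def expand_dataset(dataset, generate, diversity_index, rounds=1):
    """Accept generated data only when it raises the data set's diversity index.

    `generate` maps the current data set to new candidate samples (the GAN's
    role); `diversity_index` scores a data set's diversity (the role of the
    mean Jensen-Shannon evaluation). Both are placeholders.
    """
    for _ in range(rounds):
        first_index = diversity_index(dataset)          # index of the current set
        candidates = generate(dataset)                  # candidate augmented data
        second_index = diversity_index(dataset + candidates)
        if second_index > first_index:
            dataset = dataset + candidates              # replace the original set
    return dataset
```

With a toy diversity index (the number of distinct elements), a generator producing novel samples keeps getting accepted, while one producing duplicates is rejected, mirroring how redundant data is kept out of the expanded set.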
In a second aspect, a data processing apparatus is provided, comprising: an acquisition module for acquiring an original data set; an evaluation module for evaluating the original data set with the mean Inception Jensen-Shannon divergence to obtain a first index, the first index expressing the diversity of the original data set; a first generation module for inputting the data in the original data set to the generative adversarial network GAN and generating first augmented data; the evaluation module further being for combining the first augmented data with the data in the original data set and evaluating with the mean Inception Jensen-Shannon divergence to obtain a second index, the second index representing the diversity of the original data set after the first augmented data is merged in; an adding module for adding the first augmented data to the original data set to obtain a first data set when the second index is larger than the first index; and a replacement module for replacing the original data set with the first data set.
In one possible embodiment, the apparatus further comprises: a feature extraction module for extracting features from the data of the original data set; and a detection module for detecting the feature-extracted data of the original data set with the local outlier factor LOF algorithm to obtain an outlier set; the first generation module further being configured to input the data in the outlier set to the GAN and generate the first augmented data.
In one possible embodiment, the first generation module comprises a generative adversarial network GAN; the GAN employs a loss function having a plurality of parameters, the parameters including the Inception Jensen-Shannon divergence and the kernel maximum mean discrepancy.
In one possible embodiment, the apparatus further comprises: an importing module for importing, for the labels in the original data set, data with the same labels from a public data set; the feature extraction module further being for extracting features from the data in the public data set having the same labels; the evaluation module further being for combining the same-label public data with the data in the original data set and evaluating with the mean Inception Jensen-Shannon divergence to obtain a third index, the third index representing the diversity of the original data set after that data is merged in; and a second generation module for taking the data in the public data set with the same labels as second augmented data when the third index is larger than the first index.
In one possible embodiment, having the same label includes: the labels in the public data set are the same as the labels in the original data set; or the labels in the public data set belong to a subset of the labels in the original data set.
In one possible embodiment, the adding module is further configured to add the second augmented data to the original data set to obtain a second data set; and the replacement module is further configured to replace the original data set with the second data set.
In one possible embodiment, the first generation module is further configured to generate third augmented data from the second augmented data with the GAN; the evaluation module is further configured to combine the third augmented data with the data in the original data set and evaluate with the mean Inception Jensen-Shannon divergence to obtain a fourth index, the fourth index representing the diversity of the original data set after the third augmented data is merged in; the adding module is further configured to add the third augmented data to the original data set to obtain a third data set when the fourth index is larger than the first index; and the replacement module is further configured to replace the original data set with the third data set.
In one possible embodiment, the data type of the data in the original data set, the first data set, the second data set and/or the third data set is an image type.
In a third aspect, a computer-readable storage medium is provided, having instructions stored thereon, wherein the instructions, when executed on a terminal, cause the terminal to perform the method of the first aspect.
In a fourth aspect, there is provided a computer system comprising a processor coupled to a memory, the processor reading and executing instructions in the memory such that the computer system implements the method of the first aspect.
The application discloses a method and a device for acquiring diverse data: an original data set is processed with the local outlier factor LOF algorithm to obtain an outlier set, and new augmented data is generated with a generative adversarial network (GAN). The difference between the generated new data and the real data is controlled by introducing two regularization terms, the Inception Jensen-Shannon divergence and the kernel maximum mean discrepancy, into the loss function of the GAN. Data with the same labels is also acquired from a public data set, and the diversity of the new data is further increased through the mean Inception Jensen-Shannon divergence evaluation. Finally, whether the newly added data gains diversity for the original data set is evaluated through the same mean divergence, so that redundant data is prevented from being added and negative influence on subsequent training is avoided. By this method, diverse data can be acquired continuously, and a complete and diverse data set can be obtained. At the same time, the complete training data set so obtained can effectively improve the generalization capability of a network model.
Drawings
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a block diagram of a data processing system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a data processing system interaction provided by an embodiment of the present application;
fig. 4 is a flowchart of a data processing method according to an embodiment of the present application;
fig. 5 is a schematic diagram of diversity evaluation with the mean Inception Jensen-Shannon divergence according to an embodiment of the present application;
fig. 6a is a schematic diagram of two-dimensional distribution of an original data set according to an embodiment of the present application;
fig. 6b is a schematic diagram of a two-dimensional point set distribution provided in the embodiment of the present application;
fig. 6c is a schematic diagram of a two-dimensional distribution of outlier clusters according to an embodiment of the present application;
fig. 7 is a structural diagram of a generative adversarial network according to an embodiment of the present application;
fig. 8 is a schematic diagram of data generated by a generative adversarial network according to an embodiment of the present application;
FIG. 9 is a comparison diagram of the classification accuracy of a data set according to an embodiment of the present application;
FIG. 10 is a flow chart of another data processing method provided in the embodiments of the present application;
fig. 11 is a flowchart of another data processing method provided in an embodiment of the present application;
fig. 12 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 13 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The application is mainly applied to a data processing platform. The data processing platform is located in an artificial intelligence system, and the artificial intelligence system is explained from two dimensions of an intelligent information chain and an IT value chain.
The "intelligent information chain" reflects the sequence of processes from the acquisition of data onward: for example, intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process the data undergoes a "data-information-knowledge-wisdom" refinement.
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (the provision and processing of technology) up to the industrial ecology of the system.
(1) Infrastructure:
The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through a basic platform. Communication with the outside is performed through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs and FPGAs); and the basic platform comprises a distributed computing framework, networks, and other related platform assurances and support, and may comprise cloud storage and computing, interconnection networks, and the like. For example, sensors and external communication acquire data, which is provided for computation to intelligent chips in a distributed computing system provided by the basic platform.
(2) Data
The data at the level above the infrastructure represents the data sources of the artificial intelligence field. The data involves graphics, images, speech, and text, as well as Internet-of-things data from traditional devices, including service data of existing systems and sensing data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on data.
Inference is the process of simulating intelligent human inference in a computer or intelligent system, using formalized information to let a machine think about and solve problems according to an inference control strategy; typical functions are searching and matching.
Decision making refers to the process of making decisions after reasoning over intelligent information, and generally provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the above data processing, some general capabilities may further be formed based on its results, such as algorithms or a general system: for example, translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Intelligent product and industrial application
Intelligent products and industry applications are the products and applications of an artificial intelligence system in various fields; they encapsulate the overall artificial intelligence solution, commercialize intelligent information decision making, and realize practical deployment. The application fields mainly comprise intelligent manufacturing, intelligent transportation, smart home, intelligent medical treatment, intelligent security, automatic driving, safe cities, intelligent terminals, and the like.
The data processing platform belongs to the data processing part and lies upstream of model training; its purpose is to provide original training data for steps such as machine learning and deep learning so as to train a data model meeting expectations. As shown in fig. 1, fig. 1 is a schematic view of an application scenario provided in the embodiment of the present application. The data processing platform is an intelligent data platform with human-computer cooperation, offering higher construction efficiency, faster training, and stronger models. The platform may include data pre-processing, automatic data labeling, automatic data enhancement, or a plug-in data processing framework. The present application is mainly applied to the automatic data enhancement part outlined by the broken line.
In some schemes, a small-category sample set in an original data set is expanded mainly by a generative adversarial network (GAN), so as to improve image classification accuracy when the data volumes of different categories in the data set are unbalanced. The sample data in the original data set comprises one or more categories; a category whose samples occupy only a very small part of the whole original data set may be called a small category, and the set of such samples a small-category sample set. First, the small-category sample images in the data set are selected to obtain an original small-category sample image training set. This training set is then input to the GAN to generate small-category image samples, that is, images generated by the GAN; the generated image samples are added to the original small-category training set to obtain a generated small-category sample image training set. Finally, the image classification network is trained with the original small-category training set and then with the generated small-category training set, yielding an optimal image classification network that classifies input images. However, in this scheme, generating images with a GAN still requires a large amount of data to produce an image set of good quality. If the small number of small-category images is used directly, the quality of the image set generated by the GAN is poor, and the improvement in classification accuracy is very limited.
To address these problems, a scheme capable of automatically and continuously acquiring diverse data is designed, which greatly improves the data enhancement function in the data framework and solves the data redundancy problem of other schemes.
As shown in fig. 2, fig. 2 is a schematic diagram of a data processing system according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a data processing system architecture, and as shown in fig. 2, the data auto-enhancement part applied in the present application may be located in a server. The software layer located on the server may include the data set diversity evaluation means 201, GAN generation data means 202 and public data set importing means 203 referred to herein. The hardware layer located on the server may include a processor 204 and a memory 205. The memory 205 may store the original data set and the new data set after continuously updating the diversity data, and may also store computer program code for performing the methods of the present application. The processor 204 is used for reading the corresponding program code from the memory to execute the method of the present application, and at the same time, the processor 204 reads the original data set from the memory 205, so that the data set diversity evaluation apparatus 201 of the software layer can perform evaluation according to the read original data set. It should be noted by those skilled in the art that the processor 204 located at the hardware layer may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), or any other special-purpose chip. And a memory, which may include an internal memory and an external memory.
For a conventional data processing platform, data to be processed may be obtained from another data source through file transfer protocol (ftp), or data may be read from a certain file, a database, or from a memory. According to the scheme provided by the application, only the data source of the data processing platform needs to be updated to the interface of the data automatic enhancement module. The method and the device realize continuous acquisition of diversity data and continuous expansion of data in a data set.
As shown in fig. 3, fig. 3 is an interaction diagram of a data processing system according to an embodiment of the present application.
Fig. 3 shows a system interaction schematic of the data set diversity evaluation apparatus 201, the public data set importing apparatus 203 and the GAN generation data apparatus 202 in fig. 2.
The data set diversity evaluation apparatus 201 includes a first storage module 2011 and a first mean Inception Jensen-Shannon divergence module 2012; the GAN generation data apparatus 202 includes a local outlier factor (LOF) module 2021, a GAN module 2022, and a second mean Inception Jensen-Shannon divergence module 2023; and the public data set importing apparatus 203 includes a second storage module 2031 and a third mean Inception Jensen-Shannon divergence module 2032. In one example, the GAN module 2022 may further include an Inception Jensen-Shannon divergence submodule 20221 and a kernel maximum mean discrepancy submodule 20222. In another example, the system further includes a third storage module 2041. The first storage module 2011 retrieves and stores the data of the original data set from the memory 205 shown in fig. 2; the second storage module 2031 retrieves and stores the data of the public data set from the memory 205 shown in fig. 2; and the third storage module 2041 stores the data of the new data set.
The data set diversity evaluation apparatus 201 first determines the diversity of the original data set through the first mean Inception Jensen-Shannon divergence module 2012 and then performs feature extraction on the data of the original data set. The feature-extracted data is then input to the LOF module 2021 in the GAN generation data apparatus 202. The LOF module 2021 performs density-based outlier detection on the input data and determines an outlier set. The LOF module then inputs the outlier set to the GAN module 2022 so that the GAN module generates the first augmented data X'. To ensure that the generated data is very close to the real data, during the training of the GAN module 2022 the data is continuously corrected by the Inception Jensen-Shannon divergence submodule and the kernel maximum mean discrepancy submodule. Meanwhile, for labeled data in the original data set, the data set diversity evaluation apparatus 201 may determine the labels in the original data set and transmit the label information to the public data set importing apparatus 203. The public data set importing apparatus 203 extracts data having the same labels from the public data set based on the labels of the original data set; the same labels may be identical labels, or the labels in the public data set may belong to a subset of the labels in the original data set. The extracted data is taken as the second augmented data X". X', X", or X' + X" is then added to the original data set, forming a new data set. Finally, the data set diversity evaluation apparatus 201 determines whether the data diversity of the new data set is greater than that of the original data set. If the new data set is more diverse, it replaces the original data set; in other words, X', X", or X' + X" is added to the original data set.
In this way, the original data set is continuously expanded while the diversity of the expanded data set is kept high.
The technical solutions in the embodiments of the present application will be described in detail below with reference to the drawings in the embodiments of the present application.
As shown in fig. 4, fig. 4 is a flowchart of a data processing method according to an embodiment of the present application.
As shown in fig. 4, the present application discloses a method of data processing, which describes the process of generating diversity data from a data set diversity evaluation apparatus 201 via a GAN generation data apparatus 202 and expanding to an original data set as shown in fig. 3. The method may comprise the steps of:
S401, acquiring data of the original data set.
In one embodiment, the data set diversity evaluation apparatus first obtains the data of the original data set. In one example, the data set diversity evaluation apparatus 201 reads the data of the original data set from the memory and stores the data in the first storage module 2011. In yet another example, the data type of the data in the original data set may be a picture.
S402, evaluating the original data set by adopting the mean Inception Jensen-Shannon divergence to obtain a first index.
In one embodiment, the data set diversity evaluation apparatus 201 evaluates the original data set using the mean Inception Jensen-Shannon divergence to obtain a first index. The first index represents the diversity of the original data set; the higher the index value, the higher the diversity of the original data set. In one example, the first index may be denoted M1.
In one example, the data set diversity evaluation apparatus 201 may first evaluate the diversity of the original data set by using the mean Inception Jensen-Shannon divergence (Mean_Inception_JS).
In one embodiment, high-dimensional data typically contains redundancy, and the "curse of dimensionality" phenomenon is common. The curse of dimensionality refers to the various problems encountered when analyzing and organizing data in a high-dimensional space, where the volume grows exponentially as the spatial dimension increases. In the space of the original sample set, the distance between samples of the same class is often larger than the distance between samples of different classes, so the similarity between data cannot be measured directly by distance. Some schemes use a dimensionality-reduction method such as principal component analysis (PCA) or robust principal component analysis (RPCA), but the disadvantages are obvious: real data are complex, linear mappings have very limited expressive power, and such methods are not suitable for nonlinear data. A conventional nonlinear mapping, for example using a kernel function, can obtain the nonlinear distance between samples, but the specific high-dimensional space the data are projected into cannot be known, which results in poor applicability.
In one example, convolutional neural networks perform well in many computer vision areas such as image recognition and classification. A convolutional neural network is a feedforward neural network whose artificial neurons respond to part of the surrounding units within a coverage range, and it performs excellently on large-scale image processing. The Inception network is a convolutional neural network model with both width and depth; it has strong classification capability and is quite practical. The Inception network provides an effective feature-extraction scheme: for high-dimensional image data, an effective representation of an image in the class space can be obtained through the mapping of the Inception network, reducing the redundant dimensions of the data. A class space may be understood as a collection of multiple classes. The present application maps data to the class space using an Inception network and defines a measurement distance between data in the class space, namely Inception_JS (Inception Jensen-Shannon divergence).
In one embodiment, the data of the original data set is passed through a pre-trained Inception network in a feedforward computation, and the output vector of the last layer of the Inception network, namely the softmax layer, is obtained. Each element in the output vector represents the probability that the data input to the Inception network belongs to a particular class, so the entire vector can be regarded as a discrete probability distribution density function. Therefore, the distance between probability distributions can be measured using the Jensen-Shannon (JS) divergence. In one example, the JS divergence is a symmetric metric that measures the difference between two probability distributions. Thus, after the data of the original data set pass through the Inception network to obtain probability distribution vectors in the class space, the distance between the probability distributions can be measured by the JS divergence, which gives Inception_JS. In one example, Inception_JS can be expressed as follows,
Incep_JS(x, y) = (1/2) KL(p_x ‖ (p_x + p_y)/2) + (1/2) KL(p_y ‖ (p_x + p_y)/2)

p_x = Inception(x), p_y = Inception(y)

where Inception(·) denotes the Inception network, p_x is the probability distribution vector in the class space obtained by mapping the data x through the Inception network, p_y is the probability distribution vector obtained by mapping the data y through the Inception network, and KL(·‖·) denotes the Kullback-Leibler divergence. Incep_JS(x, y) is the computed distance between the two samples x and y.
When the diversity of a data set of the same category is to be measured, the Inception_JS distance between any two samples in the data set is calculated, and the average value is then taken. Thus, the present application defines the mean of the Inception_JS divergence over any two samples of the data set, namely Mean_Inception_JS. In one example, Mean_Inception_JS can be expressed as follows,
Mean_Inception_JS = E_{(x, y) ~ p_r} [ Incep_JS(x, y) ]

where (x, y) ~ p_r denotes any two sample data drawn from the data set.
It can be understood that Mean_Inception_JS describes the diversity of the data set: the larger the value of Mean_Inception_JS, the higher the diversity of the data set.
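The two definitions above can be sketched in plain Python. Here `probs` stands in for the softmax output vectors of a pre-trained Inception network; the network itself and the feature extraction are assumed, not implemented, and the function names are illustrative.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete probability distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))  # KL(a || b)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_inception_js(probs):
    """Average pairwise Incep_JS over a data set.

    `probs` is an (n, k) array; each row is assumed to be the softmax
    output of the Inception network for one sample.
    """
    n = len(probs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(js_divergence(probs[i], probs[j]) for i, j in pairs) / len(pairs)
```

A data set whose samples all map to the same class distribution scores near zero, while one whose samples spread over many classes scores higher, matching the interpretation of Mean_Inception_JS in the text.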
As shown in fig. 5, fig. 5 is a schematic diagram of diversity evaluation using the mean Inception Jensen-Shannon divergence according to an embodiment of the present application. It can be seen that the Inception_JS divergence provides an effective measurement. When a data set is evaluated with the number of categories held constant, the value of Mean_Inception_JS grows as the number of samples increases. The abscissa in fig. 5 represents the number of samples, and the ordinate represents the value of Mean_Inception_JS. When the number of evaluated samples reaches a certain value, the diversity of the data set reaches saturation, and the diversity values accordingly approach a plateau. Evidently, Mean_Inception_JS is very effective for evaluating data set diversity.
And S403, performing feature extraction on the data of the original data set.
In one embodiment, the data diversity evaluation device 201 will perform feature extraction on the data of the original data set.
In one example, the data set diversity evaluation apparatus 201 performs feature extraction on the data in the original data set. Feature extraction is a concept in vision and image processing; it refers to extracting image information with a terminal device and converting it into a digital vector representation. Because the data in the original data set are complicated and may contain a large amount of irrelevant information, extracting the corresponding features of the data omits the information irrelevant to the features and reduces the dimensionality of the data, thereby reducing the amount of subsequent computation.
In one example, after S403, S405' (not shown) may be performed directly.
S405', first expansion data is generated by using the original data set.
In one embodiment, the GAN module 2022 in the GAN generation data device 202 generates the first augmented data using the data in the raw data set. Wherein the first augmented data is generated from the original data set by GAN. In one example, the first augmented data may be represented by X'.
S406 may be performed after S405'.
In one example, because the data in the original data set are various and complex, there is a certain probability that the generated data cannot effectively increase the diversity of the original data set; an additional step can be added to generate better data. The present application may thus also perform S404 after S403.
S404, detecting the data of the original data set after feature extraction by adopting a local outlier factor LOF algorithm to obtain an outlier set.
In one embodiment, the LOF module 2021 in the GAN generating data apparatus 202 performs outlier detection on the data of the original data set after feature extraction by using a local outlier factor LOF algorithm, so as to obtain an outlier set.
In one example, the LOF module 2021 in the GAN generation data apparatus 202 receives the feature-extracted data of the original data set sent by the data set diversity evaluation apparatus 201, and then performs outlier detection using the LOF algorithm, so as to detect all outlier data in the feature-extracted data of the original data set and form the detected outliers into an outlier set. It should be noted by those skilled in the art that, in the feature space, the proximity between an outlier and its nearest neighbors deviates significantly from the proximity between the other data in the data set and their own nearest neighbors. The outlier set is the set of all outliers; outliers can generally be regarded as the samples in the data set that are few in number and easily cause a classification method to make a wrong judgment.
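The density-based detection performed by the LOF module can be sketched in plain NumPy. This is a minimal implementation of the standard LOF score (k-distance, reachability distance, local reachability density, then the ratio of neighbor densities to a point's own density); the threshold value is an assumed cutoff for illustration, not one stated in the text.

```python
import numpy as np

def lof_scores(points, k=3):
    """Local outlier factor of each row of `points` (plain-NumPy sketch)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # exclude self-distances
    knn = np.argsort(d, axis=1)[:, :k]               # k nearest neighbors of each point
    idx = np.arange(len(points))
    k_dist = d[idx, knn[:, -1]]                      # distance to the k-th neighbor
    # reachability distance of p w.r.t. neighbor o: max(k_dist(o), d(p, o))
    reach = np.maximum(k_dist[knn], d[idx[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)                   # local reachability density
    return lrd[knn].mean(axis=1) / lrd               # LOF: neighbor density vs. own

def outlier_set(points, k=3, threshold=1.5):
    """Rows whose LOF exceeds `threshold` (assumed cutoff) form the outlier set."""
    return points[lof_scores(points, k) > threshold]
```

Points inside a dense cluster score close to 1, while an isolated point far from the cluster scores much higher and lands in the outlier set that is subsequently fed to the GAN module.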
In another example, as shown in fig. 6a, fig. 6a is a schematic diagram of a two-dimensional distribution of an original data set according to an embodiment of the present application. First, for all pictures in the original data set, the data are mapped to a two-dimensional plane using isometric feature mapping (Isomap). Isomap is an unsupervised algorithm for nonlinear dimensionality reduction. Each picture in the original data set is mapped to a point, forming the point set shown in fig. 6b. Fig. 6b is a schematic distribution diagram of the two-dimensional point set provided in this embodiment: after the feature-extracted data are mapped to the two-dimensional plane, points of the same category aggregate into a point set of that category; for example, the shades of different colors in fig. 6b represent point sets of different categories. The point set in fig. 6b then undergoes outlier detection by the LOF algorithm, resulting in the outlier two-dimensional distribution shown in the right-hand graph of fig. 6c. Fig. 6c is a schematic diagram of the two-dimensional distribution of the outlier set according to the embodiment of the present application; the left-hand graph of fig. 6c plots all the points of the right-hand graph with the X-axis as the sample point id and the Y-axis as the outlier factor. It can be seen that, taking the Y-axis value 250 as the boundary, the four points above it are outliers, which correspond to the four outliers shown in the right-hand graph of fig. 6c against a light gray circular background. It should be noted that fig. 6a to 6c only illustrate 20 categories selected from the original data set.
Continuing back to fig. 4.
S405, first expansion data is generated by using the outlier set.
In one embodiment, the GAN module 2022 in the GAN generation data apparatus 202 generates the first augmented data using the outlier set obtained by the LOF module 2021. The first augmented data are generated from the outlier set via the GAN. In one example, the first augmented data may be denoted X'.
In some schemes, an image enhancement method for supervised learning applications is implemented. Supervised learning is the process of adjusting the parameters of a classifier using a set of samples of known classes to achieve the required performance. First, the original data set to be enhanced and its annotation information are acquired; then, data enhancement parameters are set according to the supervised learning task; next, data enhancement processing is performed on the image data and the label data in the original data set, respectively; then, the enhanced images and annotation information are screened; finally, the screened images and label information are stored in the original data set. Such schemes offer many kinds of data enhancement, and the specific choice of enhancement mode is strongly correlated with the specific task: it requires manual setting and cannot be automated. Meanwhile, the various parameters in the various enhancement modes need to be set and cannot be adjusted adaptively.
Therefore, the present application adopts a GAN that learns by itself and automatically corrects its parameters continuously through a loss function, thereby achieving adaptive adjustment. A loss function maps the value of a random event or its related random variable to a non-negative real number to represent the "risk" or "loss" of the random event. In the field of machine learning, a loss function evaluates the degree to which a particular algorithm models the given data: if the predicted value deviates far from the actual result, the loss function yields a very large value, and with the aid of an optimization function, the model gradually learns to reduce the error of the predicted value. In general, a GAN includes two network structures: a generator and a discriminator, which may also be referred to as the generation network and the discrimination network. The generation network is a network structure that uses random variables to generate images; ideally the generated images are so similar to real images that they cannot be distinguished. The discrimination network is a metric network used to distinguish real images from generated images. In the ideal steady state of a GAN, a Nash equilibrium is reached, so that the probability distribution function of the generated images approximates the true distribution. However, a GAN still has many potential problems: for example, in a high-dimensional space, the real image distribution function and the generated image distribution function usually do not intersect; in this case a common distance function such as the relative entropy is usually constant, causing the gradient to take a random direction, so that training is unstable and convergence is difficult. The relative entropy, also called the Kullback-Leibler (KL) divergence, is an asymmetric measure of the difference between two probability distributions.
An existing GAN has many defects in the task of synthesizing real scenes, such as high training difficulty, an unstable training process, low quality of generated images, and weak semantic information.
A GAN is an unsupervised learning method that learns by having two neural networks play a game against each other. For example, as shown in fig. 7, fig. 7 is a diagram of a generative adversarial network structure provided in an embodiment of the present application. The GAN is composed of a generation network G and a discrimination network D. In one example, the GAN samples randomly from a latent space to obtain random data as the input of the generation network G, where the latent space is the feature space of the high-dimensional data. The output of the generation network G must mimic the real sample data in the training set as closely as possible. The discrimination network takes as input both the real data and the output of the generation network, and judges them so as to evaluate the output of the generation network. During GAN training, the loss function is continuously adjusted, and the parameters are continuously tuned through the mutual confrontation of the two networks, until the discrimination network can no longer tell whether the data output by the generation network are real or fake.
As mentioned above, because the real image distribution function and the generated image distribution function usually do not intersect in a high-dimensional space, the training process of a GAN is unstable, making training difficult and easily producing blurred, low-quality images. In this case, the distance function in the GAN is usually constant and results in a random gradient direction, unstable training, and difficult convergence. The distance function here is generally the function, converted from the loss function, that measures the distance between the images generated by the generation network and the real images so as to find the maximum value of the discrimination network D.
The present application therefore constructs a new loss function by adding Inception_JS and a regularization term of the kernel maximum mean discrepancy (Kernel_MMD) to control the degree of difference between the real image probability distribution and the generated image probability distribution. Overfitting is a phenomenon that often occurs when training data are insufficient or the model is overtrained; the regularization term introduces extra information into the original model to prevent overfitting and improve the generalization performance of the model. Meanwhile, the similarity of the two probability distributions can be measured through Inception_JS, the calculation is simple, and the asymmetry problem of the KL divergence is solved.
In one example, the GAN module 2022 generates new adversarial images, i.e., the first augmented data, through the GAN using the outlier set output by the LOF module 2021. In another example, the GAN module 2022 may further include an Inception Jensen-Shannon divergence sub-module 20221 and a kernel maximum mean discrepancy sub-module 20222, where the Inception Jensen-Shannon divergence sub-module 20221 may employ Inception_JS and the kernel maximum mean discrepancy sub-module 20222 may employ Kernel_MMD to control the difference between the generated images and the real images.
In another example, the loss function constructed in the GAN module 2022 can be expressed as follows,
max_D V(D) = E_{x ~ p_r} [log D(x)] + E_{z ~ p_z} [log(1 − D(G(z)))]

min_G L(G) = E_{z ~ p_z} [log(1 − D(G(z)))] + β · Incep_JS(p_r, p_g) + γ · K_MMD(p_r, p_g)

where z ~ p_z represents the random data, x ~ p_r represents the real data, G(·) represents the output data of the generation network G, D(·) represents the output data of the discrimination network D, and K_MMD(·,·) is the Kernel_MMD described above. β and γ are hyperparameters in [0, 1]; a hyperparameter is a parameter that is set in advance and does not change during training.
In one example, the objective function of Kernel_MMD may be used to control the difference between the generated image density distribution function p_g and the real image density distribution function p_r. The generated image density distribution function represents the density distribution over the images generated by the GAN; the real image density distribution function represents the density distribution over the real images. For example, Kernel_MMD can be expressed by the following formula,

K_MMD^2(p_r, p_g) = E_{x, x' ~ p_r} [K(x, x')] − 2 E_{x ~ p_r, y ~ p_g} [K(x, y)] + E_{y, y' ~ p_g} [K(y, y')]

where K(·,·) is a kernel function; a Gaussian kernel, a Laplacian kernel, or the like can generally be used. Kernel functions include the linear kernel, the polynomial kernel, and the Gaussian kernel; the Gaussian kernel maps data to an infinite-dimensional space and is also called the radial basis function kernel. It should be noted, of course, that any equivalent alternative function may be used, and the application is not limited thereto.
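The squared Kernel_MMD above can be sketched directly in NumPy using a Gaussian kernel; the bandwidth `sigma` is an assumed parameter, and in practice it would be tuned or averaged over several values.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of a and b."""
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_mmd2(x, y, sigma=1.0):
    """Squared kernel MMD between samples x ~ p_r and y ~ p_g."""
    kxx = gaussian_kernel(x, x, sigma)   # E[K(x, x')]
    kyy = gaussian_kernel(y, y, sigma)   # E[K(y, y')]
    kxy = gaussian_kernel(x, y, sigma)   # E[K(x, y)]
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()
```

Identical sample sets give a discrepancy of zero, and the value grows as the two distributions drift apart, which is exactly the behavior the regularization term exploits to keep p_g close to p_r.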
In one embodiment, the GAN includes two network structures, the generation network G and the discrimination network D, which both need to follow some design principles, and may include, for example:
(1) The network structure of the discrimination network D cannot use pooling layers; strided convolutional layers are used instead. The network structure of the generation network G needs to use transposed convolution layers (ConvTranspose layers).
(2) Both the generation network G and the discrimination network D need to use batch normalization layers, which alleviates the shift in the distribution of internal node data caused by changes in the network parameters during training; fully connected layers are avoided as much as possible.
(3) The intermediate-layer activation function of the generation network G may use the rectified linear unit (ReLU), and the last layer of the generation network G uses the Tanh activation function.
(4) Each layer in the discrimination network D uses an optimized ReLU function, namely the LeakyReLU function.
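As a minimal sketch of design principles (3) and (4), the three activation functions involved can be written in plain NumPy. The LeakyReLU slope of 0.2 is a common choice assumed here, not a value stated in the text.

```python
import numpy as np

def relu(x):
    """ReLU, used in the intermediate layers of the generation network G."""
    return np.maximum(0.0, x)

def leaky_relu(x, negative_slope=0.2):
    """LeakyReLU, used in every layer of the discrimination network D;
    unlike ReLU it keeps a small gradient for negative inputs."""
    return np.where(x >= 0, x, negative_slope * x)

def tanh(x):
    """Tanh, used in the last layer of G; squashes outputs into (-1, 1)."""
    return np.tanh(x)
```

The Tanh output range matching the usual [-1, 1] normalization of image pixels is the reason it sits on the last layer of G, while LeakyReLU's nonzero negative slope helps gradients flow through D.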
In one example, to generate a 64 x 64 image, the input of the generation network G may be a 100-dimensional random vector. In another example, the input may be the outlier set output by the LOF module. The specific network structures of the generation network G and the discrimination network D may be as shown in table 1.
TABLE 1: network structures of the generation network G and the discrimination network D
It can be seen that in the generation network G, the 100-dimensional random vector input to the first ConvTranspose layer is repeatedly convolved through the BatchNorm, ReLU, and Tanh layers, and data with 3 channels are finally output. The output of the generation network is then taken as the input of the discrimination network D, which, through repeated convolution with BatchNorm, LeakyReLU, and Sigmoid layers, finally outputs data of dimension 1, i.e., the judgment of whether the input of the discrimination network D is real or fake. In the network structure shown in table 1, the convolution kernels of the convolutional layers may be 4 x 4; of course, those skilled in the art should note that the kernel size may be another combination, and the present application is not limited herein.
As shown in fig. 8, fig. 8 is a schematic diagram of generating data of a countermeasure network according to an embodiment of the present application. As can be seen from fig. 8, the picture generated by the GAN module 2022 ideally should be very similar to the real picture, i.e. the original picture, so that the discrimination network D cannot discriminate the real situation of the picture generated by the generation network G.
In the present application, new data are generated by inputting the outlier set into the GAN; since the outlier set is computed from the original data set, this solves the problems that some current GANs are only suitable for specific task scenarios and lack generality.
Now, we proceed back to fig. 4.
S406, combining the first expansion data and the data in the original data set, and evaluating by adopting the mean Inception Jensen-Shannon divergence to obtain a second index.
In one embodiment, the GAN generation data apparatus 202 combines the first augmented data generated by the GAN module 2022 with the original data, and the second mean Inception Jensen-Shannon divergence module 2023 then evaluates the combination using the mean Inception Jensen-Shannon divergence to obtain a second index. The second index represents the diversity of the original data set combined with the first augmented data.
In one example, the GAN generation data apparatus 202 superimposes the first augmented data generated by the GAN module 2022 on the data in the original data set, and the second mean Inception Jensen-Shannon divergence module 2023 can then use Mean_Inception_JS for diversity evaluation to obtain the second index. The second index may be denoted M2.
S407, determining whether the second index is larger than the first index.
In one embodiment, the second mean Inception Jensen-Shannon divergence module determines whether the second index is greater than the first index; when it is, the diversity of the original data set after adding the first augmented data is deemed significantly improved, and S408 continues. In one example, when M2 is greater than M1, adding the first augmented data generated by the GAN module 2022 is considered to significantly improve the diversity of the original data set. In this way, the new data added to the original data set continuously maintain the diversity of the data set, and the diversity is not reduced by the introduction of new data, thereby avoiding the problem of data redundancy.
In another embodiment, if the second index is determined to be smaller than or equal to the first index, the diversity of the original data set after adding the first augmented data is deemed not significantly improved; the process may then return to S404, where a threshold in the LOF algorithm is changed to obtain a new outlier set, from which new augmented data are computed.
S408, adding the first expansion data into the original data set to obtain a first data set.
In one embodiment, first augmented data is added to the original data set and a first data set is formed. In one example, the first data set may also be referred to as a new data set.
S409, the original data set is replaced with the first data set.
In one embodiment, the original data set is replaced with the first data set. In one example, since the added first augmented data enhance the diversity of the original data set, the first augmented data are added to the original data set to form a new data set, and the original data set is replaced with the new data set. In another example, the first augmented data may also be added directly to the original data set to form a new original data set, which is used as the original data set in subsequent operations. It should be noted by those skilled in the art that the core purpose of S409 is to add the generated first augmented data to the original data set so as to continuously augment it; the application is not limited with respect to the specific alternative steps.
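The diversity-gated acceptance logic of S406 to S409 can be sketched as follows. The `diversity` function here is a toy proxy (mean pairwise Euclidean distance) standing in for Mean_Inception_JS, which would require the Inception network; the structure of the accept-or-reject decision is what the sketch illustrates.

```python
import numpy as np

def diversity(data):
    """Toy stand-in for Mean_Inception_JS: mean pairwise distance."""
    n = len(data)
    if n < 2:
        return 0.0
    dists = [np.linalg.norm(data[i] - data[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

def try_augment(original, candidate):
    """Accept candidate augmented data only if it raises the diversity index.

    Returns (resulting data set, accepted flag): the combined set when the
    second index M2 exceeds the first index M1, otherwise the original set.
    """
    m1 = diversity(original)                 # first index (S402)
    combined = np.vstack([original, candidate])
    m2 = diversity(combined)                 # second index (S406)
    return (combined, True) if m2 > m1 else (original, False)
```

Candidates far from the existing samples raise the index and are kept, while near-duplicates lower it and are discarded, mirroring the redundancy check described in the text.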
In one example, the data of the data set are classified by a classifier, and the completeness and diversity of the training data set are verified by the classification accuracy of the classifier. A classifier is a classification function, or a constructed classification model, learned and generated on the basis of existing data. For a complete training data set with high diversity, the classifier trained on it can be ensured to reach high classification accuracy. For example, fig. 9 is a comparison diagram of data set classification accuracy provided in an embodiment of the present application. Fig. 9 shows an original data set, an enhanced data set expanded by a conventional data enhancement method, and an enhanced data set expanded by the method shown in fig. 4 of the present application; the different data sets are each used to train an extreme gradient boosting (XGBoost) classifier, and the test accuracy curves are obtained. The abscissa of fig. 9 represents the number of iterations of the training process, and the ordinate represents the classification accuracy. As shown in fig. 9, the first plot line shows the accuracy curve of the original data set, the second plot line shows the accuracy curve of the conventional enhanced data set, and the third plot line shows the accuracy curve of the enhanced data set of the present application. The original data set has 40000 sample data, the conventional enhanced data set has 50000 sample data, and the enhanced data set of the present application has 50000 sample data. It can be seen that continuously expanding the original data set by the method disclosed in the present application both expands the data in the data set and ensures its diversity. After the classifier is trained with the expanded data set, the accuracy of the classifier can be greatly improved.
For the same number of iterations, the accuracy of the classifier trained on the enhanced data set greatly exceeds that of the classifier trained on the original data set. And for the same 50000 sample data, the accuracy of the classifier trained on the enhanced data set of the present application still exceeds that of the classifier trained on the conventional enhanced data set. It can be seen that expanding the original data set by the method disclosed in the present application gives the data set high completeness and can effectively improve the generalization capability of the classifier.
The present application introduces the Inception_JS metric and Kernel_MMD as regularization terms into the loss function of the GAN to evaluate the similarity between the images generated by the generation network G and the real images, achieving a stable training process. Meanwhile, Mean_Inception_JS is used to evaluate whether the generated images are suitable to be added to the original data set: newly generated data are added only when adding them to the original data set increases its Mean_Inception_JS. This solves the problem of data redundancy, and because data are added to the original data set only when they increase its diversity, it also solves the problem of expanding the diversity of the data.
As shown in fig. 10, fig. 10 is a flowchart of another data processing method provided in the embodiment of the present application.
As shown in fig. 10, after S402, if the data in the original data set has a tag, the following steps may be further performed:
S1001, aiming at the labels in the original data set, introducing the data with the same labels from the public data set.
In some schemes, an image data enhancement scheme is implemented as follows. First, an image to be enhanced and an auxiliary image are obtained, where the two images have the same or similar data distributions. Then, preset color-channel superposition is performed on the image to be enhanced using the auxiliary image. Finally, the image obtained by superposing the preset color channels is determined to be the generated enhanced image. A channel that holds image color information is called a color channel. However, in this scheme, simple color-channel superposition may cause the original features of the image to vary too much and possibly change the semantic information of the image. Meanwhile, color-channel superposition can introduce information irrelevant to the category itself, thereby reducing the classification accuracy of the neural network.
Therefore, the data can be expanded by identifying the class labels in the original data set, so that the imported data does not affect the classification accuracy of the neural network.
In one embodiment, if the data in the original data set has tags, the public data set importing apparatus may extract data having the same tags from the public data set as the imported new data. Here, the same tag means that the tag of the data in the public data set is identical to the tag of the data in the original data set; in another example, the same tags may also include similar tags, i.e., the tags of the data in the public data set belong to a subset of the tags of the data in the original data set. In one example, the data of the original data set is labeled "light," and the data imported from the public data set is also labeled "light," or may carry a label from a subset of "light," such as "desk lamp," "floor lamp," or "ceiling lamp." It should be noted by those skilled in the art that, whether identical or similar, such labels may both be referred to herein as the same labels.
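The "identical or subset" label rule described above can be sketched as follows. The subset relation (e.g., "desk lamp" under "light") is supplied as a hypothetical mapping, since the application does not specify how the label hierarchy is stored:

```python
# hypothetical label hierarchy: parent label -> set of sub-labels
LABEL_SUBSETS = {
    "light": {"desk lamp", "floor lamp", "ceiling lamp"},
}

def has_same_label(public_label, original_label):
    # identical labels match, and so do labels belonging to a subset
    # of the original label (both count as "the same label" here)
    if public_label == original_label:
        return True
    return public_label in LABEL_SUBSETS.get(original_label, set())
```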
In one example, if the data in the original data set has tags, before performing S1001, diversity evaluation may also be performed in S402 using Mean_Inception_JS for the data of each tag category in the original data set. Then, S1001 is performed separately for the data of each tag category.
S1002, perform feature extraction on the data having the same label in the public data set.
In one embodiment, feature extraction is performed on the data having the same label in the public data set.
In one example, the public data set importing apparatus 203 performs feature extraction on the data having the same tag in the public data set. In this way, information irrelevant to the features in the data set is discarded, and dimensionality reduction of the data is achieved so as to reduce the amount of subsequent computation.
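In the described system the features would come from an Inception network; as a self-contained stand-in, the sketch below performs the same kind of dimensionality reduction with PCA via SVD. The use of PCA here is an illustrative assumption, not the Inception-based extraction of the application:

```python
import numpy as np

def extract_features(data, k=2):
    # project each sample onto the top-k principal components,
    # discarding feature-irrelevant variance to shrink later computation
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T
```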
S1003, combine the data having the same label in the public data set with the data in the original data set, and evaluate by using the mean Inception Jensen-Shannon divergence to obtain a third index.
In one embodiment, the public data set importing apparatus 203 aggregates the data having the same label in the public data set with the original data set, and then the third mean Inception Jensen-Shannon divergence module 2032 evaluates the mean Inception Jensen-Shannon divergence to obtain a third index. The third index is used to represent the diversity of the original data set after the data having the same label in the public data set is combined.
In one example, the public data set importing apparatus 203 superimposes the data having the same label in the public data set on the data in the original data set, and then the third mean Inception Jensen-Shannon divergence module 2032 may perform diversity evaluation by using Mean_Inception_JS to obtain a third index. The third index may be denoted by M3.
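A minimal sketch of such a diversity evaluation, assuming each sample is represented by a class-probability vector produced by an Inception classifier. Averaging the Jensen-Shannon divergence over all sample pairs is one plausible reading of Mean_Inception_JS; the application does not spell out the exact formula:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two probability vectors
    p = np.asarray(p, dtype=float); q = np.asarray(q, dtype=float)
    p = p / p.sum(); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_inception_js(prob_vectors):
    # average pairwise JS divergence over the data set
    n = len(prob_vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(js_divergence(prob_vectors[i], prob_vectors[j])
               for i, j in pairs) / len(pairs)
```

A higher value means the class-probability vectors, and hence the samples, are more spread out in the class space.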
S1004, it is determined whether the third index is greater than the first index.
In one embodiment, the third mean Inception Jensen-Shannon divergence module 2032 determines whether the third index is greater than the first index; when the third index is greater than the first index, it is determined that the diversity of the original data set after adding the data having the same label in the public data set is significantly improved, and S1005 continues to be performed. In one example, when M3 is greater than M1, it is considered that the diversity of the original data set is significantly improved after the data having the same tag in the public data set is added. In this way, the new data added to the original data set continuously maintains the diversity of the data set, and the diversity is not reduced by the introduction of new data, thereby solving the problem of data redundancy.
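The accept/reject decision of S1004 and S1005 can be sketched as a simple comparison. Here `diversity` stands in for the Mean_Inception_JS evaluation, and treating the data sets as plain Python lists is an illustrative simplification:

```python
def try_extend(original, candidates, diversity):
    # compute the first index M1 on the original data set and the
    # third index M3 on the combined set; keep the candidates only
    # when M3 > M1, i.e., when diversity is actually improved
    m1 = diversity(original)
    m3 = diversity(original + candidates)
    if m3 > m1:
        return original + candidates, True
    return original, False
```

With a toy diversity measure such as the number of distinct items, a genuinely new sample is accepted while a duplicate is rejected.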
In another embodiment, if it is determined that the third index is less than or equal to the first index, it is determined that the diversity of the original data set after adding the data having the same tag in the public data set is not significantly improved; the method may then return to S1001, where another public data set may be selected to obtain new expansion data.
S1005, use the data having the same tag in the public data set as the second expansion data.
In one embodiment, if the third index is greater than the first index, the data in the public data set having the same tag is used as the second augmented data. In one example, the second augmented data may be represented by X ".
S1006, adding the second expansion data into the original data set to obtain a second data set.
In one embodiment, the second augmented data is added to the original data set and a second data set is formed. In one example, the second data set may also be referred to as a new data set.
S1007, the original data set is replaced with the second data set.
In one embodiment, the original data set is replaced with the second data set. In one example, since the added second augmented data enhances the diversity of the original data set, the second augmented data is added to the original data set to form a new data set, and the original data set is replaced with the new data set. In another example, the second expansion data may be directly added to the original data set to form a new original data set, and the new original data set may be used as the original data set in the subsequent operation. It should be noted by those skilled in the art that the core purpose of S1007 is to add the generated second expansion data to the original data set, so as to continuously expand the original data set. The application is not limited with respect to whether there are specific alternative steps.
It should be noted by those skilled in the art that the method shown in fig. 10 is to perform data expansion on a certain category of tags in the original data set, and when there are multiple categories of tags in the original data set, the method shown in fig. 10 may be performed for each category of tags.
As shown in fig. 11, fig. 11 is a flowchart of another data processing method provided in the embodiment of the present application.
As shown in fig. 11, after S1005, the following steps may also be performed for the second augmented data:
S1101, generate third expansion data from the second expansion data by using the GAN.
In one embodiment, the second expansion data is input to the GAN module 2022 in the GAN data generating device 202, so that the GAN module 2022 generates the third expansion data with the second expansion data as input.
In one example, X" is input to the GAN module 2022, which then generates the third expansion data from the input X". The third expansion data may also be referred to as new data, and in one example may reuse the symbol X' of the first expansion data. It should be noted that in the present application, X' denotes any expansion data generated by the GAN module, which may be the first expansion data or the third expansion data, while X" denotes any expansion data generated without the GAN module, such as the second expansion data.
S1102, combine the third expansion data with the data in the original data set, and evaluate by using the mean Inception Jensen-Shannon divergence to obtain a fourth index.
In one embodiment, the GAN data generating device 202 combines the third expansion data with the original data, and then the second mean Inception Jensen-Shannon divergence module 2023 evaluates the mean Inception Jensen-Shannon divergence to obtain a fourth index. The fourth index is used to represent the diversity of the original data set after the third expansion data is combined.
In one example, the GAN data generating device 202 superimposes the third expansion data on the data in the original data set, and then the second mean Inception Jensen-Shannon divergence module 2023 may perform diversity evaluation by using Mean_Inception_JS to obtain a fourth index. The fourth index may be denoted by M4.
S1103, it is determined whether the fourth index is greater than the first index.
In one embodiment, the second mean Inception Jensen-Shannon divergence module 2023 determines whether the fourth index is greater than the first index; if the fourth index is greater than the first index, it is determined that the diversity of the original data set after the third expansion data is added is significantly improved, and S1104 continues to be performed. In one example, when M4 is greater than M1, it can be considered that the diversity of the original data set is significantly improved after the third expansion data is added. In this way, the new data added to the original data set continuously maintains the diversity of the data set, and the diversity is not reduced by the introduction of new data, thereby solving the problem of data redundancy.
In another embodiment, if it is determined that the fourth index is less than or equal to the first index, it is determined that the diversity of the original data set after the third expansion data is added is not significantly improved. At this point, the process may return to S1001, where another public data set may be selected to obtain new expansion data.
And S1104, adding the third expansion data to the original data set to obtain a third data set.
In one embodiment, third augmented data is added to the original data set and a third data set is formed. In one example, the third data set may also be referred to as a new data set.
S1105, the original data set is replaced with the third data set.
In one embodiment, the original data set is replaced with the third data set. In one example, since the added third augmented data enhances the diversity of the original data set, the third augmented data is added to the original data set to form a new data set, and the original data set is replaced with the new data set. In another example, the third expansion data may be directly added to the original data set to form a new original data set, and the new original data set may be used as the original data set in the subsequent operation. It should be noted by those skilled in the art that the core purpose of S1105 is to add the generated third augmented data to the original data set, so as to achieve continuous augmentation of the original data set. The application is not limited with respect to whether there are specific alternative steps.
The methods shown in fig. 2 to 11 of the present application propose the Inception_JS divergence, a way to measure the distance between high-dimensional data samples, which measures the distance between samples in the class space and ensures the effectiveness of the distance measurement. On this basis, the index Mean_Inception_JS for measuring the diversity of a data set is further provided. The method first screens out an outlier set by using the LOF algorithm, and then combines the screened outlier set with a GAN to generate new data. Meanwhile, the regularization terms Inception_JS and Kernel_MMD are introduced into the loss function of the GAN to control the difference between the distributions of the generated images and the real images. When newly generated data is introduced into the original data set, the Mean_Inception_JS index is used to evaluate whether diversity is improved. The application further expands the original data set by adding samples from a public data set, again using Mean_Inception_JS as the evaluation index and selecting only samples that gain diversity for the original data set, thereby preventing redundant data from being added and negatively affecting the training of the network model.
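The LOF screening step summarized above can be reproduced with scikit-learn's `LocalOutlierFactor`, assuming scikit-learn is available; the sample data, `n_neighbors`, and `contamination` values are illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# a dense cluster plus one far-away sample
features = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                      [[8.0, 8.0]]])

# fit_predict returns -1 for samples flagged as local outliers
lof = LocalOutlierFactor(n_neighbors=5, contamination=0.05)
labels = lof.fit_predict(features)
outlier_set = features[labels == -1]
```

In the described pipeline, `outlier_set` would then be fed to the GAN as the seed for generating new data.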
It should be noted by those skilled in the art that the GAN network structure involved in the technical solution of the present application may also be used for generating scenes such as medical images, and the idea of the overall solution may also be used for continuous acquisition of data in other formats in an intelligent data system, and the present application is not limited herein.
It should be noted by those skilled in the art that the data types in the original data set and the new data set in the present application may be pictures, audio, text, and the like. When the data type is not pictures, the Inception network only needs to be replaced with another network applicable to that data type. Meanwhile, since the original data set is expanded by the expansion data, a more complete classification model can be trained on the expanded original data, achieving more accurate image classification and image recognition.
As shown in fig. 12, fig. 12 is a schematic diagram of a data processing apparatus according to an embodiment of the present application.
Fig. 12 shows a data processing apparatus 1200, the apparatus 1200 comprising: an obtaining module 1201, configured to obtain an original data set; an evaluation module 1202, configured to evaluate the original data set by using a mean Inception Jensen-Shannon divergence to obtain a first index, where the first index is used to represent the diversity of the original data set; a first generating module 1205, configured to input data in the original data set to the generative adversarial network GAN and generate first expansion data; the evaluation module 1202 is further configured to combine the first expansion data and the data in the original data set, and evaluate by using the mean Inception Jensen-Shannon divergence to obtain a second index, where the second index is used to indicate the diversity of the original data set after the first expansion data is combined; an adding module 1206, configured to add the first expansion data to the original data set to obtain a first data set when the second index is greater than the first index; a replacing module 1207, configured to replace the original data set with the first data set.
In one possible implementation, the apparatus 1200 further comprises: a feature extraction module 1203, configured to perform feature extraction on data of the original data set; a detection module 1204, configured to detect data of the original data set after feature extraction by using a local outlier factor LOF algorithm, to obtain an outlier set; the first generating module 1205 is further configured to input the data in the outlier set to the GAN and generate the first augmented data.
In one possible embodiment, the first generating module 1205 includes: a generative adversarial network GAN; the GAN employs a loss function having a plurality of parameters, where the parameters include an Inception Jensen-Shannon divergence and a kernel maximum mean discrepancy.
In one possible implementation, the apparatus 1200 further comprises: an importing module 1209, configured to import, from a public data set, data having the same tags as the tags included in the original data set; the feature extraction module 1203 is further configured to perform feature extraction on the data having the same label in the public data set; the evaluation module 1202 is further configured to combine the data having the same label in the public data set with the data in the original data set, and evaluate by using the mean Inception Jensen-Shannon divergence to obtain a third index, where the third index is used to indicate the diversity of the original data set after the data having the same label in the public data set is combined; a second generating module 1208, configured to take the data having the same tag in the public data set as the second expansion data when the third index is greater than the first index.
In one possible embodiment, having the same tag includes: the tags in the public data set are the same as the tags in the original data set; or the tags in the public data set belong to a subset of the tags in the original data set.
In one possible implementation, the adding module 1206 is further configured to: add the second expansion data to the original data set to obtain a second data set; the replacing module 1207 is further configured to replace the original data set with the second data set.
In one possible embodiment, the first generating module 1205 is further configured to: generate third expansion data from the second expansion data by using the GAN; the evaluation module is further configured to combine the third expansion data and the data in the original data set and evaluate by using the mean Inception Jensen-Shannon divergence to obtain a fourth index, where the fourth index is used to represent the diversity of the original data set after the third expansion data is combined; the adding module 1206 is further configured to add the third expansion data to the original data set to obtain a third data set when the fourth index is greater than the first index; the replacing module 1207 is further configured to replace the original data set with the third data set.
In one possible embodiment, the data type of the data in the original data set, the first data set, the second data set and/or the third data set is a picture type.
As shown in fig. 13, fig. 13 is a schematic diagram of a terminal device according to an embodiment of the present application.
Fig. 13 provides a terminal device 1300, which device 1300 may include a processor 1301, a memory 1302, a communication interface 1303, and a bus 1304. The processor 1301, the memory 1302, and the communication interface 1303 in the terminal apparatus may establish communication connection through a bus 1304. The communication interface 1303 is used for transmitting and receiving external information.
Processor 1301 may be a Central Processing Unit (CPU).
Memory 1302 may include volatile memory, such as random-access memory (RAM); the memory 1302 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 1302 may also include a combination of the above types of memory.
The methods for acquiring diversity data provided in the embodiments of fig. 2 to 11 are all executed by the processor 1301. The file data and/or calculated data in this application will be stored in memory 1302. In addition, the memory 1302 will be used for storing program instructions and the like executed by the processor to implement a method for acquiring diversity data according to the embodiments of fig. 2 to 11.
The application relates to a data processing method and device. By proposing the Inception_JS divergence, an effective feature extraction manner is provided for samples through an Inception network, so that high-dimensional image data can be mapped by the Inception network into an effective representation of images in the class space, yielding an effective distance measure between high-dimensional data. On this basis, Mean_Inception_JS is proposed, and the completeness and diversity of a data set can be effectively evaluated by means of the Mean_Inception_JS index. The method first screens out an outlier set by using the LOF algorithm, generates new data by using a GAN, and introduces the regularization terms Inception_JS and Kernel_MMD into the loss function of the GAN to control the difference between the distributions of the generated and real images. When newly generated data is introduced into the original data set, the Mean_Inception_JS index is likewise used for evaluation. The application also adds samples having the same labels from public data sets, and selects, by means of Mean_Inception_JS, samples that gain diversity for the original data set, which prevents redundant data from being added and negatively affecting the training of the network model.
By the method, diversity data can be continuously acquired, and complete and diverse data sets can be obtained. Meanwhile, the generalization capability of the network model can be effectively improved by the obtained complete training data set.
It will be further appreciated by those of ordinary skill in the art that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two, and that the components and steps of the examples have been described above in general functional terms in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A method of data processing, the method comprising:
acquiring an original data set;
evaluating the original data set by using a mean Inception Jensen-Shannon divergence to obtain a first index, wherein the first index is used for expressing the diversity of the original data set;
inputting the data in the original data set to a generative adversarial network GAN and generating first expansion data;
combining the first expansion data and the data in the original data set, and evaluating by using the mean Inception Jensen-Shannon divergence to obtain a second index, wherein the second index is used for representing the diversity of the original data set after the first expansion data is combined;
when the second index is larger than the first index, adding the first expansion data into the original data set to obtain a first data set;
replacing the original data set with the first data set.
2. The method of claim 1, wherein after said obtaining the first indicator, the method further comprises:
performing feature extraction on the data of the original data set;
detecting the data of the original data set after feature extraction by adopting a local outlier factor LOF algorithm to obtain an outlier set;
inputting data in the outlier set to a GAN and generating the first augmented data.
3. The method of claim 1 or 2, wherein the loss function employed by the generative adversarial network GAN has a plurality of parameters, wherein the plurality of parameters includes an Inception Jensen-Shannon divergence and a kernel maximum mean discrepancy.
4. The method of claim 1, wherein the method further comprises:
for the labels in the original data set, introducing data with the same labels from the public data set;
performing feature extraction on data in the public data sets with the same label;
combining the data having the same label in the public data set and the data in the original data set, and evaluating by using the mean Inception Jensen-Shannon divergence to obtain a third index, wherein the third index is used for representing the diversity of the original data set after combining the data having the same label in the public data set;
and when the third index is larger than the first index, taking the data in the public data set with the same label as second expansion data.
5. The method of claim 4, the having the same tag comprising:
the tags in the public dataset are the same as the tags in the original dataset; or
The tags in the public data set belong to a subset of the tags in the original data set.
6. The method of claim 4 or 5, wherein the method further comprises:
adding the second expansion data to the original data set to obtain a second data set;
replacing the original data set with the second data set.
7. The method of claim 4 or 5, wherein the method further comprises:
generating third expansion data by adopting the GAN for the second expansion data;
combining the third expansion data and the data in the original data set, and evaluating by using the mean Inception Jensen-Shannon divergence to obtain a fourth index, wherein the fourth index is used for representing the diversity of the original data set after the third expansion data is combined;
when the fourth index is larger than the first index, adding the third expansion data into the original data set to obtain a third data set;
replacing the original data set with the third data set.
8. The method of any of claims 1-7, wherein a data type of data in the original data set, the first data set, the second data set, and/or the third data set is a picture type.
9. A data processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring an original data set;
the evaluation module is used for evaluating the original data set by using a mean Inception Jensen-Shannon divergence to obtain a first index, and the first index is used for expressing the diversity of the original data set;
the first generation module is used for inputting the data in the original data set to a generative adversarial network GAN and generating first expansion data;
the evaluation module is further configured to combine the first expansion data and the data in the original data set, and evaluate by using the mean Inception Jensen-Shannon divergence to obtain a second index, where the second index is used to indicate the diversity of the original data set after the first expansion data is combined;
the adding module is used for adding the first expanded data into the original data set to obtain a first data set when the second index is larger than the first index;
a replacement module to replace the original data set with the first data set.
10. The apparatus of claim 9, wherein the apparatus further comprises:
the characteristic extraction module is used for extracting the characteristics of the data of the original data set;
the detection module is used for detecting the data of the original data set after the characteristic extraction by adopting a local outlier factor LOF algorithm to obtain an outlier set;
the first generating module is further configured to input data in the outlier set to a GAN and generate the first augmented data.
11. The apparatus of claim 9 or 10, wherein the first generating module comprises: a generative adversarial network GAN;
the GAN employs a loss function having a plurality of parameters, wherein the parameters include an Inception Jensen-Shannon divergence and a kernel maximum mean discrepancy.
12. The apparatus of claim 9, wherein the apparatus further comprises:
an importing module for importing data having the same label from a public data set for the label in the original data set;
the feature extraction module is further used for extracting features of the data in the public data sets with the same label;
the evaluation module is further configured to combine the data having the same label in the public data set and the data in the original data set, and evaluate by using the mean Inception Jensen-Shannon divergence to obtain a third index, where the third index is used to indicate the diversity of the original data set after combining the data having the same label in the public data set;
and the second generation module is used for taking the data in the public data set with the same label as second expansion data when the third index is larger than the first index.
13. The apparatus of claim 12, the having identical tags comprising:
the tags in the public dataset are the same as the tags in the original dataset; or
The tags in the public data set belong to a subset of the tags in the original data set.
14. The apparatus of claim 12 or 13, wherein the adding module is further to:
adding the second expansion data to the original data set to obtain a second data set;
the replacement module is further configured to replace the original data set with the second data set.
15. The apparatus of claim 12 or 13, wherein the first generating module is further to:
generating third expansion data by adopting the GAN for the second expansion data;
the evaluation module is further configured to combine the third expansion data and the data in the original data set, and evaluate by using the mean Inception Jensen-Shannon divergence to obtain a fourth index, where the fourth index is used to indicate the diversity of the original data set after the third expansion data is combined;
the adding module is further configured to add the third expansion data to the original data set to obtain a third data set when the fourth index is greater than the first index;
the replacement module is further configured to replace the original data set with the third data set.
16. The apparatus of any of claims 9-15, wherein a data type of data in the original data set, the first data set, the second data set, and/or the third data set is a picture type.
17. A computer-readable storage medium having instructions stored thereon, which, when run on a terminal, cause the terminal to perform the method of any one of claims 1-8.
18. A computer system, comprising a processor coupled to a memory, the processor reading and executing instructions in the memory such that the computer system implements the method of any of claims 1-8.
CN201910877235.1A 2019-09-17 2019-09-17 Data processing method and device Pending CN112529025A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910877235.1A CN112529025A (en) 2019-09-17 2019-09-17 Data processing method and device

Publications (1)

Publication Number Publication Date
CN112529025A true CN112529025A (en) 2021-03-19



Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114112390A (en) * 2021-11-23 2022-03-01 哈尔滨工程大学 Early fault diagnosis method for nonlinear complex system
CN114112390B (en) * 2021-11-23 2024-03-22 哈尔滨工程大学 Nonlinear complex system early fault diagnosis method
WO2023116655A1 (en) * 2021-12-20 2023-06-29 华为技术有限公司 Communication method and apparatus

Similar Documents

Publication Publication Date Title
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN109840531B (en) Method and device for training multi-label classification model
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
JP2020536337A (en) Image comparison devices and methods based on deep learning and computer programs stored on computer-readable media
CN112036447B (en) Zero-sample target detection system and learnable semantic and fixed semantic fusion method
CN110728295B (en) Semi-supervised landform classification model training and landform graph construction method
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
Qiu et al. Deep learning-based algorithm for vehicle detection in intelligent transportation systems
CN113095370A (en) Image recognition method and device, electronic equipment and storage medium
CN112801063B (en) Neural network system and image crowd counting method based on neural network system
CN111738054A (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN114419323A (en) Cross-modal learning and domain self-adaptive RGBD image semantic segmentation method
Ahmad et al. 3D capsule networks for object classification from 3D model data
CN115564983A (en) Target detection method and device, electronic equipment, storage medium and application thereof
CN112529025A (en) Data processing method and device
CN115544239A (en) Deep learning model-based layout preference prediction method
CN117274768A (en) Training method of target detection network, target detection method and related device
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN113838135B (en) Pose estimation method, system and medium based on LSTM double-flow convolutional neural network
CN111242028A (en) Remote sensing image ground object segmentation method based on U-Net
CN112418256A (en) Classification, model training and information searching method, system and equipment
CN117058554A (en) Power equipment target detection method, model training method and device
JP7225731B2 (en) Imaging multivariable data sequences
CN116543333A (en) Target recognition method, training method, device, equipment and medium of power system
CN113139540B (en) Backboard detection method and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination