CN112115131A - Data denoising method, device and equipment and computer readable storage medium - Google Patents

Publication number
CN112115131A
Authority
CN
China
Prior art keywords
data
denoised
data set
trained
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011048598.3A
Other languages
Chinese (zh)
Inventor
蔡成飞
田上萱
王红法
郭春超
赵文哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011048598.3A
Publication of CN112115131A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0269 Targeted advertisements based on user profile or attribute

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Data Mining & Analysis (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Computation (AREA)
  • Game Theory and Decision Science (AREA)
  • Computational Linguistics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present application provide a data denoising method, a data denoising device, data denoising equipment and a computer-readable storage medium, and relate to the technical field of artificial intelligence. The method includes: training a preset algorithm model with the data in a data set to be denoised; when the trained algorithm model does not meet a model convergence condition, determining the trained algorithm model as a network model to be trained, and determining a first average distance between each piece of data in the data set to be denoised, which is carried in a data denoising request, and the other data in that data set; determining noise data according to the first average distance and removing the noise data to obtain an updated data set; inputting the data in the updated data set into the network model to be trained as sample data for model training; and when the trained network model meets the model convergence condition, determining the updated data set as the denoised data set. The embodiments of the present application can improve the accuracy of noise data removal.

Description

Data denoising method, device and equipment and computer readable storage medium
Technical Field
The embodiments of the present application relate to the technical field of the Internet, and relate to, but are not limited to, a data denoising method, a data denoising device, data denoising equipment and a computer-readable storage medium.
Background
Deep learning is a field driven by big data, and all deep learning neural network algorithms currently face the problem of noise in the training data. If the training data contains too much noise data, training cannot produce a deep learning algorithm that performs well. As a result, training data processing generally takes 60-80% of the time in the whole algorithm design process, and data denoising generally costs considerable manpower and material resources.
Noise data falls mainly into two classes: simple noise data, whose features differ greatly from those of normal data, and difficult noise data, whose features are close to those of normal data. The commonly used denoising methods are manual denoising and denoising by an algorithmic solution.
However, although manual denoising can handle both types of noise data, it is very time-consuming. The common algorithmic solution performs distance comparison or category prediction comparison based on features extracted by a pre-trained model. Because the accuracy of the algorithm model in this solution is mediocre, it can remove simple noise data but not difficult noise data, and it may remove some normal data by mistake, which reduces the accuracy of noise data removal.
Disclosure of Invention
The embodiments of the present application provide a data denoising method, a data denoising device, data denoising equipment and a computer-readable storage medium, and relate to the technical field of artificial intelligence. The method calculates a first average distance between each piece of data and the other data in a data set to be denoised, removes the noise data whose average distance is abnormally large, uses the remaining data for the next round of training of the network model to be trained, and repeats these steps until the network model to be trained converges, yielding the denoised data set.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a data denoising method, which comprises the following steps:
training a preset algorithm model by adopting data in a data set to be denoised to obtain a trained algorithm model;
when the trained algorithm model does not meet the model convergence condition, determining the trained algorithm model as a network model to be trained, and determining a first average distance between each data in the data set to be denoised and other data in the data set to be denoised;
determining noise data in the data set to be denoised according to the first average distance;
removing the noise data in the data set to be denoised to obtain an updated data set;
inputting the data in the updated data set into a network model to be trained as sample data, and training the network model to be trained to obtain a trained network model;
and when the trained network model obtained by training the data in the updated data set meets the model convergence condition, determining the updated data set as a de-noised data set.
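The steps above can be read as a loop that alternates training with distance-based filtering. The following is a minimal sketch of that loop, not the patented implementation. The callables `train`, `extract_features` and `has_converged`, the `max_rounds` guard, the Euclidean metric and the division by N - 1 are all assumptions introduced for illustration; the source does not specify them.

```python
import numpy as np

def average_distance(features):
    """Mean Euclidean distance from each row to all other rows (assumed metric)."""
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return dists.sum(axis=1) / max(len(features) - 1, 1)

def denoise_while_training(data, train, extract_features, has_converged,
                           threshold, max_rounds=20):
    """Alternate training and filtering until the convergence check passes.

    train(model, data) -> model, extract_features(model, data) -> 2-D array and
    has_converged(model, data) -> bool are hypothetical caller-supplied hooks.
    """
    model = train(None, data)                    # train the preset model on the raw set
    for _ in range(max_rounds):                  # safety guard, not part of the source
        if has_converged(model, data):
            break                                # current data set is the denoised set
        feats = extract_features(model, data)
        keep = average_distance(feats) <= threshold
        data = data[keep]                        # remove noise data -> updated data set
        model = train(model, data)               # retrain on the updated data set
    return data, model
```

With toy hooks (for example, a "model" that is just the data mean), a single far-away point is dropped in the first round and the loop stops once the remaining points pass the convergence check.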
An embodiment of the present application provides a data denoising device, including:
the first training module is used for training a preset algorithm model by adopting data in a data set to be denoised to obtain a trained algorithm model;
the first determining module is used for determining the trained algorithm model as a network model to be trained and determining a first average distance between each data in the data set to be denoised and other data in the data set to be denoised when the trained algorithm model does not meet a model convergence condition;
a second determining module, configured to determine noise data in the data set to be denoised according to the first average distance;
the removing module is used for removing the noise data in the data set to be denoised to obtain an updated data set;
the training module is used for inputting the data in the updated data set into a network model to be trained as sample data, and training the network model to be trained to obtain a trained network model;
and the third determining module is used for determining the updated data set as a de-noised data set when the trained network model obtained by training the data in the updated data set meets the model convergence condition.
In some embodiments, the first determining module is further configured to:
extracting the characteristics of each data in the data set to be denoised to obtain the characteristic vector of each data; and calculating a first average distance between each data in the data set to be denoised and other data in the data set to be denoised according to the feature vector of each data.
In some embodiments, the first determining module is further configured to: acquiring the feature vector of each data and the total amount of data in the data set to be denoised; calculating the distance between each data and each other data through the feature vector of each data and the feature vectors of other data; and determining the first average distance of each data according to the distance between each data and each other data and the total amount of data in the data set to be denoised.
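The three sub-steps just described (obtain the feature vectors and the total amount of data, compute the distance between each pair, and average) can be sketched as follows. Euclidean distance and division by N - 1 (the number of other data) are assumptions; the source only says the average is determined from the pairwise distances and the total amount of data.

```python
import numpy as np

def first_average_distance(features):
    """Average distance from each feature vector to every other one."""
    features = np.asarray(features, dtype=float)
    total = features.shape[0]                          # total amount of data
    diffs = features[:, None, :] - features[None, :, :]
    pairwise = np.sqrt((diffs ** 2).sum(axis=-1))      # (N, N) distance matrix
    # The diagonal is zero, so the row sum covers only the other data.
    return pairwise.sum(axis=1) / max(total - 1, 1)

# Pairwise distances here are 5, 10 and 5, so the averages are 7.5, 5.0 and 7.5.
avg = first_average_distance([[0, 0], [3, 4], [6, 8]])
```

The full (N, N, D) difference tensor is memory-hungry for large data sets; a chunked or Gram-matrix formulation would be a natural substitute there.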
In some embodiments, the apparatus further comprises: a fourth determining module, configured to determine the updated data set as a current data set to be denoised when the trained network model obtained by training with the data in the updated data set does not satisfy the model convergence condition; a fifth determining module, configured to determine a second average distance between each data in the current data set to be denoised and other data in the current data set to be denoised; a sixth determining module, configured to determine, according to the second average distance, noise data in the current data set to be denoised; a second removing module, configured to remove the noise data in the current data set to be denoised to obtain a current updated data set; and a cyclic training module, configured to repeat the step of obtaining the current updated data set until the trained network model obtained by training with the data in the current updated data set meets the model convergence condition, and to determine the current updated data set as the denoised data set.
In some embodiments, the cyclic training module is further configured to: each time the noise data in the data set to be denoised is determined, remove the noise data in the data set to be denoised to obtain the current updated data set.
In some embodiments, the apparatus further comprises: the classification module is used for classifying the data in the data set to be denoised after the data denoising request is received to obtain at least one data class to be denoised; and the noise removing module is used for removing noise of each data class to be denoised so as to obtain a denoised data set corresponding to each data class to be denoised.
In some embodiments, the second determination module is further configured to: and when the first average distance of any data is larger than a preset threshold value, determining the corresponding data as the noise data.
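Combining the two preceding embodiments, the data can first be grouped by class, and within each class a datum is flagged as noise when its average distance to the other members of that class exceeds the threshold. The sketch below assumes class labels are already available, uses Euclidean distance, and applies one shared threshold to every class; none of these choices is fixed by the source.

```python
import numpy as np

def noise_mask_per_class(features, labels, threshold):
    """Boolean mask: True where a datum is flagged as noise within its class."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    mask = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)              # members of this class
        group = features[idx]
        diffs = group[:, None, :] - group[None, :, :]
        avg = np.linalg.norm(diffs, axis=-1).sum(axis=1) / max(len(idx) - 1, 1)
        mask[idx] = avg > threshold                    # preset threshold (assumed shared)
    return mask
```

The updated data set for each class is then `features[~mask]` restricted to that class.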
In some embodiments, the apparatus further comprises: the first input module is used for inputting the data in the updated data set into a network model to be trained as sample data, training the network model to be trained to obtain a trained network model, and then acquiring an output result of the trained network model; the second input module is used for inputting the output result into a preset loss model to obtain a loss result; and the judging module is used for determining whether the trained network model meets the model convergence condition according to the loss result.
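The source does not say which loss model or convergence condition is used. As one hedged illustration, the sketch below takes cross-entropy as the preset loss model and treats the model as converged when the loss has changed by less than a tolerance over the last few rounds. The names `cross_entropy` and `meets_convergence` and the parameters `tol` and `patience` are hypothetical.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy of predicted class probabilities (assumed loss model)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]     # probability of the true class
    return float(-np.mean(np.log(picked + 1e-12)))

def meets_convergence(loss_history, tol=1e-3, patience=2):
    """True when the last `patience` loss changes are all smaller than `tol`."""
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(recent[i] - recent[i + 1]) < tol for i in range(patience))
```

In a training loop, each round would append `cross_entropy(outputs, labels)` to `loss_history` and stop filtering once `meets_convergence(loss_history)` returns true.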
In some embodiments, the apparatus further comprises: a seventh determining module, configured to determine a third average distance between each noise data removed in the previous round of data denoising and the data in the update data set; an eighth determining module, configured to determine, when the third average distance of any noise data is smaller than a preset threshold, corresponding noise data as recall data; the adding module is used for adding the recall data into the update data set to form an update data set with the recall data; and the retraining module is used for retraining the trained network model by adopting the updated data set with the recall data until the trained network model meets the model convergence condition.
In some embodiments, the apparatus further comprises: and the control module is used for not recalling any data when the data is determined to be the noise data and the removed times are greater than a time threshold value.
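The recall rule in the two preceding embodiments can be sketched as a mask over the previously removed data: a removed datum is recalled when its average distance to the data in the updated data set falls below the threshold, unless it has already been removed more often than the count threshold allows. Euclidean distance, the parameter names, and a non-empty updated data set are assumptions for illustration.

```python
import numpy as np

def recall_mask(kept_feats, removed_feats, removal_counts,
                recall_threshold, max_removals):
    """Boolean mask over removed data: True where the datum should be recalled."""
    kept = np.asarray(kept_feats, dtype=float)         # features of the updated data set
    removed = np.asarray(removed_feats, dtype=float)   # features of removed noise data
    counts = np.asarray(removal_counts)
    diffs = removed[:, None, :] - kept[None, :, :]
    third_avg = np.linalg.norm(diffs, axis=-1).mean(axis=1)   # third average distance
    close_enough = third_avg < recall_threshold
    still_eligible = counts <= max_removals            # removed too often -> never recalled
    return close_enough & still_eligible
```

Both feature arrays would normally be re-extracted with the most recently trained model, so that data wrongly removed by an earlier, weaker model gets a chance to return.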
Embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium; the processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them to implement the data denoising method described above.
An embodiment of the present application provides a data denoising device, including:
a memory for storing executable instructions; and the processor is used for realizing the data denoising method when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer-readable storage medium, which stores executable instructions and is used for causing a processor to execute the executable instructions to realize the data denoising method.
The embodiment of the application has the following beneficial effects: determining a first average distance between each data in the data set to be denoised and other data; determining noise data in a data set to be denoised according to the first average distance; removing noise data to obtain an updated data set; and inputting data in the updated data set into a network model to be trained as sample data, training the network model to be trained until the trained network model meets a model convergence condition, and obtaining a data set with the denoising completed. Therefore, the network model to be trained can be trained circularly, the data set to be denoised is denoised gradually, and the accuracy of removing noise data can be improved. Moreover, because the denoising method codes are coupled and embedded into various network models to be trained, additional human-computer interaction operation is not needed, and therefore a user can more conveniently and rapidly use the method of the embodiment of the application to denoise data and train the models.
Drawings
FIG. 1A is a schematic diagram of a training sample data distribution containing noisy data according to an embodiment of the present disclosure;
FIG. 1B is a flow chart of a manual denoising method in the related art;
FIG. 2 is a schematic diagram of an alternative architecture of a data denoising system according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a server provided in an embodiment of the present application;
FIG. 4 is a schematic flow chart of an alternative data denoising method according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow chart of an alternative data denoising method according to an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of an alternative data denoising method according to an embodiment of the present disclosure;
FIG. 7 is a schematic flow chart of an alternative data denoising method according to an embodiment of the present disclosure;
FIG. 8 is a schematic flow chart of an alternative data denoising method according to an embodiment of the present disclosure;
FIG. 9 is a schematic flow chart of an alternative data denoising method according to an embodiment of the present application;
FIG. 10 is a schematic main flow chart of a data denoising method provided in an embodiment of the present application;
FIG. 11 is a diagram of a learning architecture of a neural network model provided by an embodiment of the present application;
FIG. 12 is a schematic diagram showing the distribution comparison of sample data for training multiple rounds.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments of the present application belong. The terminology used in the embodiments of the present application is for the purpose of describing the embodiments of the present application only and is not intended to be limiting of the present application.
For a better understanding of the data denoising method provided in the embodiments of the present application, noise data and the data denoising methods in the related art are described first:
FIG. 1A is a schematic diagram of a distribution of training sample data containing noise data according to an embodiment of the present application, in which dots represent normal data and stars represent noise data. As shown in FIG. 1A, noise data is mainly classified into simple noise 101 (i.e., the noise data outside the dashed circle 102 in the figure), whose features differ greatly from those of normal data, and difficult noise 103 (i.e., the noise data inside the dashed circle 102 in the figure), whose features are close to those of normal data.
Based on the training sample data distribution of FIG. 1A, the noise removal methods in the related art include manual denoising and denoising by an algorithmic solution. Manual work can generally handle both types of noise, but is very time-consuming. The common algorithmic solution performs distance comparison or category prediction comparison based on features extracted by a pre-trained model. Because the pre-trained model has no clean data to learn from, its accuracy is mediocre, and because the features of difficult noise are close to the features and categories of normal data, such a solution can generally remove only simple noise, not difficult noise, and may also remove some normal data by mistake. With reference to FIG. 1A: commonly used algorithms typically clean the simple noise outside the dashed circle 102, but fail to clean the difficult noise inside the dashed circle 102, and also mistakenly clean some normal samples outside the dashed circle 102.
The currently common data denoising method is basically an offline data processing mode, and mainly comprises the following steps:
1) Manual cleaning. FIG. 1B is a flow chart of a manual denoising method in the related art, and the method includes the following steps:
Step S101, learning a pre-training algorithm model by using noisy data.
Step S102, classifying the original data by the algorithm model.
Step S103, comparing the category of the original data with the predicted category.
Step S104, judging whether the category of the original data is abnormal. If the judgment result is yes, step S105 is executed; if the judgment result is no, step S106 is executed.
Step S105, clearing the noise data.
Step S106, keeping the normal data.
Step S107, learning and training a final model by using the denoised data.
Clearing noise data manually can effectively remove all sample noise, but it is extremely time-consuming and labor-intensive: for a data set with millions of entries, it usually takes dozens of people weeks or even months. Manual noise removal is therefore used only in scenarios with a small amount of data or a high requirement on algorithm accuracy.
2) Algorithmic cleaning. A pre-trained model is obtained by training on the noisy data set, noise data is then removed by comparing each sample's category with the category predicted by the pre-trained model, and the final algorithm model is then trained on the denoised data.
The algorithmic solution based on deep learning neural networks generally handles large amounts of training data. However, the pre-trained algorithm model is trained on noisy data, so its accuracy is not very high. As a result, when the pre-trained model predicts the categories of the original data, it can correctly predict simple noise samples with large feature differences, but it cannot correctly predict difficult noise whose features are close to those of normal samples, and it may misidentify some normal samples with large feature differences. For example, the data outside the dashed circle 102 in FIG. 1A may all be treated as abnormal noise data; although most of it is noise data, a small amount of normal data may be included, and the noise data inside the dashed circle 102 is still not removed. In addition, the algorithmic solution is an offline noise removal method: after model training is completed, the user has to manually run the algorithm model on the original data again and then remove the noise. The whole model training process therefore requires repeated interaction with the user.
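For contrast with the method of the present application, the related-art approach of comparing labelled categories with predicted categories can be sketched as follows. A nearest-centroid classifier stands in for the pre-trained model purely for illustration; the related art described here would use a neural network trained on the noisy data, and the function name is hypothetical.

```python
import numpy as np

def flag_by_prediction_mismatch(features, labels):
    """Flag samples whose predicted class differs from their given label.

    The centroids are fitted on the noisy labels themselves, which mirrors the
    weakness discussed above: the stand-in model learns from unclean data.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    predicted = classes[dists.argmin(axis=1)]          # nearest-centroid prediction
    return predicted != labels
```

A mislabelled sample far from its labelled class is flagged, but a noise sample lying close to its labelled class would pass unnoticed, which is the difficult-noise case the paragraph above describes.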
To solve at least one problem of the data denoising methods in the related art, an embodiment of the present application provides a data denoising method that removes noise online during deep learning network training. After each round, or every several rounds, of training of the neural network model to be trained, the distances between the samples within each class (i.e., the sample data in the data set to be denoised) are calculated to obtain, for each sample, the average distance to the other samples in its class; data whose average distance is abnormally large is removed, and the remaining data is used for the next round of training. The neural network model obtained in the next round therefore performs better than that of the previous round. Abnormal data is then cleared in the current round based on the features from the new round of training, and data mistakenly cleared in the previous round is recalled, until the neural network model to be trained converges. With the method of the embodiments of the present application, noise can be removed gradually, from simple to difficult, and mistakenly removed normal data can be recalled, so that a neural network algorithm that performs well can be learned without consuming a large amount of manpower. In addition, compared with the conventional offline mode in which training and denoising are separate modules, the denoising code of the embodiments of the present application can be coupled and embedded into the training modules of various deep learning neural network models without additional human-computer interaction, so that the method is more convenient and faster for a user to use.
The data denoising method provided by the embodiment of the application comprises the steps of firstly, training a preset algorithm model by adopting data in a data set to be denoised to obtain the trained algorithm model; then, when the trained algorithm model does not meet a model convergence condition, determining the trained algorithm model as a network model to be trained, and determining a first average distance between each data in a data set to be denoised and other data in the data set to be denoised; determining noise data in a data set to be denoised according to the first average distance; removing noise data in a data set to be denoised to obtain an updated data set; then, inputting the data in the updated data set into the network model to be trained as sample data, and training the network model to be trained to obtain a trained network model; and finally, when the trained network model obtained by training the data in the updated data set meets the model convergence condition, determining the updated data set as the data set subjected to denoising. Therefore, the network model to be trained can be trained circularly, so that the data set to be denoised is denoised gradually, and the accuracy of removing noise data can be improved; moreover, because the denoising method codes are coupled and embedded into various network models to be trained, additional human-computer interaction operation is not needed, and therefore a user can more conveniently and rapidly use the method of the embodiment of the application to denoise data and train the models.
An exemplary application of the data denoising device according to the embodiments of the present application is described below. In one implementation, the data denoising device may be implemented as any terminal, such as a notebook computer, a tablet computer, a desktop computer, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), or an intelligent robot. In another implementation, the data denoising device may be implemented as a server. An exemplary application in which the data denoising device is implemented as a server is described next.
Referring to FIG. 2, FIG. 2 is a schematic diagram of an alternative architecture of a data denoising system 10 according to an embodiment of the present application. In some embodiments, to remove the noise data in a data set to be denoised accurately and effectively, the data denoising system 10 includes a terminal 100, a network 200, and a server 300. The terminal 100 runs a data denoising application, through whose client a user can request removal of the noise data in the data set to be denoised. In implementation, the terminal 100 sends a data denoising request to the server 300 through the network 200. The data denoising request includes the data set to be denoised, which includes at least one piece of data, comprising both normal data and noise data. After receiving the data denoising request, the server 300, in response to the request, determines a first average distance between each data in the data set to be denoised and other data in the data set to be denoised; determines noise data in the data set to be denoised according to the first average distance; removes the noise data in the data set to be denoised to obtain an updated data set; inputs the data in the updated data set into the network model to be trained as sample data, and trains the network model to be trained to obtain a trained network model; when the trained network model obtained by training with the data in the updated data set meets a model convergence condition, determines the updated data set as the denoised data set; and determines the denoised data set as the result of the data denoising request and sends that result to the terminal 100.
The data denoising method provided by the embodiments of the present application also relates to the technical field of artificial intelligence, and can be implemented at least through machine learning technology within artificial intelligence technology. Machine Learning (ML) is a multidisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize its existing knowledge structure so as to keep improving its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations. In the embodiments of the present application, the response to the data denoising request is implemented through machine learning technology, so as to automatically remove the noise data from the data set to be denoised and to train and optimize the network model.
Fig. 3 is a schematic structural diagram of a server 300 according to an embodiment of the present application, where the server 300 shown in fig. 3 includes: at least one processor 310, memory 350, at least one network interface 320, and a user interface 330. The various components in server 300 are coupled together by a bus system 340. It will be appreciated that the bus system 340 is used to enable communications among the components connected. The bus system 340 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 340 in fig. 3.
The Processor 310 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 330 includes one or more output devices 331, including one or more speakers and/or one or more visual display screens, that enable presentation of media content. The user interface 330 also includes one or more input devices 332, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 350 optionally includes one or more storage devices physically located remote from processor 310. The memory 350 may include either volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 350 described in embodiments herein is intended to comprise any suitable type of memory. In some embodiments, memory 350 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below, to support various operations.
An operating system 351 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 352 for communicating with other computing devices via one or more (wired or wireless) network interfaces 320, exemplary network interfaces 320 including: Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB), etc.;
an input processing module 353 for detecting one or more user inputs or interactions from one of the one or more input devices 332 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software, and fig. 3 illustrates a data denoising apparatus 354 stored in the memory 350, where the data denoising apparatus 354 may be a data denoising apparatus in the server 300, and may be software in the form of programs and plug-ins, and the like, and includes the following software modules: the first training module 3541, the first determining module 3542, the second determining module 3543, the removing module 3544, the training module 3545, and the third determining module 3546, which are logical and thus may be arbitrarily combined or further separated depending on the functionality implemented. The functions of the respective modules will be explained below.
In other embodiments, the apparatus provided in the embodiments of the present application may be implemented in hardware. For example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to execute the data denoising method provided in the embodiments of the present application. Such a processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The data denoising method provided by the embodiment of the present application will be described below with reference to an exemplary application and implementation of the server 300 provided by the embodiment of the present application. Referring to fig. 4, fig. 4 is an optional flowchart of a data denoising method provided in an embodiment of the present application, and will be described with reference to the steps shown in fig. 4.
Step S401, training a preset algorithm model by using data in a data set to be denoised to obtain the trained algorithm model.
Here, the data set to be denoised includes at least one piece of data, and each piece of data may be noise data or normal data. Within the same data set to be denoised, the data may all be of the same type, that is, each piece of data has a label and all the labels in the same data set to be denoised are the same. For example, the data set to be denoised includes at least one picture, and the label of each picture is "landscape". However, the labels previously assigned to some of the pictures may be wrong, in which case a picture carrying a wrong label (for example, a picture of a person) is the noise data to be removed in the embodiment of the present application.
The embodiment of the application can be applied to the following scenes: the terminal runs a data denoising application, when a user wants to denoise data in any data set to be denoised, the data in the data set to be denoised can be input into the data denoising application, the client of the data denoising application sends the data set to be denoised to an application server, and the server is used for denoising the data set to be denoised.
Alternatively, it can also be applied to the following scenarios: the terminal is provided with an image processing application (such as an image recognition application) in operation, when a user wants to perform image processing on any image to be processed, an image processing model needs to be trained to perform image processing on the image to be processed, and the image processing model needs to be trained with sample data with certain accuracy, so that a certain number of images with the same label can be obtained from an image database, a data set to be denoised is formed, and the data set to be denoised is sent to the server to be denoised. And after the denoising treatment is carried out to obtain a clean data set, training the image processing model by adopting the clean data set to obtain the trained image processing model. And the terminal performs image processing on the image to be processed by adopting the trained image processing model.
The preset algorithm model may be any type of network model. For example, when the data in the data set to be denoised is of a picture type, the preset algorithm model may be a convolutional neural network; when the data is of a video type, the preset algorithm model may be a 3D convolutional network; and when the data is of a text type, the preset algorithm model may be a recurrent neural network.
In the embodiment of the application, after the data set to be denoised is obtained, the data in the data set to be denoised is used as sample data to train the preset algorithm model for the first time. At this point the data set to be denoised may contain noise data, in which case the trained algorithm model fails to converge, that is, it cannot meet a specific model convergence condition. When the data set to be denoised contains no noise data, the trained algorithm model converges.
In the embodiment of the application, after the data set to be denoised is obtained, the data set to be denoised can be directly adopted for model training, that is, during the first training, the data set which is not denoised is adopted for model training. Therefore, when the data set to be denoised obtained at the beginning is a clean data set, the step of denoising is not needed, and the data set to be denoised is judged to be the clean data set directly according to the fact that the trained network model meets the model convergence condition, so that the step of denoising is greatly simplified.
Step S402, when the trained algorithm model does not meet the model convergence condition, determining the trained algorithm model as a network model to be trained, and determining a first average distance between each data in the data set to be denoised and other data in the data set to be denoised.
Here, when the trained algorithm model does not satisfy the model convergence condition, it indicates that the trained algorithm model does not converge, and at the same time, it may indicate that the data set to be denoised contains noise data, and it is necessary to remove the noise data in the data set to be denoised, so that a first average distance between each data and other data is determined, and a data denoising process is further completed based on the first average distance.
The first average distance is an average degree of difference between each data and the other data, and indicates that the degree of difference between the data and the other data is larger when the first average distance is larger, and indicates that the degree of difference between the data and the other data is smaller when the first average distance is smaller. The first average distance may be an average of the distances between each data and each other data.
Step S403, determining noise data in the data set to be denoised according to the first average distance.
In the embodiment of the present application, the noise data includes simple noise data and difficult noise data, wherein the simple noise data refers to noise data with larger difference and farther distance from the normal data, and the difficult noise data refers to noise data with smaller difference and closer distance from the normal data.
In this embodiment of the present application, data with a first average distance greater than a specific threshold may be determined as noise data, where the specific threshold may be preset by a user, or may be a value determined according to the first average distance of each data in a data set to be denoised.
Step S404, removing noise data in the data set to be denoised to obtain an updated data set.
Here, the determined noise data may be directly removed from the data set to be denoised, so as to obtain an updated data set that is subjected to a primary denoising process.
Step S405, inputting the data in the updated data set into the network model to be trained as sample data, and training the network model to be trained to obtain the trained network model.
Here, the network model to be trained may be any one type of network model, and for example, the network model to be trained may be an image processing model, a text processing model, a video recognition model, and the like. For different network models, the training method corresponding to the model can be adopted for training.
Step S406, when the trained network model obtained by training the data in the updated data set meets the model convergence condition, determining the updated data set as the data set with the de-noising completed.
Here, each time the network model to be trained is trained to obtain a trained network model, it is determined whether the trained network model satisfies the model convergence condition, that is, whether the model has converged. This is done by a loss convergence determination, which checks whether the loss result corresponding to the output of the model converges. If the loss result converges, the trained network model is determined to meet the model convergence condition. This indicates that the data in the current updated data set can train an accurate network model, and therefore that noise data has been effectively removed from the current updated data set, so the current updated data set is determined as the data set for which denoising is complete. If the loss result does not converge, the trained network model is determined not to meet the model convergence condition, and the updated data set continues to be denoised before further model training.
In this embodiment of the application, the model convergence condition may be a condition that is set by a user in advance according to the type of the network model to be trained, and the model convergence condition is a convergence condition corresponding to a loss result of the model, that is, whether the trained network model meets the model convergence condition is determined according to the loss result of the model.
The data denoising method provided by the embodiment of the application determines a first average distance between each data in a data set to be denoised and the other data; determines noise data in the data set to be denoised according to the first average distance; removes the noise data to obtain an updated data set; and inputs the data in the updated data set into a network model to be trained as sample data, training the network model to be trained until the trained network model meets a model convergence condition, at which point the data set for which denoising is complete is obtained. In this way, the network model to be trained is trained cyclically and the data set to be denoised is denoised step by step, which can improve the accuracy of noise data removal. Moreover, because the denoising code is coupled with and embedded in the training of the various network models to be trained, no additional human-computer interaction is needed, so a user can use the method of the embodiment of the application to denoise data and train models more conveniently and quickly.
In some embodiments, the method of the embodiment of the present application may also be applied to a scenario in which a network model to be trained is trained through a data set subjected to denoising, and data processing is performed by using the trained network model. For example, the network model to be trained may be an image recognition model. Here, the network model to be trained is an image recognition model, and the data denoising system includes a terminal and a server. Fig. 5 is an optional flowchart of the data denoising method according to the embodiment of the present application, and as shown in fig. 5, the method includes the following steps:
step S501, a terminal obtains user operation, and the user operation is used for requesting denoising processing on a data set to be denoised.
In the embodiment of the application, the terminal runs an image recognition application. Before image recognition, a user can choose to denoise a specific data set to be denoised, so that the denoised data set is used to train an image recognition model. Correspondingly, the client of the image recognition application provides a denoising button and a selection button for at least one data set to be denoised. When a user wants to perform model training with a certain data set to be denoised, that data set can be denoised first: the user selects the data set to be denoised on the terminal and clicks the denoising button to trigger the denoising process.
In the embodiment of the application, the user operation comprises an operation of selecting a data set to be denoised by a user and an operation of clicking a denoising button.
Step S502, the terminal sends a data denoising request to the server, wherein the data denoising request comprises a data set to be denoised.
In step S503, the server determines a first average distance between each data in the data set to be denoised and other data in the data set to be denoised.
Step S504, the server determines noise data in the data set to be denoised according to the first average distance.
Step S505, the server removes the noise data in the data set to be denoised to obtain an updated data set.
Step S506, the server acquires the network model to be trained according to the data denoising request.
Step S507, the server inputs the data in the updated data set into the network model to be trained as sample data, and trains the network model to be trained to obtain the trained network model.
It should be noted that steps S502 to S507 are the same as steps S401 to S406, and are not described again in the embodiment of the present application.
Step S508, the server determines whether the trained network model obtained by training with the data in the updated data set satisfies the model convergence condition.
When the judgment result is yes, step S509 is executed. When the judgment result is no, the updated data set is determined as the data set to be denoised and the process returns to step S503, so that the cyclic denoising process continues until the trained network model obtained by training with the data in the denoised updated data set converges.
In step S509, the server determines the updated data set as a denoised data set.
Step S510, the server sends the denoised data set to the terminal, and sends the trained network model that meets the model convergence condition to the terminal.
Step S511, the terminal acquires an image to be recognized. The image to be recognized may be any type of image.
Step S512, the terminal takes the trained network model as an image recognition model, and uses the image recognition model to recognize the image to be recognized to obtain an image recognition result.
Because the trained network model meets the model convergence condition, the trained network model can be used independently, and therefore the trained network model is used as an image recognition model to perform image recognition processing on the image to be recognized.
In the embodiment of the application, after the image recognition model is obtained through training, the image recognition model can be stored, and a specific label or identifier is added to the image recognition model to distinguish the image recognition model from other trained network models. Therefore, in the subsequent image recognition process, a user can directly select the model to perform image recognition without performing data denoising again or training based on the denoised data set to obtain a new image recognition model.
In step S513, the terminal displays the image recognition result on the current interface.
The data denoising method provided by the embodiment of the application can be deployed in any practical scenario that requires data processing with a trained network model. That is, the application product corresponding to the method of the embodiment of the application may be any application product, such as an image recognition application, a text processing application, or a video processing application.
In the application, when no specific trained network model exists, a user can train a model in real time to obtain one that meets the user's data processing requirement: any data set to be denoised is denoised, and the denoised data set is used for model training to obtain the network model the user needs. First, this yields a network model that best meets the user's requirements and improves the user experience. Second, because the network model is trained with a clean, denoised data set, the accuracy of the trained network model is higher. Third, the method provided by the embodiment of the application denoises the data set to be denoised into a clean updated data set, which provides effective and accurate preprocessing of the data.
Based on fig. 4, fig. 6 is an optional flowchart of the data denoising method provided in the embodiment of the present application. As shown in fig. 6, the step in step S402 of determining the first average distance between each data in the data set to be denoised and other data in the data set to be denoised may be implemented by the following steps:
step S601, extracting the characteristics of each data in the data set to be denoised to obtain the characteristic vector of each data.
Here, a feature extraction network may be used to extract features of data, and the feature extraction network may be determined according to the type of features to be extracted, for example, when extracting features of a picture, the feature extraction network may be a convolutional neural network; when extracting text features, the feature extraction network may be a recurrent neural network; when extracting video features, the feature extraction network may be a 3D convolutional network or the like.
Step S602, calculating a first average distance between each data in the data set to be denoised and other data in the data set to be denoised according to the feature vector of each data.
In some embodiments, calculating the first average distance in step S602 may be implemented by: acquiring a feature vector of each data and the total amount of data in a data set to be denoised; calculating the distance between each data and each other data through the feature vector of each data and the feature vectors of other data; and determining a first average distance of each datum according to the distance between each datum and each other datum and the total amount of the data in the data set to be denoised. Correspondingly, calculating the first average distance may be achieved by the following equation (1-1):
$$\bar{d}_i = \frac{1}{N}\sum_{j=1}^{N} d(x_i, x_j) \tag{1-1}$$

where $x_i$ represents the feature vector of the $i$-th data in the data set to be denoised; $x_j$ represents the feature vector of the $j$-th data in the data set to be denoised; $N$ represents the total amount of data in the data set to be denoised; $d(\cdot)$ represents a distance function; and $\bar{d}_i$ denotes the first average distance between $x_i$ and the other data $x_j$.
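As an illustration of the calculation described above, the following minimal Python sketch computes an average distance for each feature vector. The Euclidean distance, the division by the total amount of data N, and the names `euclidean` and `average_distances` are assumptions made for this example; the embodiment does not fix a particular distance function.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance; an assumed stand-in for the distance function d()."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def average_distances(features: List[Sequence[float]]) -> List[float]:
    """Return the average distance of each feature vector to the data set.

    Assumption: the pairwise distances are summed and divided by N, the
    total amount of data. The distance of a vector to itself is 0, so it
    does not contribute to the sum.
    """
    n = len(features)
    return [
        sum(euclidean(x_i, x_j) for x_j in features) / n
        for x_i in features
    ]


# Three nearby vectors and one vector far away from them.
features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
avg = average_distances(features)

# Index of the data with the largest average distance.
outlier_index = max(range(len(avg)), key=avg.__getitem__)
```

In this toy input the last vector has by far the largest average distance, which is the property the subsequent threshold comparison relies on.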
Fig. 7 is an optional flowchart of the data denoising method according to the embodiment of the present application, and as shown in fig. 7, the method includes the following steps:
step S701, a data denoising request is received, wherein the data denoising request comprises a data set to be denoised.
Step S702, determining a first average distance between each data in the data set to be denoised and other data in the data set to be denoised.
Step S703, determining noise data in the data set to be denoised according to the first average distance.
Step S704, removing noise data in the data set to be denoised to obtain an updated data set.
Step S705, inputting the data in the updated data set into the network model to be trained as sample data, and training the network model to be trained to obtain the trained network model.
It should be noted that steps S701 to S705 are the same as steps S401 to S405, and are not described again in the embodiment of the present application.
Step S706, judging whether the trained network model obtained by training the data in the updated data set meets the model convergence condition.
When the judgment result is yes, step S707 is executed; when the determination result is no, step S708 is executed.
Step S707, the updated data set is determined as the data set that has been denoised.
Step S708, determining the updated data set as the current data set to be denoised.
Here, when the trained network model obtained by training the data in the updated data set does not satisfy the model convergence condition, the network model needs to be trained continuously until the trained network model satisfies the model convergence condition, so that the current updated data set can be determined as the current data set to be denoised, and the current data set to be denoised is the data set which needs to be denoised again.
Step S709, determining a second average distance between each data in the current data set to be denoised and other data in the current data set to be denoised.
Here, the method of determining the second average distance is the same as the method of determining the first average distance described above.
Step S710, determining noise data in the current data set to be denoised according to the second average distance.
Step S711, removing noise data in the current data set to be denoised to obtain a current updated data set.
Step S712, repeating the step of obtaining the current updated data set until the trained network model obtained by training with the data in the current updated data set satisfies the model convergence condition, and determining the current updated data set as the data set for which denoising is complete.
That is to say, if the trained network model obtained by training the data in the updated data set does not satisfy the model convergence condition, the updated data set is determined as the current data set to be denoised, the steps from step S702 to step S706 are executed in a circulating manner, after the noise data in the data set to be denoised is determined each time, the noise data in the data set to be denoised is removed, the current updated data set is obtained, and the circulation is stopped until the trained network model obtained by training the data in the current updated data set satisfies the model convergence condition.
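The cyclic procedure described above can be sketched as follows. This is only an illustrative outline under stated assumptions: `denoise_loop` and `toy_train_and_check` are hypothetical names, the "model" is replaced by a toy centroid fit whose mean squared distance plays the role of the loss, and `max_rounds` is an added safeguard that the embodiment does not mention.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def average_distance(i: int, data: List[Vector]) -> float:
    """Average distance from data[i] to the data set (divided by N, an assumption)."""
    return sum(math.dist(data[i], x) for x in data) / len(data)


def denoise_loop(
    data: List[Vector],
    train_and_check: Callable[[List[Vector]], bool],
    threshold: float,
    max_rounds: int = 10,
) -> Tuple[List[Vector], bool]:
    """Alternate training and noise removal until the convergence check passes."""
    current = list(data)
    converged = train_and_check(current)  # first training on the raw data set
    rounds = 0
    while not converged and rounds < max_rounds:
        dists = [average_distance(i, current) for i in range(len(current))]
        kept = [x for x, d in zip(current, dists) if d <= threshold]
        if not kept or len(kept) == len(current):
            break  # nothing usable left, or nothing was removed
        current = kept  # the updated data set becomes the set to be denoised
        converged = train_and_check(current)
        rounds += 1
    return current, converged


def toy_train_and_check(data: List[Vector], tolerance: float = 1.0) -> bool:
    """Hypothetical stand-in for model training plus the convergence check."""
    dim = len(data[0])
    centroid = [sum(x[k] for x in data) / len(data) for k in range(dim)]
    loss = sum(math.dist(x, centroid) ** 2 for x in data) / len(data)
    return loss < tolerance


raw = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
clean, ok = denoise_loop(raw, toy_train_and_check, threshold=5.0)
```

In a real system `train_and_check` would train the network model to be trained on the current data set and evaluate its loss; the loop structure stays the same.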
In some embodiments, after the data set to be denoised is acquired and before step S401, the method further includes: step S41, classifying the data in the data set to be denoised to obtain at least one data class to be denoised.
In the embodiment of the application, after the data set to be denoised is obtained, it may be classified first. For example, the data set to be denoised may be classified according to the labels of the data, with data carrying the same label placed in the same data class to be denoised. Labels for data here include, but are not limited to: images, texts, videos, and the like. Alternatively, when all the data in the data set to be denoised are images, the labels of the data include, but are not limited to: people, landscapes, buildings, animals, etc. Alternatively, in some embodiments, the data in the data set to be denoised carries a primary label and a secondary label, where the primary label may include but is not limited to: images, text, video, etc., and the secondary labels under the image label may include but are not limited to: people, landscapes, buildings, animals, etc.
In the embodiment of the application, data can be classified once according to the primary label, and the secondary label is adopted to classify the data again after the primary label is adopted to classify the data, so that more precise data to be denoised are obtained.
Step S42, performing noise removal on each class of data to be denoised to obtain a denoised data set corresponding to each class of data to be denoised.
Here, after the data classification, the denoising method provided in the embodiment of the present application may be adopted to perform denoising processing on each class of data, that is, perform noise removal on each class of data to be denoised, so as to obtain a denoised data set corresponding to each class of data to be denoised.
For example, when a user wants to train to obtain an image recognition model for performing image recognition, an original data set, which is an unprocessed data set including a large number of images, texts, and videos, may be first obtained from a large database, and at this time, the large number of unprocessed data may be classified according to a primary label to obtain an image data class, a text data class, and a video data class, and an image data class is selected, and the image data class is secondarily classified by using a secondary label to obtain image data of different classes. At this time, an image data set of the animal category can be selected, and the image data set of the animal category is used as a data set to be denoised.
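A minimal sketch of the two-level classification described above is given here. The record layout `(item, primary label, secondary label)`, the sample identifiers, and the function name `group_by_labels` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str, str]  # (item id, primary label, secondary label)


def group_by_labels(records: List[Record]) -> Dict[Tuple[str, str], List[str]]:
    """Group items first by primary label and then by secondary label."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for item, primary, secondary in records:
        groups[(primary, secondary)].append(item)
    return dict(groups)


records = [
    ("img_001", "image", "animal"),
    ("img_002", "image", "landscape"),
    ("img_003", "image", "animal"),
    ("txt_001", "text", "news"),
]
classes = group_by_labels(records)

# The animal image class is then treated as one data set to be denoised.
animal_set = classes[("image", "animal")]
```

Each resulting group can be denoised independently, since all items in a group share the same label.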
Based on fig. 4, fig. 8 is an optional flowchart of the data denoising method provided in the embodiment of the present application, and as shown in fig. 8, step S403 may be implemented by the following steps:
step S801, determining whether the first average distance of each data is greater than a preset threshold. When the determination result is yes, step S802 is executed; when the determination result is no, the process returns to step S801 to evaluate the next data.
In some embodiments, when the network model to be trained is subjected to multiple rounds of training, the preset threshold value in each round of training may be the same or different, and when the preset threshold value in each round of training is different, the preset threshold value in the next round may be smaller than the preset threshold value in the previous round.
In step S802, the corresponding data is determined as noise data.
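The threshold comparison in steps S801 and S802 can be sketched as below. The geometric decay used to make the threshold smaller in later rounds and the "mean plus k standard deviations" rule for deriving a threshold from the average distances are assumptions for illustration only; the embodiment states merely that the threshold may be preset or derived, and that a later round's threshold may be smaller.

```python
import statistics
from typing import List


def derived_threshold(avg_distances: List[float], k: float = 1.0) -> float:
    """Assumed rule: mean plus k population standard deviations."""
    return statistics.mean(avg_distances) + k * statistics.pstdev(avg_distances)


def threshold_for_round(initial: float, decay: float, round_index: int) -> float:
    """Assumed schedule: the threshold shrinks geometrically with each round."""
    return initial * decay ** round_index


def find_noise(avg_distances: List[float], threshold: float) -> List[int]:
    """Return the indices of data whose average distance exceeds the threshold."""
    return [i for i, d in enumerate(avg_distances) if d > threshold]


avg_distances = [1.0, 1.2, 0.9, 1.1, 6.0]

# Round 0 uses a loose threshold, round 1 a tighter one.
noise_round_0 = find_noise(avg_distances, threshold_for_round(8.0, 0.5, 0))
noise_round_1 = find_noise(avg_distances, threshold_for_round(8.0, 0.5, 1))

# Alternatively, derive the threshold from the distances themselves.
noise_derived = find_noise(avg_distances, derived_threshold(avg_distances))
```

A shrinking threshold matches the idea of removing simple noise first and difficult noise in later rounds.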
Referring to fig. 8, in some embodiments, after step S405, the method further includes the following steps: step S803, an output result of the trained network model is obtained.
Here, each time the network model is trained, the network model outputs an output result.
Step S804, inputting the output result into a preset loss model to obtain a loss result.
Here, the preset loss model is used to compare the output result with a preset real result (i.e. a label of each data in the data set to be denoised), so as to obtain a loss result.
In the embodiment of the application, the preset loss model comprises a loss function, and the similarity between the output result and the preset real result can be calculated through the loss function to obtain the loss result.
Step S805, determining whether the trained network model meets the model convergence condition according to the loss result.
Here, when the loss result indicates that the difference between the output result and the preset real result is large, the trained network model does not meet the model convergence condition; and when the loss result shows that the difference between the output result and the preset real result is smaller, the trained network model meets the model convergence condition.
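The loss-based convergence check of steps S803 to S805 might look like the following sketch. Cross-entropy is used here as an assumed loss function, and `max_loss` is a hypothetical tolerance; the embodiment leaves both the loss function and the convergence condition to the user.

```python
import math
from typing import List, Sequence


def cross_entropy(probabilities: Sequence[float], label_index: int) -> float:
    """Cross-entropy of one predicted distribution against its label."""
    return -math.log(max(probabilities[label_index], 1e-12))


def mean_loss(outputs: List[Sequence[float]], labels: List[int]) -> float:
    """Average loss over all outputs of the trained network model."""
    return sum(cross_entropy(p, y) for p, y in zip(outputs, labels)) / len(labels)


def meets_convergence_condition(loss: float, max_loss: float = 0.5) -> bool:
    """Assumed condition: the loss must not exceed a preset tolerance."""
    return loss <= max_loss


labels = [0, 0]

# Outputs that agree with the labels give a small loss.
clean_outputs = [[0.9, 0.1], [0.8, 0.2]]
clean_ok = meets_convergence_condition(mean_loss(clean_outputs, labels))

# One output contradicts its label, as a mislabeled sample would cause.
noisy_outputs = [[0.9, 0.1], [0.1, 0.9]]
noisy_ok = meets_convergence_condition(mean_loss(noisy_outputs, labels))
```

Other conditions, such as requiring the loss to stop changing between epochs, would fit the same interface.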
Based on fig. 4, fig. 9 is an optional flowchart of the data denoising method provided in the embodiment of the present application, and as shown in fig. 9, after step S405, when a trained network model obtained by training with data in an updated data set does not satisfy a model convergence condition, the method further includes the following steps:
step S901, determining a third average distance between each noise data removed in the previous round of data denoising process and the data in the updated data set.
In step S902, when the third average distance of any noise data is smaller than a preset threshold, and the number of times that data has been determined to be noise data and removed is smaller than or equal to a number-of-times threshold, the corresponding noise data is determined to be recall data.
Here, when any data is determined to be noise data and the number of times the data is removed is greater than the number-of-times threshold, indicating that the probability that the data is noise data is large, the recall processing may not be performed on the data in order to reduce the amount of calculation.
Step S903, adding the recall data into the update data set to form an update data set with the recall data.
Step S904, training the trained network model again with the updated data set containing the recall data, until the trained network model meets the model convergence condition.
According to the embodiment of the application, after each round of training, the noise data removed in the previous round of data denoising is recalled and judged again. Because the current model has already been trained and is therefore more accurate, the noise data can be judged more precisely, which prevents normal data from being misjudged as noise data and deleted, and also avoids the possibility of some difficult noise data being deleted by mistake.
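The recall step can be illustrated with the short sketch below. The dictionary layout, the identifiers, and the function name `select_recall_data` are hypothetical, and Euclidean distance is again an assumed stand-in for the distance function.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def select_recall_data(
    removed: Dict[str, Vector],
    updated_set: List[Vector],
    removal_counts: Dict[str, int],
    distance_threshold: float,
    count_threshold: int,
) -> List[str]:
    """Return ids of previously removed data that should be recalled."""
    recalled = []
    for item_id, vector in removed.items():
        # Data removed too many times is very likely noise: skip it.
        if removal_counts.get(item_id, 0) > count_threshold:
            continue
        # Third average distance: removed item versus the updated data set.
        third_avg = sum(math.dist(vector, x) for x in updated_set) / len(updated_set)
        if third_avg < distance_threshold:
            recalled.append(item_id)
    return recalled


updated_set = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
removed = {"a": (1.0, 1.0), "b": (10.0, 10.0), "c": (0.5, 0.5)}
removal_counts = {"a": 1, "b": 1, "c": 5}

recall_ids = select_recall_data(
    removed, updated_set, removal_counts,
    distance_threshold=2.0, count_threshold=2,
)
```

Here item "a" is close to the updated data set and is recalled, item "b" is too far away, and item "c" is skipped because it has already been removed more often than the number-of-times threshold allows.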
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The embodiment of the application provides a data denoising method, which is an online denoising method coupling denoising and training, namely, the noise is removed simultaneously in the process of learning and training a neural network model, so that the noise can be removed conveniently when each round or multiple rounds of neural network training are finished.
In the noise removal stage of the embodiment of the application, the distances between samples within each class are first calculated; abnormal data whose distance is too large is then eliminated; and inter-sample distances are also calculated for the noise data removed in the previous noise removal stage, so that removed data with small distances is recalled. Through multiple rounds of noise removal, noise can be removed step by step from simple to difficult, and normal data that was removed previously can be recalled. As a result, a deep learning algorithm model with higher precision than that obtained with a traditional denoising algorithm can be trained, while the large amount of manpower that manual denoising requires is avoided. In addition, the whole denoising code can be embedded into the model training process, so no additional human-computer interaction is needed during denoising, which is convenient for users.
The embodiment of the application provides a method for simultaneously removing noise in the process of learning and training a neural network model, and the method can be used for training an algorithm model for similar advertisement material retrieval in an advertisement recommendation scene.
The specific application scenario can be described as follows: advertisement data such as pictures, texts, and videos of a large number of advertisements is collected from an advertisement material library, and the online noise elimination method provided by the embodiment of the application is used when separately training a picture feature extraction network (such as a convolutional neural network), a text feature extraction network (such as a recurrent neural network), and a video feature extraction network (such as a three-dimensional convolutional neural network). The optimized model obtained by training can then be used, for example, to extract features from the advertisement material library and construct an advertisement material picture/text/video feature library; when new advertisement material is received, features are extracted with the trained optimized model and compared with the feature library to recall similar advertisements.
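The retrieval flow described above (build a feature library, then compare a new material's feature against it) might look like the following sketch. The extractor is a hypothetical placeholder for the trained network, and ranking by Euclidean distance on L2-normalized features is an assumption; the text does not specify how features are compared.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return tuple(v / norm for v in vec) if norm > 0 else tuple(vec)

def build_feature_library(materials, extract):
    """materials: dict material_id -> raw material; extract: feature extractor callable."""
    return {mid: l2_normalize(extract(m)) for mid, m in materials.items()}

def recall_similar(query_feature, library, top_k=2):
    """Return the ids of the top_k library entries closest to the query feature."""
    q = l2_normalize(query_feature)
    ranked = sorted(library, key=lambda mid: math.dist(q, library[mid]))
    return ranked[:top_k]

def toy_extract(material):
    # Hypothetical stand-in for the trained feature extraction network.
    return material

materials = {"ad1": (1.0, 0.0), "ad2": (0.9, 0.1), "ad3": (0.0, 1.0)}
library = build_feature_library(materials, toy_extract)
similar = recall_similar((1.0, 0.05), library, top_k=2)
```

For a large material library, an approximate nearest-neighbour index would normally replace the full sort used here.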
The embodiment of the application only takes the training of the advertisement material retrieval algorithm model in the advertisement recommendation scene as an example, and explains the main principle and the implementation mode of the application. However, it should be emphasized that the usage scenario of the embodiment of the present application is not limited to the advertisement recommendation, and deep learning algorithm model training in the field scenarios such as computer vision/natural language processing/reinforcement learning and the like also belongs to the protection scope of the present application.
The embodiment of the application provides an online noise elimination method for deep learning network training. Based on multiple rounds of sample denoising and recalling, noise-containing data (namely a data set to be denoised) are gradually cleaned from easy to difficult, and finally, noise (namely noise data) is effectively cleaned while a large amount of manpower is avoided, so that a deep learning algorithm model with better effect precision can be obtained by training and learning. In addition, the embodiment of the application can carry out online denoising by embedding the realized denoising code into the model training step, does not need an additional manual interaction step, and is convenient for users to use. The main flow of the method is shown in fig. 10, and comprises the following steps:
Step S1001, collecting training sample data.
Here, training sample data is collected according to the relevant application scenario, and the training sample data is noisy data. For example, in an e-commerce advertisement picture retrieval scenario, the training sample data consists of commodity pictures of each commodity category, and each commodity category corresponds to one class in the training sample data; in text or video advertisement retrieval scenarios, the training sample data is text or video.
Step S1002, learning and training parameters of the deep learning algorithm model.
Here, the deep learning algorithm model is the algorithm model that the deep learning network trains with the online noise elimination method.
As shown in fig. 11, in the learning framework of a general neural network model, a training sample 1101 is input into a neural network backbone 1102, which extracts features of the training sample 1101 to obtain its feature expression 1103; finally, a loss calculation 1104 drives the gradient update of the neural network model until the model converges.
In this embodiment, the neural network backbone 1102 is chosen according to the type of features to be extracted: for example, it may be a convolutional neural network when extracting picture features, a recurrent neural network when extracting text features, and a 3D convolutional network when extracting video features. The neural network model further includes a feature expression layer for determining the feature expression 1103 of the training sample 1101; the feature expression layer is generally a fully connected (FC) layer that calculates the depth feature of the sample. The loss calculation is used for the gradient update of the model parameters and also for judging whether the algorithm model has converged. Different tasks generally use different loss functions, such as the multi-class Softmax loss commonly used in classification tasks and the triplet loss commonly used in metric learning tasks.
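The two loss functions named above can be written out for a single sample as a rough sketch. The forms below are the standard textbook definitions; the margin value and the use of Euclidean distance in the triplet loss are assumptions, since the text gives no specifics.

```python
import math

def softmax_cross_entropy(logits, label):
    """Multi-class Softmax (cross-entropy) loss for one sample.

    Uses the log-sum-exp trick for numerical stability.
    """
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: pull the positive closer to the anchor than the negative by a margin."""
    gap = math.dist(anchor, positive) - math.dist(anchor, negative) + margin
    return max(0.0, gap)

# Two equally likely classes give a loss of ln(2).
ce = softmax_cross_entropy([0.0, 0.0], 0)
# Positive already much closer than the negative: the loss is zero.
easy = triplet_loss((0.0, 0.0), (0.0, 0.1), (1.0, 0.0))
# Positive farther than the negative: the loss is positive.
hard = triplet_loss((0.0, 0.0), (1.0, 0.0), (0.0, 0.1))
```

In an actual training framework these would be computed on batches with automatic differentiation; the scalar versions only show what each loss measures.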
In the embodiment of the application, a training sample is input, neural network model parameter training is performed, and loss convergence judgment is performed after each round or N rounds (N can be a value set by a user and used for accelerating training) of training are finished. And if the loss is converged, outputting the algorithm model, and otherwise, entering the next step.
Step S1003, judging whether the deep learning algorithm model has converged.
If the determination result is yes, step S1004 is executed; if the determination result is no, step S1005 is executed.
Step S1004, outputting the deep learning algorithm model.
Step S1005, extracting feature expressions of the normal data and of the noise data from the previous round.
Here, the deep learning algorithm model from step S1002 performs a forward pass of the model network on the normal data and on the previous round's noise data to obtain the feature vectors (i.e., feature expressions) of the feature expression layer. The feature vectors are generally normalized to facilitate the subsequent calculation of distances between features.
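One way to see why normalization helps the later distance step, assuming the normalization is L2 normalization and the distance is Euclidean (the text names neither explicitly at this point): for unit-length feature vectors the squared distance depends only on the angle between them.

```latex
\|x_i - x_j\|_2^2
  = \|x_i\|_2^2 + \|x_j\|_2^2 - 2\,x_i^{\top} x_j
  = 2 - 2\cos\theta_{ij},
\qquad
0 \le \|x_i - x_j\|_2 \le 2 .
```

Under these assumptions every distance lies in the interval [0, 2] regardless of the raw feature magnitude, so a fixed user-set distance threshold stays comparable from one training round to the next.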
In step S1006, the sample average distance (i.e., the first average distance) of each type of data is calculated.
Here, the sample average distance of each sample feature within each class of data is calculated. This step supports the noise data judgment and removal in the next step; the average distance between the noise data cleared in the previous round and the sample data of the same class is also calculated in this step, and is used for recalling data that was cleared by mistake in the previous round. Assuming that step S1005 extracts the features of the data in a certain class, expressed as x1, x2, …, xN, where N is the total number of samples in the class, the average distance between the i-th sample and the samples in the class can be represented by the following formula (2-1):
$$\bar{d}_i = \frac{1}{N}\sum_{j=1}^{N} d(x_i, x_j) \qquad (2\text{-}1)$$
where $x_i$ denotes the feature vector of the i-th data among the N data extracted in the current round; $x_j$ denotes the feature vector of the j-th data among the N data extracted in the current round; $d(\cdot,\cdot)$ denotes a distance function, which may be, for example, the Euclidean distance; and $\bar{d}_i$ denotes the average distance between $x_i$ and the samples $x_j$. In general, the average distance between noise data and normal data is large, while the average distance among normal data is small; simple noise data lies far from normal data on average, whereas difficult noise data lies close to it.
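The per-sample average distance within one class can be computed as in the sketch below. It assumes Euclidean distance and assumes the average is taken over all N samples of the class, with the zero self-distance included; the function name and the toy vectors are illustrative only.

```python
import math

def average_distances(features):
    """Return, for each feature vector, its average distance to the N vectors of the class.

    The sum includes the zero self-distance and is divided by N (an assumption
    about the normalization; dividing by N - 1 would exclude the sample itself).
    """
    n = len(features)
    return [sum(math.dist(xi, xj) for xj in features) / n for xi in features]

# Hypothetical class with two nearby samples and one distant sample.
features = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
avg = average_distances(features)
```

This direct version costs O(N^2) distance evaluations per class; for large classes a vectorized or blocked computation would be the usual choice.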
Step S1007, determining whether the sample average distance is greater than the threshold T.
If the judgment result is yes, executing step S1008; if the determination result is no, step S1009 is executed.
Here, noise data whose sample average distance is too large is cleared. Sample data whose average distance to the other data is less than the threshold T is considered normal data and is retained for updating and optimizing the model in the next round; samples whose average distance is above the threshold T are considered noise data. The threshold T is set by the user and may be a Euclidean distance threshold; for example, T is set to 0.4 in the e-commerce advertisement picture retrieval scenario. Generally, the model obtained in the first few rounds of training has only moderate precision, so only simple noise can be eliminated according to the threshold. As noise data is gradually removed, the model precision increases, the distance between noise and normal data gradually grows, and the distance among normal data gradually shrinks, so difficult noise data can be removed step by step. Fig. 12 compares the sample data distribution across multiple training rounds: the left diagram shows the distribution of a certain class of sample data before multiple rounds of training, and the right diagram shows it after.
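The threshold judgment itself is a simple split, sketched here with the example value T = 0.4 from the text. Retaining a sample whose average distance exactly equals T is an assumption that follows the "larger than the threshold means noise" reading; the function name and the input values are hypothetical.

```python
def split_by_threshold(avg_distances, threshold=0.4):
    """Return (normal_indices, noise_indices) for a list of per-sample average distances."""
    normal = [i for i, d in enumerate(avg_distances) if d <= threshold]
    noise = [i for i, d in enumerate(avg_distances) if d > threshold]
    return normal, noise

# Hypothetical average distances for four samples of one class.
normal, noise = split_by_threshold([0.1, 0.55, 0.4, 0.39])
```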
In some embodiments, the distance threshold judgment is also applied to the noise data cleared in the previous round, using features extracted by the model optimized in the current round. Normal data that was deleted by mistake because of the limited precision of the previous round's model can thus be recalled in this round and used to train the next round's model. In this way, normal sample data is retained as far as possible, which safeguards the precision of the algorithm model.
Step S1008, clearing the noise data.
Step S1009, retaining the normal data.
Using the normal data retained in step S1009, steps S1002 to S1009 are repeated until the model converges.
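The loop over steps S1002 to S1009 can be sketched end to end for a single class as below. Everything passed in (`train_round`, `extract`, `converged`) is a hypothetical callable standing in for the real training round, feature extraction, and convergence judgment, and the toy stubs exist only so the sketch runs; the threshold of 0.8 suits the unnormalized toy vectors and is not a recommended value.

```python
import math

def mean_dist(vec, others):
    """Average Euclidean distance from vec to each vector in others."""
    return sum(math.dist(vec, o) for o in others) / len(others)

def denoise_while_training(samples, train_round, extract, converged,
                           threshold=0.4, max_rounds=10):
    """samples: dict id -> raw sample of ONE class. Returns (model, kept, removed)."""
    kept, removed, model = dict(samples), {}, None
    for _ in range(max_rounds):
        model, loss = train_round(model, kept)            # S1002: train on retained data
        if converged(loss):                               # S1003 -> S1004: stop and output
            break
        # S1005: features of retained data, used as the reference set this round.
        kept_feats = [extract(model, s) for s in kept.values()]
        # Previously removed data is re-examined together with the retained data.
        candidates = {**kept, **removed}
        kept, removed = {}, {}
        for key, sample in candidates.items():            # S1006 to S1009
            d = mean_dist(extract(model, sample), kept_feats)
            (removed if d > threshold else kept)[key] = sample
        if not kept:                                      # nothing left to train on
            break
    return model, kept, removed

def toy_train_round(model, kept):
    # Hypothetical stub: the "loss" is the mean pairwise spread of the retained samples.
    vals = list(kept.values())
    return "toy-model", sum(mean_dist(v, vals) for v in vals) / len(vals)

def toy_extract(model, sample):
    # Hypothetical stub: the raw sample doubles as its feature vector.
    return sample

samples = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (0.0, 0.1),
           "d": (0.1, 0.1), "n": (2.0, 2.0)}
model, kept, removed = denoise_while_training(
    samples, toy_train_round, toy_extract,
    converged=lambda loss: loss < 0.2, threshold=0.8)
```

In the toy run the outlier "n" is removed after the first round, and the second round converges on the four remaining samples.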
The denoising method provided by the embodiment of the application can be conveniently realized by using codes and is coupled and embedded into deep learning neural network training processes of various application scenes. The online denoising method provided by the embodiment of the application does not need additional human-computer interaction, so that a user can use the online denoising method more quickly and conveniently.
Compared with manual denoising, which can take weeks or even months, the method provided by the embodiment of the application saves a large amount of labor and time, and facilitates rapid deployment and iteration of the algorithm in actual engineering services. Compared with traditional denoising methods, the method can eliminate not only simple noise samples but also difficult ones, and normal data eliminated by mistake earlier can be recalled in subsequent denoising rounds, so the final algorithm model performs better. In addition, the online denoising method that couples denoising with training requires no additional human-computer interaction and is more convenient and faster to use.
The following continues with an exemplary structure of the data denoising apparatus 354 provided in this embodiment of the present application, implemented as software modules. In some embodiments, as shown in fig. 3, the software modules of the data denoising apparatus 354 stored in the memory 350 may be a data denoising apparatus in the server 300, including:
the first training module 3541 is used for training a preset algorithm model by using data in a data set to be denoised to obtain a trained algorithm model;
a first determining module 3542, configured to determine the trained algorithm model as a network model to be trained and determine a first average distance between each data in the data set to be denoised and other data in the data set to be denoised when the trained algorithm model does not satisfy a model convergence condition;
a second determining module 3543, configured to determine noise data in the data set to be denoised according to the first average distance;
a removing module 3544, configured to remove the noise data in the data set to be denoised, so as to obtain an updated data set;
a training module 3545, configured to input data in the updated data set as sample data into a network model to be trained, and train the network model to be trained to obtain a trained network model;
a third determining module 3546, configured to determine the updated data set as a de-noised data set when the trained network model obtained by training with the data in the updated data set satisfies a model convergence condition.
In some embodiments, the first determining module is further configured to: extracting the characteristics of each data in the data set to be denoised to obtain the characteristic vector of each data; and calculating a first average distance between each data in the data set to be denoised and other data in the data set to be denoised according to the feature vector of each data.
In some embodiments, the first determining module is further configured to: acquiring the feature vector of each data and the total amount of data in the data set to be denoised; calculating the distance between each data and each other data through the feature vector of each data and the feature vectors of other data; and determining the first average distance of each data according to the distance between each data and each other data and the total amount of data in the data set to be denoised.
In some embodiments, the apparatus further comprises: a fourth determining module, configured to determine the updated data set as a current data set to be denoised when the trained network model obtained by training with the data in the updated data set does not satisfy the model convergence condition; a fifth determining module, configured to determine a second average distance between each data in the current data set to be denoised and other data in the current data set to be denoised; a sixth determining module, configured to determine, according to the second average distance, noise data in the current data set to be denoised; a second removing module, configured to remove the noise data in the current data set to be denoised to obtain a current updated data set; and a cyclic training module, configured to repeat the step of obtaining the current updated data set until the trained network model obtained by training with the data in the current updated data set meets the model convergence condition, and to determine the current updated data set as the de-noised data set.
In some embodiments, the cyclic training module is further configured to: after the noise data in the data set to be denoised is determined each time, remove the noise data in the data set to be denoised to obtain the current updated data set.
In some embodiments, the apparatus further comprises: the classification module is used for classifying the data in the data set to be denoised after the data denoising request is received to obtain at least one data class to be denoised; and the noise removing module is used for removing noise of each data class to be denoised so as to obtain a denoised data set corresponding to each data class to be denoised.
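Class-wise denoising, as performed by the classification module and the noise removing module, can be sketched as a group-then-apply pattern. The record layout, the function names, and the one-dimensional toy values are assumptions; the per-class rule shown reuses the average-distance idea with an arbitrary threshold.

```python
from collections import defaultdict

def denoise_by_class(records, denoise_one_class):
    """records: iterable of (class_label, item). Returns dict label -> denoised items."""
    groups = defaultdict(list)
    for label, item in records:
        groups[label].append(item)
    return {label: denoise_one_class(items) for label, items in groups.items()}

def drop_by_average_distance(values, threshold=3.0):
    """Keep 1-D values whose average absolute distance to the class is within threshold."""
    n = len(values)
    return [v for v in values if sum(abs(v - o) for o in values) / n <= threshold]

# Hypothetical records: one scalar "feature" per item, labeled by commodity category.
records = [("shoes", 1.0), ("shoes", 1.2), ("shoes", 9.0),
           ("bags", 5.0), ("bags", 5.1)]
cleaned = denoise_by_class(records, drop_by_average_distance)
```

Because each class is processed independently, an item is only ever compared with items of its own class.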
In some embodiments, the second determination module is further configured to: and when the first average distance of any data is larger than a preset threshold value, determining the corresponding data as the noise data.
In some embodiments, the apparatus further comprises: the first input module is used for inputting the data in the updated data set into a network model to be trained as sample data, training the network model to be trained to obtain a trained network model, and then acquiring an output result of the trained network model; the second input module is used for inputting the output result into a preset loss model to obtain a loss result; and the judging module is used for determining whether the trained network model meets the model convergence condition according to the loss result.
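The text does not state how the judging module decides convergence from the loss result. One common heuristic, shown here purely as an assumption, is to declare convergence once the loss has changed by less than a tolerance over several consecutive rounds; the function name, tolerance, and window are hypothetical.

```python
def has_converged(loss_history, tolerance=1e-3, window=3):
    """Return True if the last `window` round-to-round loss changes are all below tolerance."""
    if len(loss_history) < window + 1:
        return False  # not enough rounds recorded to judge
    recent = loss_history[-(window + 1):]
    return all(abs(recent[i] - recent[i + 1]) < tolerance for i in range(window))

too_short = has_converged([1.0, 0.5, 0.3])
plateaued = has_converged([1.0, 0.2, 0.2001, 0.2002, 0.2001])
still_falling = has_converged([1.0, 0.8, 0.6, 0.4])
```

A fixed loss threshold or a validation metric would serve equally well; the choice depends on the task and the loss function in use.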
In some embodiments, the apparatus further comprises: a seventh determining module, configured to determine a third average distance between each noise data removed in the previous round of data denoising and the data in the update data set; an eighth determining module, configured to determine, when the third average distance of any noise data is smaller than a preset threshold, corresponding noise data as recall data; the adding module is used for adding the recall data into the update data set to form an update data set with the recall data; and the retraining module is used for retraining the trained network model by adopting the updated data set with the recall data until the trained network model meets the model convergence condition.
In some embodiments, the apparatus further comprises: a control module, configured to skip recalling any data that has been determined to be noise data and removed more times than a number-of-times threshold.
It should be noted that the description of the apparatus in the embodiment of the present application is similar to the description of the method embodiment, and has similar beneficial effects to the method embodiment, and therefore, the description is not repeated. For technical details not disclosed in the embodiments of the apparatus, reference is made to the description of the embodiments of the method of the present application for understanding.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method of the embodiment of the present application.
Embodiments of the present application provide a storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform a method provided by embodiments of the present application, for example, the method as illustrated in fig. 4.
In some embodiments, the storage medium may be a computer-readable storage medium, such as a Ferroelectric Random Access Memory (FRAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a Compact Disc Read Only Memory (CD-ROM), among other memories; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). By way of example, executable instructions may be deployed to be executed on one computing device, or on multiple computing devices at one site, or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (13)

1. A method for denoising data, comprising:
training a preset algorithm model by adopting data in a data set to be denoised to obtain a trained algorithm model;
when the trained algorithm model does not meet the model convergence condition, determining the trained algorithm model as a network model to be trained, and determining a first average distance between each data in the data set to be denoised and other data in the data set to be denoised;
determining noise data in the data set to be denoised according to the first average distance;
removing the noise data in the data set to be denoised to obtain an updated data set;
inputting the data in the updated data set into a network model to be trained as sample data, and training the network model to be trained to obtain a trained network model;
and when the trained network model obtained by training the data in the updated data set meets the model convergence condition, determining the updated data set as a de-noised data set.
2. The method of claim 1, wherein the determining a first average distance between each data in the set of data to be denoised and other data in the set of data to be denoised comprises:
extracting the characteristics of each data in the data set to be denoised to obtain the characteristic vector of each data;
and calculating a first average distance between each data in the data set to be denoised and other data in the data set to be denoised according to the feature vector of each data.
3. The method of claim 2, wherein the calculating a first average distance between each data in the set of data to be denoised and other data in the set of data to be denoised according to the feature vector of each data comprises:
acquiring the feature vector of each data and the total amount of data in the data set to be denoised;
calculating the distance between each data and each other data through the feature vector of each data and the feature vectors of other data;
and determining the first average distance of each data according to the distance between each data and each other data and the total amount of data in the data set to be denoised.
4. The method of claim 1, further comprising:
when the trained network model obtained by training the data in the updated data set does not meet the model convergence condition, determining the updated data set as a current data set to be denoised;
determining a second average distance between each data in the current data set to be denoised and other data in the current data set to be denoised;
determining noise data in the current data set to be denoised according to the second average distance;
removing the noise data in the current data set to be denoised to obtain a current update data set;
and circulating the step of obtaining the current updating data set until the trained network model obtained by training the data in the current updating data set meets the model convergence condition, and determining the current updating data set as the de-noised data set.
5. The method of claim 4, wherein said cycling through the steps of obtaining the current set of updated data comprises:
and after the noise data in the data set to be denoised is determined each time, removing the noise data in the data set to be denoised to obtain the current updated data set.
6. The method of claim 1, further comprising:
after a data denoising request is received, classifying data in the data set to be denoised to obtain at least one data class to be denoised;
and carrying out noise removal on each class of data to be denoised to obtain a denoised data set corresponding to each class of data to be denoised.
7. The method of claim 1, wherein the determining noise data in the set of data to be denoised according to the first average distance comprises:
and when the first average distance of any data is larger than a preset threshold value, determining the corresponding data as the noise data.
8. The method of claim 1, further comprising:
inputting data in the updated data set into a network model to be trained as sample data, training the network model to be trained to obtain a trained network model, and then obtaining an output result of the trained network model;
inputting the output result into a preset loss model to obtain a loss result;
and determining whether the trained network model meets the model convergence condition or not according to the loss result.
9. The method according to any one of claims 1 to 8, further comprising:
determining a third average distance between each noise data removed in a previous round of data denoising and the data in the update data set;
when the third average distance of any noise data is smaller than a preset threshold value, determining the corresponding noise data as recall data;
adding recall data to the update data set to form an update data set with the recall data;
and adopting the updated data set with the recall data to train the trained network model again until the trained network model meets the model convergence condition.
10. The method of claim 9, further comprising:
when any data is determined to be the noise data and the number of times of removal is greater than a number-of-times threshold, the data is not recalled.
11. A data denoising apparatus, comprising:
the first training module is used for training a preset algorithm model by adopting data in a data set to be denoised to obtain a trained algorithm model;
the first determining module is used for determining the trained algorithm model as a network model to be trained and determining a first average distance between each data in the data set to be denoised and other data in the data set to be denoised when the trained algorithm model does not meet a model convergence condition;
a second determining module, configured to determine noise data in the data set to be denoised according to the first average distance;
the removing module is used for removing the noise data in the data set to be denoised to obtain an updated data set;
the training module is used for inputting the data in the updated data set into a network model to be trained as sample data, and training the network model to be trained to obtain a trained network model;
and the third determining module is used for determining the updated data set as a de-noised data set when the trained network model obtained by training the data in the updated data set meets the model convergence condition.
12. A data denoising apparatus, comprising:
a memory for storing executable instructions; a processor for implementing the method of denoising data of any one of claims 1-10 when executing executable instructions stored in the memory.
13. A computer-readable storage medium having stored thereon executable instructions for causing a processor to implement the method of denoising data according to any one of claims 1 through 10 when the executable instructions are executed.
CN202011048598.3A 2020-09-29 2020-09-29 Data denoising method, device and equipment and computer readable storage medium Pending CN112115131A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011048598.3A CN112115131A (en) 2020-09-29 2020-09-29 Data denoising method, device and equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011048598.3A CN112115131A (en) 2020-09-29 2020-09-29 Data denoising method, device and equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN112115131A true CN112115131A (en) 2020-12-22

Family

ID=73796829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011048598.3A Pending CN112115131A (en) 2020-09-29 2020-09-29 Data denoising method, device and equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112115131A (en)

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40036281; Country of ref document: HK)
SE01 Entry into force of request for substantive examination