CN112182257A - Artificial intelligence data cleaning method based on neural network

Info

Publication number: CN112182257A
Application number: CN202010872303.8A
Authority: CN
Inventors: 胡程远
Original assignee: Hefei Sanen Information Technology Co ltd
Current assignee: Hefei Sanen Information Technology Co ltd
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2021-01-05

Abstract

The invention relates to an artificial intelligence data cleaning method based on a neural network, comprising a fitting and classifying module connected to the data to be cleaned. The classifying module comprises a basic classification and an industry field classification. The basic classification comprises images, voice, text, and video; the industry field classification is then applied on top of the basic classification and comprises fields such as finance, medical treatment, safety, multimedia, laws and regulations, vehicle-mounted, and communication. A small amount of the data to be cleaned is then manually labeled in preparation for cleaning, and an initial model is trained; the initial model establishes a corresponding function curve according to the fitting and classifying module, and the trained initial model is then used to predict the remaining uncleaned data. Advantages: the method further enriches the feature combinations of the training set and improves the generalization ability of the model. The richer the data features in the model's learning process, the stronger the generalization ability and the wider the range of scenarios to which the prediction model applies.

Description

Artificial intelligence data cleaning method based on neural network

Technical Field

The invention relates to the technical field of data cleaning processing, in particular to an artificial intelligence data cleaning method based on a neural network.

Background

Data cleaning refers to finding recognizable errors in data files, deleting duplicate information, and correcting existing errors so as to ensure data consistency. In the field of artificial intelligence it is important to keep data "clean": if wrong or invalid data is input, the output is degraded and errors are produced. Traditional data cleaning methods include checking data consistency, processing invalid values, and identifying data conflicts, and the whole process involves repeated review, verification, and labeling. In most data and artificial intelligence deployment projects, data cleaning is one of the most costly tasks. Clean data is the basis of subsequent research and analysis, yet data cleaning is labor-intensive and extremely expensive, and technicians spend a large amount of time cleaning a data set. It is no exaggeration to say that data cleaning can occupy 80% of a project's algorithm working time, while the time actually spent analyzing data and training models accounts for only about 20%. This heavy consumption of manpower hinders the popularization and application of data cleaning systems.

Disclosure of Invention

The invention aims to provide an artificial intelligence data cleaning method based on a neural network that changes the decision boundary of the model, improves classification accuracy, and improves the generalization ability of the model. The richer the data features in the model learning process, the stronger the generalization ability and the wider the range of scenarios to which the prediction model applies. This aim is achieved by the following scheme.

To achieve the above purpose, the invention adopts the following technical scheme: an artificial intelligence data cleaning method based on a neural network, characterized by comprising a fitting and classifying module connected to the data to be cleaned, wherein the classifying module comprises a basic classification and an industry field classification; the basic classification comprises images, voice, text, and video; and the industry field classification is further applied on top of the basic classification and comprises fields such as finance, medical treatment, safety, multimedia, laws and regulations, vehicle-mounted, and communication.

A small amount of the data to be cleaned is then manually labeled in preparation for cleaning, and an initial model is trained; the initial model establishes a corresponding function curve according to the fitting and classifying module. The trained initial model is then used to predict the remaining uncleaned data, and three data sets are established: {determined clean data, determined dirty data, uncertain data}, forming a deep learning model. A small amount of the uncertain data and the determined clean data is manually corrected and added to the training set, which is used to fine-tune the model again. When uncertain data is selected, the samples chosen are those close to the decision boundary obtained in the deep learning model's prediction. These steps are repeated so that the model is trained on iteratively updated data: in each iteration the model receives newly corrected data and is fine-tuned. This process directly changes the decision boundary of the model and improves classification accuracy.
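The iterative procedure described above resembles uncertainty-based active learning, and can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the thresholds `low` and `high`, the toy one-dimensional model (`fit`, `predict_proba`), the `ask_human` stand-in for manual correction, and all function names are assumptions introduced here. For brevity the sketch sends only uncertain samples to manual review, whereas the text also mentions correcting determined clean data.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def triage(probs: Dict[int, float], low: float = 0.1, high: float = 0.9) -> Dict[str, List[int]]:
    """Split sample indices into the three sets by predicted P(clean)."""
    sets: Dict[str, List[int]] = {"clean": [], "dirty": [], "uncertain": []}
    for i, p in probs.items():
        key = "clean" if p >= high else "dirty" if p <= low else "uncertain"
        sets[key].append(i)
    return sets


def nearest_to_boundary(probs: Dict[int, float], candidates: Sequence[int], k: int) -> List[int]:
    """Pick the k candidates whose prediction is closest to the 0.5 decision boundary."""
    return sorted(candidates, key=lambda i: abs(probs[i] - 0.5))[:k]


def cleaning_loop(pool: Sequence[float], seed_labels: Dict[int, int],
                  fit: Callable, predict_proba: Callable, ask_human: Callable,
                  rounds: int = 3, k: int = 2) -> Tuple[Dict[int, int], Dict[str, List[int]]]:
    """Train, predict the unlabeled rest, triage, review boundary samples, retrain."""
    labeled = dict(seed_labels)  # index -> 1 (clean) or 0 (dirty)
    sets: Dict[str, List[int]] = {"clean": [], "dirty": [], "uncertain": []}
    for _ in range(rounds):
        model = fit([(pool[i], y) for i, y in labeled.items()])
        probs = {i: predict_proba(model, x) for i, x in enumerate(pool) if i not in labeled}
        sets = triage(probs)
        picked = nearest_to_boundary(probs, sets["uncertain"], k)
        if not picked:
            break  # nothing left near the boundary
        for i in picked:
            labeled[i] = ask_human(pool[i])  # manual correction joins the training set
    return labeled, sets


# Toy stand-ins (hypothetical): a 1-D "model" placing a sigmoid between class means.
def fit(examples):
    clean = [x for x, y in examples if y == 1]
    dirty = [x for x, y in examples if y == 0]
    mean_clean, mean_dirty = sum(clean) / len(clean), sum(dirty) / len(dirty)
    return (mean_clean + mean_dirty) / 2, mean_clean - mean_dirty


def predict_proba(model, x):
    midpoint, gap = model
    return 1 / (1 + math.exp(-8 * (x - midpoint) / gap))


def ask_human(x):
    return 1 if x >= 0.5 else 0  # simulated annotator


pool = [0.0, 0.1, 0.2, 0.45, 0.5, 0.55, 0.8, 0.9, 1.0]
labeled, sets = cleaning_loop(pool, {0: 0, 8: 1}, fit, predict_proba, ask_human)
```

In this toy run only the samples near the boundary are sent for manual review; the remaining samples are settled by the retrained model, which is the labor saving the method aims for.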

Further, when the data to be cleaned has a plurality of features, the cleaned data set can be further refined into {{A-feature determined clean data, A-feature determined dirty data, A-feature uncertain data}, {B-feature determined clean data, B-feature determined dirty data, B-feature uncertain data} … {K-feature determined clean data, K-feature determined dirty data, K-feature uncertain data} …}. Through a combination strategy over the plurality of features, samples whose state is unknown to the deep learning model can be cleaned, and the generalization ability of the model is improved.
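The text does not specify the combination strategy. One possible reading, shown purely as an assumption, is to triage each feature separately and then combine conservatively: any dirty feature makes the sample dirty, all clean features make it clean, and anything else stays uncertain. The thresholds and function names below are hypothetical.

```python
from typing import Dict


def state_of(p_clean: float, low: float = 0.1, high: float = 0.9) -> str:
    """Map one feature's predicted P(clean) to one of the three states."""
    if p_clean >= high:
        return "clean"
    if p_clean <= low:
        return "dirty"
    return "uncertain"


def per_feature_states(probs: Dict[str, float]) -> Dict[str, str]:
    """Triage every feature (A, B, ..., K) of one sample independently."""
    return {name: state_of(p) for name, p in probs.items()}


def combine(states: Dict[str, str]) -> str:
    """Assumed strategy: dirty if any feature is dirty, clean only if all are clean."""
    values = set(states.values())
    if "dirty" in values:
        return "dirty"
    if values == {"clean"}:
        return "clean"
    return "uncertain"


# One sample with per-feature predictions for features A, B, and K.
sample = {"A": 0.97, "B": 0.95, "K": 0.40}
states = per_feature_states(sample)
overall = combine(states)
```

Here feature K alone keeps the sample in the uncertain set, so it would be a candidate for manual review rather than being accepted on the strength of features A and B.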

Furthermore, a neural network model connected to the cleaning data is provided at the decision boundary data output end. The data is learned through a deep neural network, the characteristic values of the labeled data are extracted, the characteristic values of different data are fitted and clustered, and a network model and its number of layers are designed to fit the characteristic values.
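The clustering step can be illustrated with a small sketch. The text does not name a clustering algorithm, so plain k-means is swapped in here as an assumption, and the feature vectors are made-up stand-ins for characteristic values that a deep neural network would extract.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))


def kmeans(points: Sequence[Vector], init_centroids: Sequence[Vector],
           iters: int = 10) -> Tuple[List[Vector], List[int]]:
    """Plain k-means with caller-supplied starting centroids (deterministic)."""
    centroids = list(init_centroids)
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, [nearest(p, centroids) for p in points]


# Hypothetical 2-D characteristic values for four labeled samples.
features = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, assignment = kmeans(features, [(0.0, 0.0), (10.0, 10.0)])
```

Samples that fall into the same cluster share similar characteristic values, which is one way to group data before deciding how large a network is needed to fit them.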

Furthermore, a training model connected to the neural network model is provided at the clean data output end, and the demodulated signal is processed by the training model so that the characteristic values in the clean data are restored accordingly, which facilitates fine-tuning and correction of subsequent data.

The technical effects of the invention are as follows. The model is obtained through iterative data training; in each iteration it receives newly corrected data and is fine-tuned. This process directly changes the decision boundary of the model and improves classification accuracy. Adding the data selected by diversity cleaning to the training set further enriches the feature combinations of the training set and improves the generalization ability of the model. The richer the data features in the model learning process, the stronger the generalization ability and the wider the range of scenarios to which the prediction model applies.

Drawings

FIG. 1 is a schematic diagram of the present invention.

Detailed Description

Referring to FIG. 1, the artificial intelligence data cleaning method based on a neural network comprises a fitting and classifying module connected to the data to be cleaned, wherein the classifying module comprises a basic classification and an industry field classification, and the basic classification comprises images, voice, text, and video. The industry field classification is further applied on top of the basic classification and comprises fields such as finance, medical treatment, safety, multimedia, laws and regulations, vehicle-mounted, and communication.
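The two-level classification can be represented as a simple pair of tags per sample. The sketch below is only one possible data structure; the class names, the `Sample` record, and the example file names are assumptions, while the category values are taken from the description.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class BasicType(Enum):
    IMAGE = "image"
    VOICE = "voice"
    TEXT = "text"
    VIDEO = "video"


class IndustryField(Enum):
    FINANCE = "finance"
    MEDICAL = "medical treatment"
    SAFETY = "safety"
    MULTIMEDIA = "multimedia"
    LAW = "laws and regulations"
    VEHICLE = "vehicle-mounted"
    COMMUNICATION = "communication"


@dataclass(frozen=True)
class Sample:
    sample_id: str
    basic_type: BasicType    # first-level (basic) classification
    industry: IndustryField  # second-level (industry field) classification


def group_by_class(samples: Iterable[Sample]) -> Dict[Tuple[BasicType, IndustryField], List[str]]:
    """Bucket sample ids by (basic type, industry field) before cleaning."""
    groups: Dict[Tuple[BasicType, IndustryField], List[str]] = defaultdict(list)
    for s in samples:
        groups[(s.basic_type, s.industry)].append(s.sample_id)
    return dict(groups)


# Hypothetical samples.
samples = [
    Sample("a.wav", BasicType.VOICE, IndustryField.MEDICAL),
    Sample("b.txt", BasicType.TEXT, IndustryField.FINANCE),
    Sample("c.wav", BasicType.VOICE, IndustryField.MEDICAL),
]
groups = group_by_class(samples)
```

Grouping in this way lets each (basic type, industry field) bucket be labeled and cleaned with a model suited to that kind of data.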

In a specific embodiment of the invention, when the data to be cleaned has a plurality of features, the cleaned data set can be further refined into {{A-feature determined clean data, A-feature determined dirty data, A-feature uncertain data}, {B-feature determined clean data, B-feature determined dirty data, B-feature uncertain data} … {K-feature determined clean data, K-feature determined dirty data, K-feature uncertain data} …}. Through a combination strategy over the plurality of features, samples whose state is unknown to the deep learning model can be cleaned, and the generalization ability of the model is improved.

In a specific embodiment of the invention, a neural network model connected to the cleaning data is provided at the decision boundary data output end. The data is learned through a deep neural network, the characteristic values of the labeled data are extracted, the characteristic values of different data are fitted and clustered, and a network model and its number of layers are designed to fit the characteristic values.

In a specific embodiment of the invention, a training model connected to the neural network model is provided at the clean data output end, and the demodulated signal is processed by the training model so that the characteristic values in the clean data are restored accordingly, which facilitates fine-tuning and correction of subsequent data.

In a specific embodiment of the invention, for a given environment such as a shopping mall, street block, vehicle, or living room, the data is cleaned by separating the noise of that environment from clean voice: data that does not meet requirements is screened out and clean data is kept. The data is then labeled with its content and with the start and end time points of speech. The data is learned through a deep neural network, the characteristic values of the labeled data are extracted, the characteristic values of different data are fitted and clustered, and a network model and its number of layers are designed to fit the characteristic values. The demodulated signal is processed by the training model to restore the speaker's voice.

In a specific embodiment of the invention, when medical staff are treating patients, the system separates the noise of non-medical personnel from the clean voice of the medical staff, filters out the noise, and retains the clean data of the medical staff. The data is labeled with its content and the corresponding speaker. The data is learned through a deep neural network and its characteristic values are extracted (that is, the voice of each speaker is different); the different characteristic values are fitted and clustered, and a network model and its number of layers are designed to fit the characteristic values. The demodulated signal is processed by the training model to restore the speaker's voice and the corresponding speaker.
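The labels described in the two speech embodiments above (content, speaking start and end time points, and the corresponding speaker) could be stored in a record like the following. This is a sketch only; the field names, the use of seconds as the time unit, and the example utterances are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class SpeechLabel:
    content: str                   # what was said
    start_s: float                 # speaking start time point, in seconds (assumed unit)
    end_s: float                   # speaking end time point, in seconds (assumed unit)
    speaker: Optional[str] = None  # who spoke, when the embodiment labels speakers

    def __post_init__(self) -> None:
        # Reject labels whose time span is empty or negative.
        if not 0 <= self.start_s < self.end_s:
            raise ValueError("need 0 <= start_s < end_s")

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def speech_by_speaker(labels: Sequence[SpeechLabel], speaker: str) -> List[SpeechLabel]:
    """Keep only the segments attributed to one speaker, e.g. the medical staff."""
    return [label for label in labels if label.speaker == speaker]


# Hypothetical labels for a short consultation recording.
labels = [
    SpeechLabel("Please take a deep breath.", 0.5, 2.0, "doctor"),
    SpeechLabel("(background conversation)", 2.0, 3.5, "other"),
    SpeechLabel("Breathe out slowly.", 3.5, 5.0, "doctor"),
]
doctor_segments = speech_by_speaker(labels, "doctor")
```

Keeping the time points alongside the speaker tag means the retained clean segments can be cut from the original recording without re-listening to it.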

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. An artificial intelligence data cleaning method based on a neural network, characterized by comprising a fitting and classifying module connected to the data to be cleaned, wherein the classifying module comprises a basic classification and an industry field classification, the basic classification comprises images, voice, text, and video, the industry field classification is further applied on top of the basic classification, and the industry field classification comprises fields such as finance, medical treatment, safety, multimedia, laws and regulations, vehicle-mounted, and communication;

2. The artificial intelligence data cleaning method based on a neural network as claimed in claim 1, wherein when the data to be cleaned has a plurality of features, the cleaned data set can be further refined into {{A-feature determined clean data, A-feature determined dirty data, A-feature uncertain data}, {B-feature determined clean data, B-feature determined dirty data, B-feature uncertain data} … {K-feature determined clean data, K-feature determined dirty data, K-feature uncertain data} …}, and through a combination strategy over the plurality of features, samples whose state is unknown to the deep learning model can be cleaned and the generalization ability of the model is improved.

3. The artificial intelligence data cleaning method based on a neural network as claimed in claim 1, wherein a neural network model connected to the cleaning data is provided at the decision boundary data output end, the data is learned through a deep neural network, the characteristic values of the labeled data are extracted, the characteristic values of different data are fitted and clustered, and a network model and its number of layers are designed to fit the characteristic values.

4. The artificial intelligence data cleaning method based on a neural network as claimed in claim 3, wherein a training model connected to the neural network model is provided at the clean data output end, and the demodulated signal is processed by the training model so that the characteristic values in the clean data are restored accordingly, which facilitates fine-tuning and correction of subsequent data.