CN112182257A - Artificial intelligence data cleaning method based on neural network - Google Patents
Artificial intelligence data cleaning method based on neural network
- Publication number
- CN112182257A (application number CN202010872303.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- model
- neural network
- classification
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/45—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/435—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a neural-network-based data cleaning method comprising a fitting and classifying module connected to the data to be cleaned. The classifying module comprises a basic classification and an industry-field classification: the basic classification comprises images, voice, text and video, and the industry-field classification, further performed on top of the basic classification, comprises fields such as finance, medical care, safety, multimedia, laws and regulations, vehicle-mounted systems and communication. A small amount of the data to be cleaned is then labeled manually in preparation for cleaning, and an initial model is trained; the initial model establishes a corresponding function curve according to the fitting and classifying module, and the trained initial model is then used to predict the remaining uncleaned data. Advantages: the method further enriches the feature combinations of the training set and improves the generalization ability of the model. The richer the data features seen during model learning, the stronger the generalization ability and the wider the range of scenarios to which the prediction model applies.
Description
Technical Field
The invention relates to the technical field of data cleaning, and in particular to an artificial intelligence data cleaning method based on a neural network.
Background
Data cleaning refers to finding recognizable errors in data files, deleting duplicate information and correcting existing errors so that the data are consistent. In the field of artificial intelligence it is important to keep data "clean": if wrong or invalid data are input, the quality of the output suffers and errors result. Traditional data cleaning includes checking data consistency, handling invalid values, identifying data conflicts and the like, and the whole process involves repeated review, verification and labeling. In most data and artificial intelligence deployment projects, data cleaning is one of the most costly tasks. Clean data is the basis of subsequent research and analysis, yet data cleaning is labor-intensive and extremely expensive, and technicians spend a large amount of time cleaning a data set. It is no exaggeration to say that data cleaning can occupy 80% of the algorithm working time of a project, leaving only about 20% for actually analyzing data and training models. The large amount of manpower consumed hinders the popularization and application of data cleaning systems.
Disclosure of Invention
The invention aims to provide an artificial intelligence data cleaning method based on a neural network that changes the decision boundary of the model, improves classification accuracy and improves the generalization ability of the model. The richer the data features seen during model learning, the stronger the generalization ability and the wider the range of scenarios to which the prediction model applies. This is achieved by the following scheme.
To achieve the above purpose, the invention adopts the following technical scheme. An artificial intelligence data cleaning method based on a neural network comprises a fitting and classifying module connected to the data to be cleaned. The classifying module comprises a basic classification and an industry-field classification: the basic classification comprises images, voice, text and video, and the industry-field classification, further performed on top of the basic classification, comprises fields such as finance, medical care, safety, multimedia, laws and regulations, vehicle-mounted systems and communication.
A small amount of the data to be cleaned is then labeled manually in preparation for cleaning, and an initial model is trained; the initial model establishes a corresponding function curve according to the fitting and classifying module. The trained initial model is used to predict the remaining uncleaned data, and three data sets {determined clean data, determined dirty data, uncertain data} are established, forming a deep learning model. A small amount of the uncertain data and of the determined clean data is manually corrected and added to the training set, and the model is fine-tuned again. The uncertain data selected for correction are the samples that lie close to the decision boundary in the predictions of the deep learning model. These steps are repeated so that the model is trained on iteratively updated data: in each iteration the model receives newly corrected data and is fine-tuned. This process directly changes the decision boundary of the model and improves classification accuracy.
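The three-way split and the choice of samples near the decision boundary can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the model outputs a probability that each sample is clean, and the thresholds `low` and `high`, the boundary value 0.5 and both function names are hypothetical, since the text does not specify how the three sets are delimited.

```python
from typing import Dict, List, Sequence


def partition_by_confidence(
    clean_probs: Sequence[float], low: float = 0.2, high: float = 0.8
) -> Dict[str, List[int]]:
    """Split sample indices into determined clean, determined dirty and uncertain.

    clean_probs[i] is the predicted probability that sample i is clean.
    The thresholds are illustrative assumptions, not values from the text.
    """
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("expected 0 <= low < high <= 1")
    sets: Dict[str, List[int]] = {"clean": [], "dirty": [], "uncertain": []}
    for i, p in enumerate(clean_probs):
        if p >= high:
            sets["clean"].append(i)
        elif p <= low:
            sets["dirty"].append(i)
        else:
            sets["uncertain"].append(i)
    return sets


def nearest_to_boundary(
    clean_probs: Sequence[float],
    candidates: Sequence[int],
    k: int,
    boundary: float = 0.5,
) -> List[int]:
    """Return up to k candidate indices whose probability is closest to the boundary."""
    ranked = sorted(candidates, key=lambda i: (abs(clean_probs[i] - boundary), i))
    return ranked[:k]


probs = [0.95, 0.05, 0.55, 0.30, 0.85, 0.48]
sets = partition_by_confidence(probs)
# Samples sent for manual correction: the uncertain ones nearest the boundary.
to_review = nearest_to_boundary(probs, sets["uncertain"], k=2)
```

Ranking by distance to the boundary is one common way to pick the samples whose correction is most likely to move the decision boundary; other uncertainty measures would fit the same description.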
Further, when the data to be cleaned have a plurality of features, the cleaned data set can be further refined into {{feature A determined clean data, feature A determined dirty data, feature A uncertain data}, {feature B determined clean data, feature B determined dirty data, feature B uncertain data}, …, {feature K determined clean data, feature K determined dirty data, feature K uncertain data}, …}. Through a combination strategy over the plurality of features, samples of unknown state can be cleaned under the deep learning model, and the generalization ability of the model is improved.
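The text does not say what the combination strategy is, so the sketch below shows only one plausible rule, stated here as an assumption: a sample is dirty if any feature is judged dirty, uncertain if no feature is dirty but at least one is uncertain, and clean otherwise. The function name and the status strings are hypothetical.

```python
from typing import Dict

CLEAN, DIRTY, UNCERTAIN = "clean", "dirty", "uncertain"


def combine_feature_statuses(statuses: Dict[str, str]) -> str:
    """Combine per-feature verdicts (feature name -> status) into one verdict.

    Assumed rule, not taken from the text: dirty dominates uncertain,
    and uncertain dominates clean.
    """
    if not statuses:
        raise ValueError("at least one feature verdict is required")
    values = set(statuses.values())
    unknown = values - {CLEAN, DIRTY, UNCERTAIN}
    if unknown:
        raise ValueError(f"unknown status values: {sorted(unknown)}")
    if DIRTY in values:
        return DIRTY
    if UNCERTAIN in values:
        return UNCERTAIN
    return CLEAN


# One sample judged separately on features A, B and K.
verdict = combine_feature_statuses({"A": CLEAN, "B": UNCERTAIN, "K": CLEAN})
```

A stricter or looser rule (for example a majority vote over features) would plug into the same interface.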
Furthermore, a neural network model connected to the cleaned data is provided at the decision boundary data output end. The data are learned by a deep neural network, the feature values of the labeled data are extracted, the feature values of different data are fitted and clustered, and the network model and the number of layers used to fit the feature values are designed.
Furthermore, a training model connected to the neural network model is provided at the clean data output end. The demodulated signal is processed by the training model so that the feature values in the clean data are correspondingly restored, which facilitates fine-tuning and correction of subsequent data.
The technical effects of the invention are as follows. The model is obtained through iterative data training: in each iteration the model receives newly corrected data and is fine-tuned, and this process directly changes the decision boundary of the model and improves classification accuracy. Adding the data selected by diversity cleaning to the training set further enriches the feature combinations of the training set and improves the generalization ability of the model. The richer the data features seen during model learning, the stronger the generalization ability and the wider the range of scenarios to which the prediction model applies.
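The iterative train, predict, correct and fine-tune cycle can be illustrated with a toy example. Everything here is an assumption made for the sake of a runnable sketch: a one-dimensional logistic regression stands in for the deep learning model, the `oracle` function stands in for manual correction, and the names `fit_logistic` and `clean_iteratively` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(
    xs: Sequence[float], ys: Sequence[int],
    w: float = 0.0, b: float = 0.0, epochs: int = 500, lr: float = 0.5,
) -> Tuple[float, float]:
    """Fit p(clean | x) = sigmoid(w * x + b) by batch gradient descent.

    Passing the previous (w, b) continues training from the current model,
    which plays the role of fine-tuning in this sketch.
    """
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


def clean_iteratively(
    pool: Sequence[float], seed_idx: Sequence[int],
    oracle: Callable[[float], int], rounds: int = 3, per_round: int = 2,
) -> Tuple[float, float, Dict[int, int]]:
    """Train on a small labeled seed, then repeatedly correct and fine-tune."""
    labeled: Dict[int, int] = {i: oracle(pool[i]) for i in seed_idx}
    w, b = fit_logistic([pool[i] for i in labeled], list(labeled.values()))
    for _ in range(rounds):
        unlabeled: List[int] = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Pick the samples whose prediction is closest to the decision boundary.
        unlabeled.sort(key=lambda i: abs(sigmoid(w * pool[i] + b) - 0.5))
        for i in unlabeled[:per_round]:
            labeled[i] = oracle(pool[i])  # simulated manual correction
        w, b = fit_logistic(
            [pool[i] for i in labeled], list(labeled.values()), w, b, epochs=200
        )
    return w, b, labeled


pool = [-3.0, -2.0, -1.0, -0.5, 0.0, 0.4, 0.6, 1.0, 2.0, 3.0]
oracle = lambda x: 1 if x > 0.5 else 0  # hypothetical ground truth: 1 means clean
w, b, labeled = clean_iteratively(pool, seed_idx=[0, 9], oracle=oracle)
```

Each round adds the corrected boundary samples to the training set before fine-tuning, so the fitted boundary moves toward the true one as more corrections arrive.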
Drawings
FIG. 1 is a schematic diagram of the present invention.
Detailed Description
Referring to FIG. 1, the artificial intelligence data cleaning method based on a neural network comprises a fitting and classifying module connected to the data to be cleaned. The classifying module comprises a basic classification and an industry-field classification: the basic classification comprises images, voice, text and video, and the industry-field classification, further performed on top of the basic classification, comprises fields such as finance, medical care, safety, multimedia, laws and regulations, vehicle-mounted systems and communication.
A small amount of the data to be cleaned is then labeled manually in preparation for cleaning, and an initial model is trained; the initial model establishes a corresponding function curve according to the fitting and classifying module. The trained initial model is used to predict the remaining uncleaned data, and three data sets {determined clean data, determined dirty data, uncertain data} are established, forming a deep learning model. A small amount of the uncertain data and of the determined clean data is manually corrected and added to the training set, and the model is fine-tuned again. The uncertain data selected for correction are the samples that lie close to the decision boundary in the predictions of the deep learning model. These steps are repeated so that the model is trained on iteratively updated data: in each iteration the model receives newly corrected data and is fine-tuned. This process directly changes the decision boundary of the model and improves classification accuracy.
In a specific embodiment of the invention, when the data to be cleaned have a plurality of features, the cleaned data set can be further refined into {{feature A determined clean data, feature A determined dirty data, feature A uncertain data}, {feature B determined clean data, feature B determined dirty data, feature B uncertain data}, …, {feature K determined clean data, feature K determined dirty data, feature K uncertain data}, …}. Through a combination strategy over the plurality of features, samples of unknown state can be cleaned under the deep learning model, and the generalization ability of the model is improved.
In a specific embodiment of the invention, a neural network model connected to the cleaned data is provided at the decision boundary data output end. The data are learned by a deep neural network, the feature values of the labeled data are extracted, the feature values of different data are fitted and clustered, and the network model and the number of layers used to fit the feature values are designed.
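As an illustration of clustering extracted feature values, the sketch below groups feature vectors with plain k-means. The text does not name a clustering algorithm, so k-means is a stand-in chosen here; the feature vectors are assumed to come from the network's feature extractor, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(
    points: Sequence[Vector], k: int, iterations: int = 20
) -> Tuple[List[int], List[List[float]]]:
    """Cluster feature vectors into k groups; returns (assignments, centroids).

    Centroids start at the first k points so that the result is deterministic.
    """
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical 2-D feature values for five samples.
features = [[0.1, 0.2], [5.0, 5.1], [0.0, 0.1], [5.2, 4.9], [0.2, 0.0]]
assignments, centroids = kmeans(features, k=2)
```

In practice the vectors would be higher-dimensional embeddings, and the number of clusters would have to be chosen for the data at hand.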
In a specific embodiment of the invention, a training model connected to the neural network model is provided at the clean data output end. The demodulated signal is processed by the training model so that the feature values in the clean data are correspondingly restored, which facilitates fine-tuning and correction of subsequent data.
In a specific embodiment of the invention, for environments such as shopping malls, street blocks, vehicles and living rooms, the data are cleaned into noise and clean voice for the given environment: data that do not meet the requirements are screened out and clean data are kept. The data are then labeled, marking the content of the data and the start and end time points of the speech. The data are learned by a deep neural network, the feature values of the labeled data are extracted, the feature values of different data are fitted and clustered, and the network model and the number of layers used to fit the feature values are designed. The demodulated signal is processed by the training model to restore the speaker's voice.
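A label of the kind described for voice data (content plus the start and end time points of the speech) could be represented as in the sketch below. The record layout, the field names and the minimum-duration screening rule are assumptions for illustration; the text does not define a label format or a screening criterion.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SpeechAnnotation:
    """One labeled utterance: transcript plus start and end time in seconds."""

    content: str
    start_s: float
    end_s: float
    environment: str  # e.g. "shopping mall", "vehicle", "living room"

    def __post_init__(self) -> None:
        if self.start_s < 0 or self.end_s <= self.start_s:
            raise ValueError("expected 0 <= start_s < end_s")

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def keep_usable(
    annotations: Sequence[SpeechAnnotation], min_duration_s: float = 0.3
) -> List[SpeechAnnotation]:
    """Screen out empty or very short segments (an assumed screening rule)."""
    return [
        a for a in annotations
        if a.content.strip() and a.duration_s >= min_duration_s
    ]


labels = [
    SpeechAnnotation("where is the exit", 0.5, 2.1, "shopping mall"),
    SpeechAnnotation("", 2.5, 3.5, "shopping mall"),          # no speech content
    SpeechAnnotation("ok", 4.0, 4.1, "vehicle"),              # too short
    SpeechAnnotation("turn left ahead", 5.0, 6.4, "vehicle"),
]
usable = keep_usable(labels)
```

Validating the time points when the record is created catches reversed or negative timestamps before they reach training.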
In a specific embodiment of the invention, when medical staff are treating patients, the system separates the noise of non-medical personnel from the clean voice of the medical staff, filters out the noise and keeps the clean data of the medical staff. The data are labeled, marking the content of the data and the corresponding speaker. The data are learned by a deep neural network and the feature values of the data are extracted, the voices of different speakers being different. The different feature values of the data are fitted and clustered, and the network model and the number of layers used to fit the feature values are designed. The demodulated signal is processed by the training model to restore the speaker's voice and the corresponding speaker.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (4)
1. An artificial intelligence data cleaning method based on a neural network, characterized by comprising a fitting and classifying module connected to the data to be cleaned, wherein the classifying module comprises a basic classification and an industry-field classification, the basic classification comprises images, voice, text and video, the industry-field classification is further performed on top of the basic classification, and the industry-field classification comprises fields such as finance, medical care, safety, multimedia, laws and regulations, vehicle-mounted systems and communication;
a small amount of the data to be cleaned is then labeled manually in preparation for cleaning, and an initial model is trained, the initial model establishing a corresponding function curve according to the fitting and classifying module; the trained initial model is used to predict the remaining uncleaned data, and three data sets {determined clean data, determined dirty data, uncertain data} are established, forming a deep learning model; a small amount of the uncertain data and of the determined clean data is manually corrected and added to the training set, and the model is fine-tuned again, the uncertain data selected being the samples that lie close to the decision boundary in the predictions of the deep learning model; and these steps are repeated so that the model is trained on iteratively updated data, the model receiving newly corrected data and being fine-tuned in each iteration, whereby the decision boundary of the model is directly changed and classification accuracy is improved.
2. The artificial intelligence data cleaning method based on a neural network as claimed in claim 1, wherein when the data to be cleaned have a plurality of features, the cleaned data set can be further refined into {{feature A determined clean data, feature A determined dirty data, feature A uncertain data}, {feature B determined clean data, feature B determined dirty data, feature B uncertain data}, …, {feature K determined clean data, feature K determined dirty data, feature K uncertain data}, …}, and through a combination strategy over the plurality of features, samples of unknown state can be cleaned under the deep learning model and the generalization ability of the model is improved.
3. The artificial intelligence data cleaning method based on a neural network as claimed in claim 1, wherein a neural network model connected to the cleaned data is provided at the decision boundary data output end, the data are learned by a deep neural network, the feature values of the labeled data are extracted, the feature values of different data are fitted and clustered, and the network model and the number of layers used to fit the feature values are designed.
4. The artificial intelligence data cleaning method based on a neural network as claimed in claim 3, wherein a training model connected to the neural network model is provided at the clean data output end, and the demodulated signal is processed by the training model so that the feature values in the clean data are correspondingly restored, facilitating fine-tuning and correction of subsequent data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010872303.8A CN112182257A (en) | 2020-08-26 | 2020-08-26 | Artificial intelligence data cleaning method based on neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010872303.8A CN112182257A (en) | 2020-08-26 | 2020-08-26 | Artificial intelligence data cleaning method based on neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112182257A true CN112182257A (en) | 2021-01-05 |
Family
ID=73925035
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010872303.8A Pending CN112182257A (en) | 2020-08-26 | 2020-08-26 | Artificial intelligence data cleaning method based on neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112182257A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112860676A (en) * | 2021-02-06 | 2021-05-28 | 高云 | Data cleaning method applied to big data mining and business analysis and cloud server |
CN113033694A (en) * | 2021-04-09 | 2021-06-25 | 深圳亿嘉和科技研发有限公司 | Data cleaning method based on deep learning |
CN116303382A (en) * | 2023-02-10 | 2023-06-23 | 重庆见芒信息技术咨询服务有限公司 | Multidimensional big data cleaning method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108470187A (en) * | 2018-02-26 | 2018-08-31 | 华南理工大学 | A kind of class imbalance question classification method based on expansion training dataset |
CN108764372A (en) * | 2018-06-08 | 2018-11-06 | Oppo广东移动通信有限公司 | Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set |
CN108875821A (en) * | 2018-06-08 | 2018-11-23 | Oppo广东移动通信有限公司 | The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing |
CN109582793A (en) * | 2018-11-23 | 2019-04-05 | 深圳前海微众银行股份有限公司 | Model training method, customer service system and data labeling system, readable storage medium storing program for executing |
CN110110754A (en) * | 2019-04-03 | 2019-08-09 | 华南理工大学 | Classification method based on the local imbalance problem of extensive error of cost |
CN110413786A (en) * | 2019-07-26 | 2019-11-05 | 北京智游网安科技有限公司 | Data processing method, intelligent terminal and storage medium based on web page text classification |
Non-Patent Citations (1)
Title |
---|
李勇 等: "《复杂情感分析方法及其应用》", 29 February 2020, pages: 66 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112860676A (en) * | 2021-02-06 | 2021-05-28 | 高云 | Data cleaning method applied to big data mining and business analysis and cloud server |
CN113033694A (en) * | 2021-04-09 | 2021-06-25 | 深圳亿嘉和科技研发有限公司 | Data cleaning method based on deep learning |
CN113033694B (en) * | 2021-04-09 | 2023-04-07 | 深圳亿嘉和科技研发有限公司 | Data cleaning method based on deep learning |
CN116303382A (en) * | 2023-02-10 | 2023-06-23 | 重庆见芒信息技术咨询服务有限公司 | Multidimensional big data cleaning method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111126386B (en) | Sequence domain adaptation method based on countermeasure learning in scene text recognition | |
CN112182257A (en) | Artificial intelligence data cleaning method based on neural network | |
Cao et al. | Deep neural networks for learning graph representations | |
CN111461025B (en) | Signal identification method for self-evolving zero-sample learning | |
CN112712118A (en) | Medical text data oriented filtering method and system | |
CN108280164B (en) | Short text filtering and classifying method based on category related words | |
CN107679031B (en) | Advertisement and blog identification method based on stacking noise reduction self-coding machine | |
CN111710364B (en) | Method, device, terminal and storage medium for acquiring flora marker | |
CN106991355A (en) | The face identification method of the analytical type dictionary learning model kept based on topology | |
CN111582506A (en) | Multi-label learning method based on global and local label relation | |
CN112308129A (en) | Plant nematode data automatic labeling and classification identification method based on deep learning | |
CN110738660A (en) | Spine CT image segmentation method and device based on improved U-net | |
CN113593714A (en) | Method, system, equipment and medium for detecting multi-classification new coronary pneumonia cases | |
CN107729921B (en) | Machine active learning method and learning system | |
CN116152554A (en) | Knowledge-guided small sample image recognition system | |
CN114417836A (en) | Deep learning-based Chinese electronic medical record text semantic segmentation method | |
CN114188022A (en) | Clinical children cough intelligent pre-diagnosis system based on textCNN model | |
CN113360643A (en) | Electronic medical record data quality evaluation method based on short text classification | |
CN109344309A (en) | Extensive file and picture classification method and system are stacked based on convolutional neural networks | |
CN115062602B (en) | Sample construction method and device for contrast learning and computer equipment | |
CN111159370A (en) | Short-session new problem generation method, storage medium and man-machine interaction device | |
CN114999628B (en) | Method for searching for obvious characteristic of degenerative knee osteoarthritis by using machine learning | |
CN113591955B (en) | Method, system, equipment and medium for extracting global information of graph data | |
CN113516101B (en) | Electroencephalogram signal emotion recognition method based on network structure search | |
CN113553917A (en) | Office equipment identification method based on pulse transfer learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 2021-01-05 |