CN112182257A - Artificial intelligence data cleaning method based on neural network - Google Patents

Artificial intelligence data cleaning method based on neural network

Info

Publication number
CN112182257A
Authority
CN
China
Prior art keywords
data
model
neural network
classification
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010872303.8A
Other languages
Chinese (zh)
Inventor
胡程远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Sanen Information Technology Co ltd
Original Assignee
Hefei Sanen Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-08-26
Filing date
2020-08-26
Publication date
2021-01-05
Application filed by Hefei Sanen Information Technology Co ltd
Priority to CN202010872303.8A
Publication of CN112182257A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/435 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention relates to an artificial intelligence data cleaning method based on a neural network, comprising a fitting and classifying module connected to the data to be cleaned. The classifying module comprises a basic classification and an industry field classification: the basic classification covers images, voice, text and video, and the industry field classification is applied on top of the basic classification and covers fields such as finance, medical treatment, safety, multimedia, laws and regulations, vehicle-mounted systems and communications. A small amount of the data to be cleaned is then manually labeled in preparation for cleaning, and an initial model is trained; the initial model establishes a corresponding function curve according to the fitting and classifying module, and the trained initial model is then used to predict the remaining uncleaned data. Advantages: the method can further enrich the feature combinations of the training set and improve the generalization ability of the model; the richer the data features seen during model learning, the stronger the generalization ability and the wider the range of scenarios to which the prediction model applies.

Description

Artificial intelligence data cleaning method based on neural network
Technical Field
The invention relates to the technical field of data cleaning processing, in particular to an artificial intelligence data cleaning method based on a neural network.
Background
Data cleaning refers to finding identifiable errors in data files, deleting duplicate information and correcting existing errors, so as to ensure data consistency. In the field of artificial intelligence it is important to keep data "clean": if wrong or invalid data are input, the output is affected and errors result. Traditional data cleaning methods include checking data consistency, handling invalid values and identifying data conflicts, and the whole process involves multiple rounds of review, verification and labeling. In most data and artificial intelligence deployment projects, data cleaning is one of the most costly tasks. Clean data are the basis of subsequent research and analysis, yet data cleaning is labor-intensive and extremely expensive, and the technicians involved spend a large amount of time cleaning a data set. It is no exaggeration to say that data cleaning can take up 80% of a project's algorithm working time, while the time actually spent analyzing data and training models accounts for only about 20%. This consumes a large amount of manpower and hinders the popularization and application of data cleaning systems.
Disclosure of Invention
The invention aims to provide an artificial intelligence data cleaning method based on a neural network that changes the model's decision boundary, improves classification accuracy and improves the generalization ability of the model; the richer the data features seen during model learning, the stronger the generalization ability and the wider the range of scenarios to which the prediction model applies. This is realized by the following scheme.
In order to achieve the above purpose, the invention adopts the following technical scheme: an artificial intelligence data cleaning method based on a neural network, characterized by comprising a fitting and classifying module connected to the data to be cleaned, wherein the classifying module comprises a basic classification and an industry field classification, the basic classification comprises images, voice, text and video, the industry field classification is further applied on top of the basic classification, and the industry field classification comprises fields such as finance, medical treatment, safety, multimedia, laws and regulations, vehicle-mounted systems and communications;
then, a small amount of the data to be cleaned is manually labeled in preparation for cleaning, and an initial model is trained, the initial model establishing a corresponding function curve according to the fitting and classifying module; the trained initial model is then used to predict the remaining uncleaned data, three data sets {determined clean data, determined dirty data, uncertain data} are established, and a deep learning model is formed; a small amount of the uncertain data and the determined clean data is manually corrected and added to the training set, and the model is fine-tuned again, the uncertain data being selected as the samples that lie close to the decision boundary in the predictions of the deep learning model; these steps are repeated so that the model is trained on iteratively updated data, and in each iteration the model receives newly corrected data and is fine-tuned; this process directly changes the decision boundary of the model and improves classification accuracy.
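The three-way split described above can be sketched as follows. This is only a minimal illustration, assuming the model outputs, for each sample, a probability that the sample is clean. The function name `partition_by_confidence` and the thresholds `clean_thr` and `dirty_thr` are hypothetical; the source does not state how the three sets are delimited.

```python
from typing import Dict, List, Sequence


def partition_by_confidence(
    probs: Sequence[float],
    clean_thr: float = 0.9,  # assumed threshold, not given in the source
    dirty_thr: float = 0.1,  # assumed threshold, not given in the source
) -> Dict[str, List[int]]:
    """Split sample indices into clean / dirty / uncertain sets.

    probs[i] is the model's predicted probability that sample i is clean.
    """
    sets: Dict[str, List[int]] = {"clean": [], "dirty": [], "uncertain": []}
    for i, p in enumerate(probs):
        if p >= clean_thr:
            sets["clean"].append(i)
        elif p <= dirty_thr:
            sets["dirty"].append(i)
        else:
            # Predictions between the two thresholds lie near the decision
            # boundary and are the candidates for manual correction.
            sets["uncertain"].append(i)
    return sets


if __name__ == "__main__":
    print(partition_by_confidence([0.97, 0.55, 0.03, 0.91, 0.40]))
    # {'clean': [0, 3], 'dirty': [2], 'uncertain': [1, 4]}
```

Only the samples in the "uncertain" set would be passed to a human annotator in this sketch, which is what keeps the manual effort small.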
Further, when the data to be cleaned has a plurality of features, the cleaned data set can be further refined into {{feature A determined clean data, feature A determined dirty data, feature A uncertain data}, {feature B determined clean data, feature B determined dirty data, feature B uncertain data}, …, {feature K determined clean data, feature K determined dirty data, feature K uncertain data}, …}; through a combination strategy over the plurality of features, samples of unknown state can be cleaned under the deep learning model, and the generalization ability of the model is improved.
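One possible reading of the per-feature refinement is sketched below. It assumes each sample carries one clean-probability per feature. The thresholds, the names `refine_by_feature` and `combine`, and the combination rule itself (dirty if any feature is dirty, clean only if every feature is clean, otherwise uncertain) are all assumptions made for illustration; the source does not define the combination strategy.

```python
from typing import Dict, List


def three_way(p: float, clean_thr: float = 0.9, dirty_thr: float = 0.1) -> str:
    """Map a clean-probability to 'clean', 'dirty' or 'uncertain'."""
    if p >= clean_thr:
        return "clean"
    if p <= dirty_thr:
        return "dirty"
    return "uncertain"


def refine_by_feature(
    scores: Dict[str, Dict[str, float]],
) -> Dict[str, Dict[str, List[str]]]:
    """Build {feature: {clean: [...], dirty: [...], uncertain: [...]}}.

    scores[sample_id][feature] is the probability that the feature is clean.
    """
    refined: Dict[str, Dict[str, List[str]]] = {}
    for sample_id, feats in scores.items():
        for feat, p in feats.items():
            sets = refined.setdefault(
                feat, {"clean": [], "dirty": [], "uncertain": []}
            )
            sets[three_way(p)].append(sample_id)
    return refined


def combine(feats: Dict[str, float]) -> str:
    """Assumed combination rule over all features of one sample."""
    states = {three_way(p) for p in feats.values()}
    if "dirty" in states:
        return "dirty"
    if states == {"clean"}:
        return "clean"
    return "uncertain"


if __name__ == "__main__":
    scores = {
        "s1": {"A": 0.95, "B": 0.97},
        "s2": {"A": 0.95, "B": 0.50},
        "s3": {"A": 0.02, "B": 0.99},
    }
    print(refine_by_feature(scores))
    print({sid: combine(f) for sid, f in scores.items()})
```

A stricter or looser rule (for example, majority voting over features) would fit the same data structure; only `combine` would change.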
Furthermore, a neural network model connected to the cleaning data is provided at the decision boundary data output end; the data are learned through a deep neural network, the feature values of the labeled data are extracted, the feature values of different data are fitted and clustered, and the network model and the number of layers used to fit the feature values are designed.
Furthermore, a training model connected to the neural network model is provided at the clean data output end, and the demodulated signal is processed by the training model so that the feature values in the clean data are correspondingly restored, which facilitates fine-tuning and correction of subsequent data.
The technical effects of the invention are as follows: the model is obtained through training on iteratively updated data, and in each iteration the model receives newly corrected data and is fine-tuned; this process directly changes the decision boundary of the model and improves classification accuracy. Adding the data selected through diversity-based cleaning to the training set further enriches the feature combinations of the training set and improves the generalization ability of the model; the richer the data features seen during model learning, the stronger the generalization ability and the wider the range of scenarios to which the prediction model applies.
Drawings
FIG. 1 is a schematic diagram of the present invention.
Detailed Description
Referring to FIG. 1, the artificial intelligence data cleaning method based on a neural network comprises a fitting and classifying module connected to the data to be cleaned, wherein the classifying module comprises a basic classification and an industry field classification, and the basic classification comprises images, voice, text and video. The industry field classification is further applied on top of the basic classification and comprises fields such as finance, medical treatment, safety, multimedia, laws and regulations, vehicle-mounted systems and communications;
then, a small amount of the data to be cleaned is manually labeled in preparation for cleaning, and an initial model is trained, the initial model establishing a corresponding function curve according to the fitting and classifying module; the trained initial model is then used to predict the remaining uncleaned data, three data sets {determined clean data, determined dirty data, uncertain data} are established, and a deep learning model is formed; a small amount of the uncertain data and the determined clean data is manually corrected and added to the training set, and the model is fine-tuned again, the uncertain data being selected as the samples that lie close to the decision boundary in the predictions of the deep learning model; these steps are repeated so that the model is trained on iteratively updated data, and in each iteration the model receives newly corrected data and is fine-tuned; this process directly changes the decision boundary of the model and improves classification accuracy.
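The iterative label, predict, correct and fine-tune cycle of this embodiment can be sketched as below. This is not the patented implementation: a one-dimensional logistic regression stands in for the deep learning model so the example stays self-contained, and the `annotate` callback stands in for the human annotator. The names `fine_tune` and `clean_iteratively`, the learning rate, the number of rounds and the batch size are all assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple


def sigmoid(z: float) -> float:
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(w: float, b: float, xs: List[float], ys: List[int],
              lr: float = 0.5, epochs: int = 300) -> Tuple[float, float]:
    """Continue gradient descent on a 1-D logistic model from (w, b)."""
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def clean_iteratively(pool: List[float], seed_idx: List[int],
                      annotate: Callable[[float], int],
                      rounds: int = 5, batch: int = 4
                      ) -> Tuple[float, float, Dict[int, int]]:
    """annotate(x) plays the human annotator (1 = clean, 0 = dirty)."""
    # Step 1: label a small seed set by hand and train the initial model.
    labeled: Dict[int, int] = {i: annotate(pool[i]) for i in seed_idx}
    w, b = fine_tune(0.0, 0.0, [pool[i] for i in labeled], list(labeled.values()))
    for _ in range(rounds):
        # Step 2: predict the remaining data; the most uncertain samples are
        # those whose predicted probability is closest to 0.5.
        rest = [i for i in range(len(pool)) if i not in labeled]
        rest.sort(key=lambda i: abs(sigmoid(w * pool[i] + b) - 0.5))
        # Step 3: manually correct a small batch and add it to the training set.
        for i in rest[:batch]:
            labeled[i] = annotate(pool[i])
        # Step 4: fine-tune from the current weights on the enlarged set.
        w, b = fine_tune(w, b, [pool[i] for i in labeled], list(labeled.values()))
    return w, b, labeled


if __name__ == "__main__":
    pool = [i / 50 - 1 for i in range(101)]   # toy 1-D samples in [-1, 1]
    truth = lambda x: 1 if x > 0.2 else 0     # hidden rule for "clean"
    w, b, labeled = clean_iteratively(pool, [0, 25, 75, 100], truth)
    correct = sum((sigmoid(w * x + b) >= 0.5) == bool(truth(x)) for x in pool)
    print(f"labeled {len(labeled)} of {len(pool)}, accuracy {correct / len(pool):.2f}")
```

Each round starts from the previous weights rather than from scratch, which is the sense in which the model is fine-tuned and its decision boundary shifted by the newly corrected samples.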
In a specific embodiment of the invention, when the data to be cleaned has multiple features, the cleaned data set can be further refined into {{feature A determined clean data, feature A determined dirty data, feature A uncertain data}, {feature B determined clean data, feature B determined dirty data, feature B uncertain data}, …, {feature K determined clean data, feature K determined dirty data, feature K uncertain data}, …}; through a combination strategy over the multiple features, samples of unknown state can be cleaned under the deep learning model, and the generalization ability of the model is improved.
In a specific embodiment of the invention, a neural network model connected to the cleaning data is provided at the decision boundary data output end; the data are learned through a deep neural network, the feature values of the labeled data are extracted, the feature values of different data are fitted and clustered, and the network model and the number of layers used to fit the feature values are designed.
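The clustering of extracted feature values can be illustrated with a small sketch. It assumes the feature vectors have already been extracted by the network, and it uses plain k-means as the clustering step; the source does not name a clustering algorithm, so k-means, the function name `kmeans` and the deterministic initialization from the first `k` points are all assumptions.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int,
           iters: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Cluster feature vectors; returns (assignments, centroids)."""
    # Deterministic initialization: use the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        assign = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors for five samples.
    features = [(0.1, 0.2), (5.0, 5.1), (0.0, 0.1), (5.2, 4.9), (0.2, 0.0)]
    assign, centroids = kmeans(features, k=2)
    print(assign)     # [0, 1, 0, 1, 0]
    print(centroids)
```

In practice the vectors would come from an intermediate layer of the trained network, and the number of clusters would have to be chosen for the data at hand.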
In a specific embodiment of the invention, a training model connected to the neural network model is provided at the clean data output end, and the demodulated signal is processed by the training model so that the feature values in the clean data are correspondingly restored, which facilitates fine-tuning and correction of subsequent data.
In a specific embodiment of the invention, for a given environment such as a shopping mall, street block, vehicle or living room, data consisting of noise and clean voice are cleaned: data that do not meet the requirements are screened out and the clean data are retained. The data are then labeled, marking the content of the data and the start and end time points of speech. The data are learned through a deep neural network, the feature values of the labeled data are extracted, the feature values of different data are fitted and clustered, and the network model and the number of layers used to fit the feature values are designed. The demodulated signal is processed by the training model to restore the speaker's voice.
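The labeling and screening step of the voice embodiment could be represented as in the following sketch. The record fields for content, start time and end time follow the text; the `snr_db` field and both screening thresholds are hypothetical, since the source says only that data "that do not meet the requirements" are screened out without stating the requirements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechSegment:
    content: str    # labeled content of the utterance
    start_s: float  # speaking start time point, in seconds
    end_s: float    # speaking end time point, in seconds
    snr_db: float   # hypothetical quality measure; not named in the source


def keep_clean(segments: List[SpeechSegment],
               min_snr_db: float = 10.0,      # assumed requirement
               min_duration_s: float = 0.2    # assumed requirement
               ) -> List[SpeechSegment]:
    """Screen out segments that fail the assumed requirements."""
    kept = []
    for seg in segments:
        duration = seg.end_s - seg.start_s
        if duration >= min_duration_s and seg.snr_db >= min_snr_db:
            kept.append(seg)
    return kept


if __name__ == "__main__":
    segments = [
        SpeechSegment("hello", 0.0, 1.2, 18.0),   # kept
        SpeechSegment("", 1.2, 1.25, 20.0),       # too short
        SpeechSegment("noise", 2.0, 3.0, 3.0),    # too noisy
    ]
    print([seg.content for seg in keep_clean(segments)])  # ['hello']
```

The retained segments, with their content and time labels, would then serve as the labeled data from which the network extracts feature values.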
In a specific embodiment of the invention, when medical staff are treating patients, the system can clean data consisting of noise from non-medical personnel and the clean voice of the medical staff: the noise is filtered out and the clean data of the medical staff are retained. The data are labeled, marking the content of the data and the corresponding speaker. The data are learned through a deep neural network and the feature values of the data are extracted (that is, the differences between the speakers' voices); the different feature values are fitted and clustered, and the network model and the number of layers used to fit the feature values are designed. The demodulated signal is processed by the training model to restore the speaker's voice and attribute it to the corresponding person.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (4)

1. An artificial intelligence data cleaning method based on a neural network, characterized by comprising a fitting and classifying module connected to the data to be cleaned, wherein the classifying module comprises a basic classification and an industry field classification, the basic classification comprises images, voice, text and video, the industry field classification is further applied on top of the basic classification, and the industry field classification comprises fields such as finance, medical treatment, safety, multimedia, laws and regulations, vehicle-mounted systems and communications;
then, a small amount of the data to be cleaned is manually labeled in preparation for cleaning, and an initial model is trained, the initial model establishing a corresponding function curve according to the fitting and classifying module; the trained initial model is then used to predict the remaining uncleaned data, three data sets {determined clean data, determined dirty data, uncertain data} are established, and a deep learning model is formed; a small amount of the uncertain data and the determined clean data is manually corrected and added to the training set, and the model is fine-tuned again, the uncertain data being selected as the samples that lie close to the decision boundary in the predictions of the deep learning model; these steps are repeated so that the model is trained on iteratively updated data, and in each iteration the model receives newly corrected data and is fine-tuned; this process directly changes the decision boundary of the model and improves classification accuracy.
2. The artificial intelligence data cleaning method based on a neural network as claimed in claim 1, wherein, when the data to be cleaned has a plurality of features, the cleaned data set can be further refined into {{feature A determined clean data, feature A determined dirty data, feature A uncertain data}, {feature B determined clean data, feature B determined dirty data, feature B uncertain data}, …, {feature K determined clean data, feature K determined dirty data, feature K uncertain data}, …}, and through a combination strategy over the plurality of features, samples of unknown state can be cleaned under the deep learning model and the generalization ability of the model is improved.
3. The artificial intelligence data cleaning method based on a neural network as claimed in claim 1, wherein a neural network model connected to the cleaning data is provided at the decision boundary data output end, the data are learned through a deep neural network, the feature values of the labeled data are extracted, the feature values of different data are fitted and clustered, and the network model and the number of layers used to fit the feature values are designed.
4. The artificial intelligence data cleaning method based on a neural network as claimed in claim 3, wherein a training model connected to the neural network model is provided at the clean data output end, and the demodulated signal is processed by the training model so that the feature values in the clean data are correspondingly restored, which facilitates fine-tuning and correction of subsequent data.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010872303.8A CN112182257A (en) 2020-08-26 2020-08-26 Artificial intelligence data cleaning method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010872303.8A CN112182257A (en) 2020-08-26 2020-08-26 Artificial intelligence data cleaning method based on neural network

Publications (1)

Publication Number Publication Date
CN112182257A true CN112182257A (en) 2021-01-05

Family

ID=73925035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010872303.8A Pending CN112182257A (en) 2020-08-26 2020-08-26 Artificial intelligence data cleaning method based on neural network

Country Status (1)

Country Link
CN (1) CN112182257A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108470187A (en) * 2018-02-26 2018-08-31 华南理工大学 A kind of class imbalance question classification method based on expansion training dataset
CN108764372A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set
CN108875821A (en) * 2018-06-08 2018-11-23 Oppo广东移动通信有限公司 The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing
CN109582793A (en) * 2018-11-23 2019-04-05 深圳前海微众银行股份有限公司 Model training method, customer service system and data labeling system, readable storage medium storing program for executing
CN110110754A (en) * 2019-04-03 2019-08-09 华南理工大学 Classification method based on the local imbalance problem of extensive error of cost
CN110413786A (en) * 2019-07-26 2019-11-05 北京智游网安科技有限公司 Data processing method, intelligent terminal and storage medium based on web page text classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Yong (李勇) et al.: "Complex Sentiment Analysis Methods and Their Applications" (《复杂情感分析方法及其应用》), 29 February 2020, pages: 66 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860676A (en) * 2021-02-06 2021-05-28 高云 Data cleaning method applied to big data mining and business analysis and cloud server
CN113033694A (en) * 2021-04-09 2021-06-25 深圳亿嘉和科技研发有限公司 Data cleaning method based on deep learning
CN113033694B (en) * 2021-04-09 2023-04-07 深圳亿嘉和科技研发有限公司 Data cleaning method based on deep learning
CN116303382A (en) * 2023-02-10 2023-06-23 重庆见芒信息技术咨询服务有限公司 Multidimensional big data cleaning method and system

Similar Documents

Publication Publication Date Title
CN111126386B (en) Sequence domain adaptation method based on countermeasure learning in scene text recognition
CN112182257A (en) Artificial intelligence data cleaning method based on neural network
Cao et al. Deep neural networks for learning graph representations
CN111461025B (en) Signal identification method for self-evolving zero-sample learning
CN112712118A (en) Medical text data oriented filtering method and system
CN108280164B (en) Short text filtering and classifying method based on category related words
CN107679031B (en) Advertisement and blog identification method based on stacking noise reduction self-coding machine
CN111710364B (en) Method, device, terminal and storage medium for acquiring flora marker
CN106991355A (en) The face identification method of the analytical type dictionary learning model kept based on topology
CN111582506A (en) Multi-label learning method based on global and local label relation
CN112308129A (en) Plant nematode data automatic labeling and classification identification method based on deep learning
CN110738660A (en) Spine CT image segmentation method and device based on improved U-net
CN113593714A (en) Method, system, equipment and medium for detecting multi-classification new coronary pneumonia cases
CN107729921B (en) Machine active learning method and learning system
CN116152554A (en) Knowledge-guided small sample image recognition system
CN114417836A (en) Deep learning-based Chinese electronic medical record text semantic segmentation method
CN114188022A (en) Clinical children cough intelligent pre-diagnosis system based on textCNN model
CN113360643A (en) Electronic medical record data quality evaluation method based on short text classification
CN109344309A (en) Extensive file and picture classification method and system are stacked based on convolutional neural networks
CN115062602B (en) Sample construction method and device for contrast learning and computer equipment
CN111159370A (en) Short-session new problem generation method, storage medium and man-machine interaction device
CN114999628B (en) Method for searching for obvious characteristic of degenerative knee osteoarthritis by using machine learning
CN113591955B (en) Method, system, equipment and medium for extracting global information of graph data
CN113516101B (en) Electroencephalogram signal emotion recognition method based on network structure search
CN113553917A (en) Office equipment identification method based on pulse transfer learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210105