CN116701378A - Method, device, processor and computer readable storage medium for realizing data cleaning based on artificial intelligence in information creation environment - Google Patents

Method, device, processor and computer readable storage medium for realizing data cleaning based on artificial intelligence in information creation environment Download PDF

Info

Publication number
CN116701378A
Authority
CN
China
Prior art keywords
data
score
artificial intelligence
cleaning
specifically
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310818372.4A
Other languages
Chinese (zh)
Inventor
孙艳彬
魏明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Primeton Information Technology Co ltd
Original Assignee
Primeton Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Primeton Information Technology Co ltd filed Critical Primeton Information Technology Co ltd
Priority to CN202310818372.4A priority Critical patent/CN116701378A/en
Publication of CN116701378A publication Critical patent/CN116701378A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for realizing data cleaning based on artificial intelligence in an information creation environment, which comprises the steps of: preprocessing data and performing feature engineering; constructing and training a machine learning model; and applying and evaluating the cleaning results. The invention also relates to a corresponding device, processor and computer readable storage medium for realizing data cleaning based on artificial intelligence in the information creation environment. The method, device, processor and computer readable storage medium effectively combine a machine learning algorithm with a bee colony optimization neural network to improve the efficiency and accuracy of data cleaning. The method improves the efficiency and accuracy of data cleaning, ensures the quality and consistency of the data, and reduces manual intervention, thereby supporting more accurate data analysis and decision-making.

Description

Method, device, processor and computer readable storage medium for realizing data cleaning based on artificial intelligence in information creation environment
Technical Field
The invention relates to the field of computers, in particular to the field of data management, and specifically relates to a method, a device, a processor and a computer readable storage medium for realizing data cleaning based on artificial intelligence in an information creation environment.
Background
Traditional data cleaning methods generally require a large amount of manual intervention, are time-consuming and labor-intensive, easily introduce human errors, and bring only limited improvement to data quality. The prior art has the following defects. Limited efficiency and accuracy: traditional data cleaning methods generally rely on manual operations, which not only leads to inefficiency but may also introduce human errors that affect the accuracy of data cleaning. Lack of automation and intelligence: existing data cleaning methods often lack automated and intelligent capabilities and cannot effectively process large-scale, high-dimensional and complex data. Incomplete data quality assessment: existing data cleaning methods typically focus on only one or a few aspects of data quality, such as integrity or consistency, while ignoring other important quality factors such as timeliness, accuracy, validity and uniqueness. Therefore, how to improve the efficiency and accuracy of data cleaning, reduce manual intervention, and ensure the overall quality of data has become a problem to be solved urgently.
CN201711059055.X proposes a data cleaning integration method and system, the method including the following steps: acquiring data to be cleaned; identifying and determining formula data and non-formula data of the data to be cleaned; calling a formula editor to identify the formula data and converting it into a document in a non-formula format; and executing data cleaning on the document in the non-formula format and on the non-formula data to obtain cleaned data, restoring the cleaned document in the non-formula format into the formula editor format, and inserting the document at the corresponding position so as to clean the data as a whole. The technical scheme of that application has the advantage that formula data can be processed.
CN202211699381.8 proposes a method, an apparatus, an electronic device and a readable medium for cleaning data of a database, where the method includes: acquiring message queue data in a message queue cluster; analyzing the message queue data to determine a data type of the message queue data; generating a target configuration file matched with the message queue data according to the data type and the service requirement of the service to which the message queue data belong; and calling the cleaning thread to load the target configuration file so as to enable the cleaning thread to perform data cleaning on the message queue data to obtain cleaning data. The problem that index combination cannot be customized rapidly according to the requirement types of different scenes is solved by customizing the target configuration file matched with the message queue data and the service requirements and then cleaning the data according to the target configuration file.
CN202310101627.5 proposes a data cleaning method, device, apparatus and medium, wherein the method comprises: acquiring a preset field triplet list and a preset task data list; acquiring a first data list according to the preset task data list; acquiring a first field name list according to the first data list and the preset field triplet list; acquiring a second field name list according to the first field name list; acquiring a target data list according to the second field name list and the preset task data list, so that the target data list is subjected to data cleaning; as can be seen, only one data cleaning judging condition with strong universality is used for cleaning the data to be processed, so that the storage capacity is small, and the resource waste is avoided; the data to be processed is processed according to a plurality of methods to obtain a target data list, so that the overall data processing amount of the system is reduced, and the operation efficiency of the system is improved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a method, a device, a processor and a computer readable storage medium for realizing data cleaning based on artificial intelligence in an information creation environment, with high accuracy, good consistency and a wide application range.
In order to achieve the above object, the method, the device, the processor and the computer readable storage medium for implementing data cleaning based on artificial intelligence in an information creation environment according to the present invention are as follows:
The method for realizing data cleaning based on artificial intelligence in the information creation environment is mainly characterized by comprising the following steps:
(1) Preprocessing data and performing feature engineering;
(2) Constructing and training a machine learning model;
(3) Applying and evaluating the cleaning results.
Preferably, the data preprocessing in the step (1) specifically includes data cleaning, noise and abnormal value removal and missing value processing.
Preferably, the feature engineering performed in the step (1) specifically includes: the appropriate features are selected and encoded.
Preferably, the step (2) specifically includes the following steps:
(2.1) marking data, and preparing a marked data set;
(2.2) extracting features from the raw data as input to a machine learning model;
(2.3) selecting a proper machine learning model according to the characteristics of the cleaning task, performing model training by using the labeling data set, and optimizing model parameters;
(2.4) evaluating the performance of the model using the validation dataset and performing model tuning.
Preferably, the step (2.3) adopts an improved swarm optimization neural network, and updates the weights and biases of the neural network in each iteration, and specifically comprises the following steps:
(2.3.1) initializing the positions of the bees, i.e. the weights and biases of the neural network;
(2.3.2) assessing the fitness of each bee by the performance of the neural network on the training data;
(2.3.3) selecting a new position according to the chaotic map and the distance-based selection strategy;
(2.3.4) if the new weights and biases result in a lower loss, updating the position of the bee;
(2.3.5) repeating steps (2.3.2) to (2.3.4) until the stop condition is satisfied.
Preferably, the step (3) specifically includes the following steps:
(3.1) defining an evaluation function for each factor of the data quality evaluation;
(3.2) calculating a score corresponding to each factor by an evaluation function of each factor;
(3.3) calculating an overall data quality score based on the scores of all factors;
(3.4) measuring the quality of the entire data set by the overall data quality score.
Preferably, the step (3.2) comprises calculating an integrity score C, specifically:
the integrity score C is calculated according to the following formula:
C = (1 - (number of null values / total number of values)) × 100;
the step (3.2) comprises calculating a consistency score Cons, specifically:
the consistency score Cons is calculated according to the following formula:
Cons = (1 - (amount of contradictory data / total amount of data)) × 100;
the step (3.2) comprises calculating a timeliness score T, specifically:
the timeliness score T is calculated according to the following formula:
T = (1 - ((current date - date of data generation) / maximum acceptable date difference)) × 100;
the step (3.2) comprises calculating an accuracy score A, specifically:
the accuracy score A is calculated according to the following formula:
A = (1 - (number of erroneous data / total number of data)) × 100;
the step (3.2) comprises calculating a validity score V, specifically:
the validity score V is calculated according to the following formula:
V = (1 - (number of data not conforming to rules / total number of data)) × 100;
the step (3.2) comprises calculating a uniqueness score U, specifically:
the uniqueness score U is calculated according to the following formula:
U = (1 - (number of duplicate data / total number of data)) × 100.
Preferably, in the step (3.3), an overall data quality score is calculated, specifically:
the overall data quality score is calculated according to the following formula:
Q = w1×C + w2×Cons + w3×T + w4×A + w5×V + w6×U;
where Q is the overall data quality score, C, Cons, T, A, V and U represent the integrity score, consistency score, timeliness score, accuracy score, validity score and uniqueness score, respectively, and w1, w2, w3, w4, w5, w6 are the weight values of each factor.
The device for realizing artificial intelligence-based data cleaning in the information creation environment is mainly characterized by comprising the following components:
a processor configured to execute computer-executable instructions;
and a memory storing one or more computer-executable instructions which, when executed by the processor, perform the steps of the above method for implementing data cleaning based on artificial intelligence in an information creation environment.
The processor for realizing data cleaning based on artificial intelligence in the information creation environment is mainly characterized in that the processor is configured to execute computer executable instructions which, when executed by the processor, implement the steps of the above method for realizing data cleaning based on artificial intelligence in the information creation environment.
The computer readable storage medium is characterized in that a computer program is stored thereon, and the computer program can be executed by a processor to realize the steps of the above method for realizing data cleaning based on artificial intelligence in the information creation environment.
The method, the device, the processor and the computer readable storage medium thereof for realizing data cleaning based on artificial intelligence in the information creation environment effectively combine a machine learning algorithm and a bee colony optimization neural network so as to improve the efficiency and the accuracy of data cleaning. The method can improve the efficiency and accuracy of data cleaning, ensure the quality and consistency of data, and reduce manual intervention so as to support more accurate data analysis and decision.
Drawings
FIG. 1 is a flow chart of the steps of the method for implementing data cleaning based on artificial intelligence in an information creation environment of the present invention.
FIG. 2 is a schematic diagram of the neural network structure of the method for implementing data cleaning based on artificial intelligence in an information creation environment of the present invention.
Detailed Description
In order to more clearly describe the technical contents of the present invention, a further description will be made below in connection with specific embodiments.
The method for realizing data cleaning based on artificial intelligence in the information creation environment comprises the following steps:
(1) Preprocessing data and performing feature engineering;
(2) Constructing and training a machine learning model;
(3) Applying and evaluating the cleaning results.
As a preferred embodiment of the present invention, the data preprocessing in the step (1) specifically includes data cleaning, noise and outlier removal and missing value processing.
As a preferred embodiment of the present invention, the feature engineering performed in the step (1) specifically includes: the appropriate features are selected and encoded.
As a preferred embodiment of the present invention, the step (2) specifically includes the following steps:
(2.1) marking data, and preparing a marked data set;
(2.2) extracting features from the raw data as input to a machine learning model;
(2.3) selecting a proper machine learning model according to the characteristics of the cleaning task, performing model training by using the labeling data set, and optimizing model parameters;
(2.4) evaluating the performance of the model using the validation dataset and performing model tuning.
As a preferred embodiment of the present invention, the step (2.3) adopts an improved swarm optimization neural network, and updates the weights and biases of the neural network at each iteration, specifically comprising the following steps:
(2.3.1) initializing the positions of the bees, i.e. the weights and biases of the neural network;
(2.3.2) assessing the fitness of each bee by the performance of the neural network on the training data;
(2.3.3) selecting a new position according to the chaotic map and the distance-based selection strategy;
(2.3.4) if the new weights and biases result in a lower loss, updating the position of the bee;
(2.3.5) repeating steps (2.3.2) to (2.3.4) until the stop condition is satisfied.
As a preferred embodiment of the present invention, the step (3) specifically includes the following steps:
(3.1) defining an evaluation function for each factor of the data quality evaluation;
(3.2) calculating a score corresponding to each factor by an evaluation function of each factor;
(3.3) calculating an overall data quality score based on the scores of all factors;
(3.4) measuring the quality of the entire data set by the overall data quality score.
As a preferred embodiment of the present invention, the step (3.2) comprises calculating an integrity score C, specifically:
the integrity score C is calculated according to the following formula:
C = (1 - (number of null values / total number of values)) × 100;
the step (3.2) comprises calculating a consistency score Cons, specifically:
the consistency score Cons is calculated according to the following formula:
Cons = (1 - (amount of contradictory data / total amount of data)) × 100;
the step (3.2) comprises calculating a timeliness score T, specifically:
the timeliness score T is calculated according to the following formula:
T = (1 - ((current date - date of data generation) / maximum acceptable date difference)) × 100;
the step (3.2) comprises calculating an accuracy score A, specifically:
the accuracy score A is calculated according to the following formula:
A = (1 - (number of erroneous data / total number of data)) × 100;
the step (3.2) comprises calculating a validity score V, specifically:
the validity score V is calculated according to the following formula:
V = (1 - (number of data not conforming to rules / total number of data)) × 100;
the step (3.2) comprises calculating a uniqueness score U, specifically:
the uniqueness score U is calculated according to the following formula:
U = (1 - (number of duplicate data / total number of data)) × 100.
As a preferred embodiment of the present invention, the step (3.3) calculates an overall data quality score, specifically:
the overall data quality score is calculated according to the following formula:
Q = w1×C + w2×Cons + w3×T + w4×A + w5×V + w6×U;
where Q is the overall data quality score, C, Cons, T, A, V and U represent the integrity score, consistency score, timeliness score, accuracy score, validity score and uniqueness score, respectively, and w1, w2, w3, w4, w5, w6 are the weight values of each factor.
The device for realizing artificial intelligence based data cleaning in the information creation environment of the invention comprises:
a processor configured to execute computer-executable instructions;
and a memory storing one or more computer-executable instructions which, when executed by the processor, perform the steps of the above method for implementing data cleaning based on artificial intelligence in an information creation environment.
The processor of the invention for implementing data cleaning based on artificial intelligence in the information creation environment is configured to execute computer executable instructions which, when executed by the processor, implement the steps of the method for implementing data cleaning based on artificial intelligence in the information creation environment.
The computer readable storage medium of the present invention has a computer program stored thereon, the computer program being executable by a processor to perform the steps of the method for implementing data cleaning based on artificial intelligence in an information creation environment as described above.
In the current information creation environment, data quality has become an important factor affecting the effectiveness of data analysis and decision-making. Data cleaning, a key step in improving data quality, is the process of identifying and correcting or deleting errors, anomalies, and inconsistencies in data.
In a specific embodiment of the invention, a method for cleaning data by using machine learning and a bee colony optimization neural network is provided, and comprises data preprocessing, feature engineering, construction and training of a machine learning model, and application and evaluation of cleaning results. Machine learning algorithms and techniques are utilized to automatically detect and correct errors, anomalies, and inconsistencies in data.
Application of machine learning in data cleansing: the present invention proposes the use of machine learning models to automatically detect and correct errors, anomalies, and inconsistencies in data, rather than traditional manual cleaning methods. The method remarkably improves the efficiency and accuracy of data cleaning.
Bee colony optimization neural network: the invention effectively combines the bee colony optimization algorithm and the neural network to carry out the data cleaning task. The method not only improves the global searching capability of the neural network, but also increases the searching diversity through the chaotic mapping, and the searching is focused based on a distance selection strategy, so that the optimal solution can be found more quickly.
Overall data quality assessment: after cleaning, the invention provides a comprehensive data quality assessment mechanism, and considers the integrity, consistency, timeliness, accuracy, effectiveness and uniqueness of the data, thereby ensuring the data cleaning effect and the data quality.
End-to-end flow of data cleansing: the invention provides a complete data cleaning flow from data preprocessing, feature engineering, machine learning model construction, training, and cleaning result application and evaluation. The flow not only improves the efficiency and effect of data cleaning, but also greatly reduces the requirement of manual intervention.
The invention discloses a method for realizing data cleaning based on artificial intelligence in an information creation environment, which comprises the following steps:
step one: preparation work before cleaning is performed.
Before using machine learning for data cleaning, the following preparation work is required:
data preprocessing: including data cleaning, noise and outlier removal, processing missing values, etc. These steps may reduce interference and errors in model training.
Feature engineering: Suitable features are selected and encoded for use by the machine learning model. The quality of the feature engineering is critical to model performance.
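As an illustrative sketch only (the patent does not prescribe specific tooling), the preparation steps above could be implemented with pandas roughly as follows; the 3-sigma clipping rule, median imputation and one-hot encoding are assumptions made for this sketch, not requirements of the invention:

    import numpy as np
    import pandas as pd

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        """Minimal preprocessing sketch: drop duplicates, damp outliers, fill missing values."""
        df = df.drop_duplicates()
        for col in df.select_dtypes(include=[np.number]).columns:
            mean, std = df[col].mean(), df[col].std()
            # Clip values outside three standard deviations as a simple noise/outlier rule (assumption).
            df[col] = df[col].clip(mean - 3 * std, mean + 3 * std)
            df[col] = df[col].fillna(df[col].median())  # impute missing numeric values
        return df

    def build_features(df: pd.DataFrame) -> pd.DataFrame:
        """Encode categorical columns so the frame can be fed to a machine learning model."""
        categorical = df.select_dtypes(include=["object", "category"]).columns
        return pd.get_dummies(df, columns=list(categorical))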
Step two: and constructing and training a machine learning model.
The data cleaning method involves the following steps:
1) Data labeling: a labeled dataset is prepared, containing the raw data and the corresponding cleaning results. Labeled data may be obtained by manual cleaning, domain experts, or other automated methods.
2) Feature extraction: features are extracted from the raw data for input to the machine learning model. Common features include statistical properties of the data, text features, time series features, and the like.
3) Model selection and training: a suitable machine learning model is selected according to the characteristics of the cleaning task. Model training is carried out using the labeled dataset, and the model parameters are optimized.
The invention provides an improved swarm optimization neural network. The basic architecture of the neural network is a multi-layer perceptron that extracts features from the data and classifies them. As shown in fig. 2:
the basic formula of the multi-layer perceptron is as follows:
h(l) = σ(W(l)·h(l-1) + b(l));
where h(l) is the output of the l-th layer, W(l) and b(l) are the weights and biases of the l-th layer, and σ is the activation function.
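For illustration, a minimal NumPy forward pass of such a multi-layer perceptron (the layer sizes and random parameters are arbitrary assumptions, not values fixed by the invention) might look like this:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, weights, biases):
        """Compute h(l) = sigma(W(l) h(l-1) + b(l)) layer by layer."""
        h = x
        for W, b in zip(weights, biases):
            h = sigmoid(W @ h + b)
        return h

    # Hypothetical 4-3-2 network with random parameters.
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
    biases = [rng.normal(size=3), rng.normal(size=2)]
    output = forward(rng.normal(size=4), weights, biases)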
The bee colony optimization algorithm is an optimization algorithm for simulating the foraging behavior of bees. In the improved swarm optimization neural network, the algorithm is used to optimize the weights and bias of the neural network.
The improved swarm optimization algorithm introduces two new concepts: chaotic mapping and distance-based selection strategies.
Chaotic mapping: to increase the diversity of the bee colony search, chaotic mapping is used to update the location of the bees. The equation for the chaotic map is as follows:
x_new = x_old + λ × (1 - 2 × Logistic(x_old));
where x_new is the new position, x_old is the old position, and λ is a parameter that controls the strength of the chaotic map. During the optimization process, the learning rate has an important influence on the convergence speed and the model performance. If the learning rate is too large, convergence may be too fast and the globally optimal solution may be missed; if the learning rate is too small, convergence may be too slow or the search may settle on a locally optimal solution.
The adaptive learning rate strategy is a method for dynamically adjusting the learning rate, so that the learning rate is automatically adjusted according to the training progress and the change of the loss function. In this strategy, λ_t is the learning rate of the t-th iteration, λ_(t-1) is the learning rate of the (t-1)-th iteration, δ_t is a preset attenuation factor, and t is the number of iterations.
The self-adaptive learning rate strategy can enable the model to quickly converge in the initial stage of training, and reduces the learning rate when approaching to the global optimal solution so as to improve the performance and the robustness of the model.
Logistic(x) is the logistic map, calculated as follows:
Logistic(x) = 4x(1 - x);
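A sketch of the chaotic position update described above follows; the logistic map is defined on (0, 1), so the sketch first squashes each coordinate into that interval, which is an assumption of this sketch rather than a step specified by the patent:

    import numpy as np

    def logistic_map(x):
        """Logistic map: Logistic(x) = 4x(1 - x), defined on (0, 1)."""
        return 4.0 * x * (1.0 - x)

    def chaotic_update(x_old, lam=0.1, lower=-1.0, upper=1.0):
        """x_new = x_old + lambda * (1 - 2 * Logistic(x)), with x normalized to (0, 1) (assumption)."""
        x_norm = (x_old - lower) / (upper - lower)  # map coordinates into (0, 1)
        return x_old + lam * (1.0 - 2.0 * logistic_map(x_norm))

    x_new = chaotic_update(np.array([0.3, -0.7, 0.1]))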
distance-based selection strategy: when selecting a new foraging site, the bees consider not only the quality of the foraging site, but also their distance from the current location. The closer the distance, the greater the likelihood of selection. The calculation formula of the distance d is as follows:
d = ||x_new - x_old||;
During training, the weights and biases of the neural network are updated using the improved swarm optimization algorithm. The optimization objective is to minimize the cross-entropy loss function, which is formulated as follows:
L = -(1/N) × Σ_i y_i × log(p(y_i));
where y_i is the true label, p(y_i) is the prediction of the model, and N is the total number of samples.
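The loss above can be computed as in the short sketch below; the clipping epsilon is only a numerical-stability detail added for the sketch, not part of the formula:

    import numpy as np

    def cross_entropy_loss(y_true, y_pred, eps=1e-12):
        """L = -(1/N) * sum_i y_i * log(p(y_i)), averaged over the N samples."""
        y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
        return -np.mean(y_true * np.log(y_pred))

    loss = cross_entropy_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7]))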
The weights and biases of the neural network are updated at each iteration of the swarm optimization algorithm, and the method comprises the following steps:
1. The positions of the bees, i.e. the weights and biases of the neural network, are initialized.
2. The fitness of each bee, i.e. the performance of the neural network on the training data, is evaluated.
3. According to the chaotic map and the distance-based selection strategy, a new position is selected.
4. If the new position (i.e., the new weights and biases) yields a lower loss, the position of the bee is updated.
5. Steps 2-4 are repeated until a stop condition is met (e.g., a maximum number of iterations is reached or the loss falls below a predetermined threshold).
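A compact sketch of steps 1-5 on a toy single-layer network follows. The population size, bounds, the sigmoid squashing before the chaotic map, and the acceptance probability 1/(1+d) used to combine distance with exploration are assumptions made for illustration, not details fixed by the invention:

    import numpy as np

    rng = np.random.default_rng(42)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def loss(params, X, y):
        """Cross-entropy loss of a single-layer network whose weights and bias are flattened in params."""
        W, b = params[:-1], params[-1]
        p = np.clip(sigmoid(X @ W + b), 1e-12, 1 - 1e-12)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def logistic_map(x):
        return 4.0 * x * (1.0 - x)

    def bee_colony_train(X, y, n_bees=20, n_iter=200, lam=0.1):
        dim = X.shape[1] + 1
        # Step 1: initialize bee positions, i.e. the network weights and biases.
        bees = rng.uniform(-1.0, 1.0, size=(n_bees, dim))
        fitness = np.array([loss(b, X, y) for b in bees])  # Step 2: evaluate fitness.
        for _ in range(n_iter):
            for i in range(n_bees):
                # Step 3: propose a new position via the chaotic map (coordinates squashed into (0, 1)).
                candidate = bees[i] + lam * (1.0 - 2.0 * logistic_map(sigmoid(bees[i])))
                # Distance-based selection: nearby candidates are more likely to be explored (assumption).
                d = np.linalg.norm(candidate - bees[i])
                if rng.random() < 1.0 / (1.0 + d):
                    cand_loss = loss(candidate, X, y)
                    # Step 4: keep the move only if it lowers the loss.
                    if cand_loss < fitness[i]:
                        bees[i], fitness[i] = candidate, cand_loss
            # Step 5: here the stop condition is simply the fixed iteration budget.
        return bees[np.argmin(fitness)]

    # Toy usage: learn a linearly separable rule.
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    best_params = bee_colony_train(X, y)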
The method provided by the invention effectively combines the expression capacity of the neural network and the global searching capacity of the bee colony optimization algorithm, and improves the performance of the data cleaning task. The chaos mapping increases the diversity of searching, and the selection strategy based on the distance enables the searching to be focused more, so that the optimal solution can be found more quickly.
4) Model evaluation and tuning: the performance of the model is evaluated using the validation dataset, and model tuning is performed, such as adjusting the model hyper-parameters or adopting ensemble learning methods.
Step three: application and evaluation of cleaning results;
the trained machine learning model can be applied to unwashed data sets to automatically detect and correct errors and anomalies in the data. The cleaning results can be used for subsequent data analysis, mining and decision making. Evaluation of the cleaning result is a key step of ensuring the cleaning effect. The invention evaluates the data quality, which comprises the following main factors: integrity, consistency, timeliness, accuracy, validity, and uniqueness.
First, an evaluation function is defined for each factor, and then an overall data quality score is calculated from these functions. This overall score can be used to measure the quality of the entire dataset. The following are descriptions of possible evaluation functions for each factor and the formula derivation:
integrity: the number of null values in the dataset is measured. For a given dataset, the integrity score may be calculated by the following formula:
C = (1 - (number of null values / total number of values)) × 100;
Consistency: measures the amount of contradictory data in the dataset. The consistency score may be calculated by the following formula:
Cons = (1 - (amount of contradictory data / total amount of data)) × 100;
Timeliness: measures the freshness of the data. This can be calculated by comparing the generation date of the data with the current date. The formula is as follows:
T = (1 - ((current date - date of data generation) / maximum acceptable date difference)) × 100;
Accuracy: measures the correctness of the data. The accuracy score may be calculated by the following formula:
A = (1 - (number of erroneous data / total number of data)) × 100;
Validity: measures whether the data meets predefined rules, constraints or definitions. The formula is as follows:
V = (1 - (number of data not conforming to rules / total number of data)) × 100;
Uniqueness: measures the degree of duplication in the data. The uniqueness score may be calculated by the following formula:
U = (1 - (number of duplicate data / total number of data)) × 100;
Each index ranges from 0 to 100; the higher the value, the better the corresponding aspect of data quality.
A formula is then used to integrate the scores of these factors into an overall data quality score. A weight value is set for each factor, and the overall score is calculated as a weighted average. The formula is shown below:
Q = w1×C + w2×Cons + w3×T + w4×A + w5×V + w6×U;
where Q is the overall data quality score and C, Cons, T, A, V, U represent the scores for integrity, consistency, timeliness, accuracy, validity and uniqueness, respectively. w1, w2, w3, w4, w5, w6 are the weight values of each factor; the setting of these weight values depends on the relative importance attached to each factor.
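A minimal sketch of the six evaluation functions and the weighted overall score is given below. It assumes a pandas DataFrame with a column of generation timestamps and caller-supplied counts of contradictory, erroneous and rule-violating records (how those counts are obtained is application-specific); the timeliness score averages record ages, which is one possible reading of the formula, and the example weights are purely hypothetical:

    import pandas as pd

    def quality_scores(df: pd.DataFrame, generated: pd.Series, max_age_days: float,
                       n_contradictory: int, n_erroneous: int, n_invalid: int) -> dict:
        """Compute C, Cons, T, A, V, U on a 0-100 scale; `generated` holds per-record datetimes."""
        total_values = df.size
        total_rows = len(df)
        mean_age_days = (pd.Timestamp.now() - generated).dt.days.mean()
        return {
            "C":    (1 - df.isna().sum().sum() / total_values) * 100,   # integrity
            "Cons": (1 - n_contradictory / total_rows) * 100,           # consistency
            "T":    (1 - mean_age_days / max_age_days) * 100,           # timeliness
            "A":    (1 - n_erroneous / total_rows) * 100,               # accuracy
            "V":    (1 - n_invalid / total_rows) * 100,                 # validity
            "U":    (1 - df.duplicated().sum() / total_rows) * 100,     # uniqueness
        }

    def overall_quality(scores: dict, weights: dict) -> float:
        """Q = w1*C + w2*Cons + w3*T + w4*A + w5*V + w6*U."""
        return sum(weights[k] * scores[k] for k in scores)

    # Example weights (hypothetical; they express the relative importance of each factor).
    weights = {"C": 0.25, "Cons": 0.2, "T": 0.1, "A": 0.2, "V": 0.15, "U": 0.1}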
The specific implementation manner of this embodiment may be referred to the related description in the foregoing embodiment, which is not repeated herein.
It is to be understood that the same or similar parts in the above embodiments may be referred to each other, and that in some embodiments, the same or similar parts in other embodiments may be referred to.
It should be noted that in the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present invention, unless otherwise indicated, the meaning of "plurality" means at least two.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution device. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps carried out in the method of the above embodiments may be implemented by a program to instruct related hardware, and the corresponding program may be stored in a computer readable storage medium, where the program when executed includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented as software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The invention provides a data cleaning method based on artificial intelligence in an information creation environment, which effectively combines a machine learning algorithm and a bee colony optimization neural network so as to improve the efficiency and accuracy of data cleaning. The main technical effects include:
1. data cleaning efficiency and accuracy are improved: by using a machine learning algorithm to automatically detect and correct errors, anomalies and inconsistencies in the data, the invention significantly improves the efficiency and accuracy of data cleaning. In particular, the application of the swarm optimization neural network combines the expression capability of the neural network and the global searching capability of the swarm optimization algorithm, thereby providing higher performance for the data cleaning task.
2. The manual intervention is reduced: the invention reduces the need for manual intervention, reduces the labor intensity and time cost of data cleaning, and avoids human errors.
3. Ensuring data quality and consistency: through comprehensive data quality evaluation, the invention ensures the integrity, consistency, timeliness, accuracy, effectiveness and uniqueness of the data, and further ensures the quality and consistency of the data.
4. Has universality and expandability: the method is not only suitable for specific data or specific data cleaning tasks, but also has good expandability, and can be applied to wider data types and cleaning tasks.
5. Facilitating data analysis and decision making: the cleaned data is more accurate and consistent, and can better support subsequent data analysis, data mining and decision making, thereby improving service efficiency and decision quality.
The method, the device, the processor and the computer readable storage medium thereof for realizing data cleaning based on artificial intelligence in the information creation environment effectively combine a machine learning algorithm and a bee colony optimization neural network so as to improve the efficiency and the accuracy of data cleaning. The method can improve the efficiency and accuracy of data cleaning, ensure the quality and consistency of data, and reduce manual intervention so as to support more accurate data analysis and decision.
In this specification, the invention has been described with reference to specific embodiments thereof. It will be apparent, however, that various modifications and changes may be made without departing from the spirit and scope of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (11)

1. The method for realizing data cleaning based on artificial intelligence in an information creation environment is characterized by comprising the following steps:
(1) Preprocessing data and performing feature engineering;
(2) Constructing and training a machine learning model;
(3) Applying and evaluating the cleaning results.
2. The method for realizing data cleaning based on artificial intelligence in an information creation environment according to claim 1, wherein the data preprocessing in the step (1) specifically comprises data cleaning, noise and outlier removal, and missing value processing.
3. The method for implementing data cleaning based on artificial intelligence in an information creation environment according to claim 1, wherein the feature engineering of the step (1) is specifically: the appropriate features are selected and encoded.
4. The method for implementing data cleaning based on artificial intelligence in an information creation environment according to claim 1, wherein the step (2) specifically comprises the following steps:
(2.1) marking data, and preparing a marked data set;
(2.2) extracting features from the raw data as input to a machine learning model;
(2.3) selecting a proper machine learning model according to the characteristics of the cleaning task, performing model training by using the labeling data set, and optimizing model parameters;
(2.4) evaluating the performance of the model using the validation dataset and performing model tuning.
5. The method for implementing data cleaning based on artificial intelligence in an information creation environment according to claim 4, wherein the step (2.3) adopts an improved swarm optimization neural network and updates the weights and biases of the neural network at each iteration, specifically comprising the following steps:
(2.3.1) initializing the positions of the bees, i.e. the weights and biases of the neural network;
(2.3.2) assessing the fitness of each bee by the performance of the neural network on the training data;
(2.3.3) selecting a new position according to the chaotic map and the distance-based selection strategy;
(2.3.4) if the new weights and biases result in a lower loss, updating the position of the bee;
(2.3.5) repeating steps (2.3.2) to (2.3.4) until the stop condition is satisfied.
6. The method for implementing data cleaning based on artificial intelligence in an information creation environment according to claim 1, wherein the step (3) specifically comprises the following steps:
(3.1) defining an evaluation function for each factor of the data quality evaluation;
(3.2) calculating a score corresponding to each factor by an evaluation function of each factor;
(3.3) calculating an overall data quality score based on the scores of all factors;
(3.4) measuring the quality of the entire data set by the overall data quality score.
7. The method of claim 6, wherein the step (3.2) comprises calculating an integrity score C, specifically:
the integrity score C is calculated according to the following formula:
C = (1 - (number of null values / total number of values)) × 100;
the step (3.2) comprises calculating a consistency score Cons, specifically:
the consistency score Cons is calculated according to the following formula:
Cons = (1 - (amount of contradictory data / total amount of data)) × 100;
the step (3.2) comprises calculating a timeliness score T, specifically:
the timeliness score T is calculated according to the following formula:
T = (1 - ((current date - date of data generation) / maximum acceptable date difference)) × 100;
the step (3.2) comprises calculating an accuracy score A, specifically:
the accuracy score A is calculated according to the following formula:
A = (1 - (number of erroneous data / total number of data)) × 100;
the step (3.2) comprises calculating a validity score V, specifically:
the validity score V is calculated according to the following formula:
V = (1 - (number of data not conforming to rules / total number of data)) × 100;
the step (3.2) comprises calculating a uniqueness score U, specifically:
the uniqueness score U is calculated according to the following formula:
U = (1 - (number of duplicate data / total number of data)) × 100.
8. The method for implementing data cleaning based on artificial intelligence in an information creation environment according to claim 1, wherein the step (3.3) calculates an overall data quality score, specifically:
the overall data quality score is calculated according to the following formula:
Q = w1×C + w2×Cons + w3×T + w4×A + w5×V + w6×U;
where Q is the overall data quality score, C, Cons, T, A, V and U represent the integrity score, consistency score, timeliness score, accuracy score, validity score and uniqueness score, respectively, and w1, w2, w3, w4, w5, w6 are the weight values of each factor.
9. An apparatus for implementing artificial intelligence based data cleaning in an information creation environment, said apparatus comprising:
a processor configured to execute computer-executable instructions;
a memory storing one or more computer-executable instructions which, when executed by the processor, perform the steps of the method for implementing data cleaning based on artificial intelligence in an information creation environment as claimed in any one of claims 1 to 8.
10. A processor for implementing artificial intelligence based data cleaning in an information creation environment, the processor being configured to execute computer executable instructions which, when executed by the processor, implement the steps of the method for implementing artificial intelligence based data cleaning in an information creation environment according to any one of claims 1 to 8.
11. A computer readable storage medium having stored thereon a computer program executable by a processor to perform the steps of the method for implementing data cleaning based on artificial intelligence in an information creation environment according to any one of claims 1 to 8.
CN202310818372.4A 2023-07-04 2023-07-04 Method, device, processor and computer readable storage medium for realizing data cleaning based on artificial intelligence in information creation environment Pending CN116701378A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310818372.4A CN116701378A (en) 2023-07-04 2023-07-04 Method, device, processor and computer readable storage medium for realizing data cleaning based on artificial intelligence in information creation environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310818372.4A CN116701378A (en) 2023-07-04 2023-07-04 Method, device, processor and computer readable storage medium for realizing data cleaning based on artificial intelligence in information creation environment

Publications (1)

Publication Number Publication Date
CN116701378A true CN116701378A (en) 2023-09-05

Family

ID=87841112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310818372.4A Pending CN116701378A (en) 2023-07-04 2023-07-04 Method, device, processor and computer readable storage medium for realizing data cleaning based on artificial intelligence in information creation environment

Country Status (1)

Country Link
CN (1) CN116701378A (en)

Similar Documents

Publication Publication Date Title
EP1589473A2 (en) Using tables to learn trees
CN111340227A (en) Method and device for compressing business prediction model through reinforcement learning model
CN113792937A (en) Social network influence prediction method and device based on graph neural network
CN112100403A (en) Knowledge graph inconsistency reasoning method based on neural network
CN117349782B (en) Intelligent data early warning decision tree analysis method and system
CN113449919B (en) Power consumption prediction method and system based on feature and trend perception
CN114998691A (en) Semi-supervised ship classification model training method and device
CN116340726A (en) Energy economy big data cleaning method, system, equipment and storage medium
CN104732067A (en) Industrial process modeling forecasting method oriented at flow object
CN113554302A (en) Production management method and system based on MES intelligent manufacturing
CN110766201A (en) Revenue prediction method, system, electronic device, computer-readable storage medium
Goel et al. Dynamically adaptive and diverse dual ensemble learning approach for handling concept drift in data streams
US11989656B2 (en) Search space exploration for deep learning
KR101827124B1 (en) System and Method for recognizing driving pattern of driver
KR20220014744A (en) Data preprocessing system based on a reinforcement learning and method thereof
CN111524023A (en) Greenhouse adjusting method and system
Liu et al. Residual useful life prognosis of equipment based on modified hidden semi-Markov model with a co-evolutional optimization method
CN116701378A (en) Method, device, processor and computer readable storage medium for realizing data cleaning based on artificial intelligence in information creation environment
CN116720079A (en) Wind driven generator fault mode identification method and system based on multi-feature fusion
CN114780619B (en) Abnormity early warning method for automatic engineering audit data
CN113139332A (en) Automatic model construction method, device and equipment
JP4230890B2 (en) Model identification device, model identification program, and method of operating model identification device
Whitehouse et al. Tree sequences as a general-purpose tool for population genetic inference
KR102636461B1 (en) Automated labeling method, device, and system for learning artificial intelligence models
JP7154468B2 (en) Information processing device, information processing method and information processing program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination