CN111241258A - Data cleaning method and device, computer equipment and readable storage medium - Google Patents
Data cleaning method and device, computer equipment and readable storage medium
- Publication number: CN111241258A
- Application number: CN202010016777.2A
- Authority: CN (China)
- Prior art keywords: knowledge point, feature vector, point problem, vector sequence, sub
- Prior art date: 2020-01-08
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/35—Clustering; Classification (information retrieval of unstructured textual data)
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The embodiment of the invention provides a data cleaning method and device, computer equipment and a readable storage medium. The method comprises the following steps: acquiring data to be cleaned, and, for each knowledge point, forming a knowledge point pair from the main knowledge point problem and each sub knowledge point problem; for each knowledge point pair, inputting the main knowledge point problem and the sub knowledge point problem into a bert pre-training model, and outputting the context semantic feature vector sequence of the main knowledge point problem and the context semantic feature vector sequence of the sub knowledge point problem; for each knowledge point pair, inputting the two context semantic feature vector sequences into an attention mechanism model, and outputting a semantic matching value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem; and determining whether the knowledge point pair is dirty data according to the semantic matching degree.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data cleaning method, a data cleaning device, computer equipment and a readable storage medium.
Background
With the rapid development of the internet industry in recent years, more and more internet platforms have begun to use intelligent question-answering systems to serve online customer consultation. However, the effectiveness of an intelligent question-answering system depends on its underlying knowledge base, which stores knowledge points sorted out from online real-time data. The knowledge base contains a plurality of knowledge points, and each knowledge point comprises a plurality of sub knowledge points and corresponding answers. The intelligent question-answering system outputs the best answer from the knowledge base according to the customer's question.
Therefore, for the intelligent question-answering system, the accuracy of the knowledge base determines its application effect. Meanwhile, dirty data exists in a knowledge base compiled from online data, so verifying the knowledge points of the knowledge base and ensuring the correctness of the answers is very important. Data detection and cleaning are therefore needed for the knowledge base.
At present, data detection and cleaning methods have insufficient text semantic understanding capability, which affects the accuracy of data detection and cleaning.
Disclosure of Invention
The embodiment of the invention provides a data cleaning method, which aims to solve the technical problem of low accuracy in data cleaning in the prior art. The method comprises the following steps:
acquiring data to be cleaned, and aiming at each knowledge point, respectively forming a knowledge point pair by the main knowledge point problem and each sub knowledge point problem;
aiming at each knowledge point pair, respectively inputting a main knowledge point problem and a sub knowledge point problem into a bert pre-training model, outputting a context semantic feature vector sequence of the main knowledge point problem, and outputting a context semantic feature vector sequence of the sub knowledge point problem, wherein the bert pre-training model is obtained by training based on sample data of data to be cleaned;
aiming at each knowledge point pair, inputting a context semantic feature vector sequence of a main knowledge point problem and a context semantic feature vector sequence of a sub knowledge point problem into an attention mechanism model, and outputting a semantic matching value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem, wherein the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem comprise feature information representing the global importance degree of a feature vector;
and determining whether the knowledge point pair is dirty data or not according to the semantic matching degree.
The embodiment of the invention also provides a data cleaning device, which is used for solving the technical problem of low accuracy in data cleaning in the prior art. The device includes:
the data acquisition module is used for acquiring data to be cleaned and respectively forming knowledge point pairs by the main knowledge point problem and each sub knowledge point problem aiming at each knowledge point;
the vector extraction module is used for respectively inputting the main knowledge point problem and the sub knowledge point problem into a bert pre-training model aiming at each knowledge point pair, outputting a context semantic feature vector sequence of the main knowledge point problem and outputting a context semantic feature vector sequence of the sub knowledge point problem, wherein the bert pre-training model is obtained by training based on sample data of data to be cleaned;
the semantic matching degree calculation module is used for inputting the context semantic feature vector sequence of the main knowledge point problem and the context semantic feature vector sequence of the sub knowledge point problem into an attention mechanism model aiming at each knowledge point pair, and outputting a semantic matching degree value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem, wherein the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem comprise feature information representing the global importance degree of a feature vector;
and the data cleaning module is used for determining whether the knowledge point pair is dirty data or not according to the semantic matching degree.
The embodiment of the invention also provides computer equipment, which comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements any one of the data cleaning methods described above when executing the computer program, so as to solve the technical problem of low accuracy in data cleaning in the prior art.
An embodiment of the present invention further provides a computer-readable storage medium, which stores a computer program for executing any one of the data cleaning methods described above, so as to solve the technical problem of low accuracy in data cleaning in the prior art.
In the embodiment of the invention, a knowledge point pair is formed from the main knowledge point problem and each sub knowledge point problem. The context semantic feature vector sequences of the main knowledge point problem and the sub knowledge point problem in the knowledge point pair are extracted based on a bert pre-training model, so that the extraction relies on the semantic understanding capability of deep learning. An attention mechanism model then processes the two context semantic feature vector sequences and outputs a semantic matching value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem, where both feature vector sequences comprise feature information representing the global importance degree of each feature vector. Compared with the prior art, this data cleaning method detects and cleans data based on the semantic understanding capability of deep learning, which is beneficial to improving the accuracy of data cleaning, improving the processing efficiency of data detection and cleaning, and reducing the input cost of manpower and material resources.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart of a data cleansing method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a computer device according to an embodiment of the present invention;
fig. 3 is a block diagram of a data cleansing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
In an embodiment of the present invention, a data cleansing method is provided, as shown in fig. 1, the method including:
step 102: acquiring data to be cleaned, and aiming at each knowledge point, respectively forming a knowledge point pair by the main knowledge point problem and each sub knowledge point problem;
step 104: aiming at each knowledge point pair, respectively inputting a main knowledge point problem and a sub knowledge point problem into a bert pre-training model, outputting a context semantic feature vector sequence of the main knowledge point problem, and outputting a context semantic feature vector sequence of the sub knowledge point problem, wherein the bert pre-training model is obtained by training based on sample data of data to be cleaned;
step 106: aiming at each knowledge point pair, inputting a context semantic feature vector sequence of a main knowledge point problem and a context semantic feature vector sequence of a sub knowledge point problem into an attention mechanism model, and outputting a semantic matching value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem, wherein the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem comprise feature information representing the global importance degree of a feature vector;
step 108: and determining whether the knowledge point pair is dirty data or not according to the semantic matching degree.
As can be seen from the flow shown in fig. 1, in the embodiment of the invention, a knowledge point pair is formed from the main knowledge point problem and each sub knowledge point problem. The context semantic feature vector sequences of the main knowledge point problem and the sub knowledge point problem in the knowledge point pair are extracted based on a bert pre-training model, realizing extraction of the context semantic feature vector sequences based on deep learning. An attention mechanism model then processes the two context semantic feature vector sequences and outputs a semantic matching value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem, where both feature vector sequences comprise feature information representing the global importance degree of each feature vector. Compared with the prior art, this data cleaning method detects and cleans data based on the semantic understanding capability of deep learning, which is beneficial to improving the accuracy of data cleaning, improving the processing efficiency of data detection and cleaning, and reducing the input cost of manpower and material resources.
In specific implementation, the data cleaning method can be used for cleaning knowledge point data; for example, it can be used for cleaning the data of a knowledge base.
In specific implementation, the main knowledge point problem may be a question in a standard form for a certain knowledge point, and a sub knowledge point problem may be a question in a non-standard form for that knowledge point. For a certain knowledge point, there may be one main knowledge point problem A and a plurality of sub knowledge point problems a, b, c, d, etc. For example, in an insurance application scenario, the main knowledge point problem may be: "Excuse me, can I buy insurance?"; one sub knowledge point problem may be: "I want to buy insurance", and another sub knowledge point problem may be: "What classes of insurance are there?". Therefore, for one knowledge point, the main knowledge point problem and each sub knowledge point problem constitute a knowledge point pair, i.e., there may be multiple knowledge point pairs, for example (A, a), (A, b), (A, c), etc.
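For illustration, the following minimal Python sketch shows how such knowledge point pairs could be assembled; the dictionary layout (main_question / sub_questions) is an assumed representation, not part of the original disclosure.

```python
# Minimal sketch of knowledge point pair construction.
# The dictionary layout (main_question / sub_questions) is an assumption.
knowledge_base = [
    {
        "main_question": "Excuse me, can I buy insurance?",   # main problem A
        "sub_questions": [
            "I want to buy insurance",                        # sub problem a
            "What classes of insurance are there?",           # sub problem b
        ],
    },
]

def build_knowledge_point_pairs(kb):
    """Pair the main knowledge point problem with every sub knowledge point problem."""
    pairs = []
    for point in kb:
        for sub_q in point["sub_questions"]:
            pairs.append((point["main_question"], sub_q))     # (A, a), (A, b), ...
    return pairs

pairs = build_knowledge_point_pairs(knowledge_base)
print(pairs)
```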
In specific implementation, before the data cleaning method is carried out, deep learning training can be performed on a neural network or various machine learning components by using historical data of the application scene, to obtain the bert pre-training model and the attention mechanism model. This makes the data cleaning method easy to extend horizontally to new application fields: for whichever application scene the data cleaning method needs to be applied to, the bert pre-training model and the attention mechanism model are trained with the corresponding application scene data as samples. The samples comprise positive samples and negative samples. For example, a positive sample can be a knowledge point pair whose main knowledge point problem and sub knowledge point problem match, with the semantic matching value set to 1; a negative sample can be a knowledge point pair whose main knowledge point problem and sub knowledge point problem do not match, with the semantic matching value set to 0. The bert pre-training model and the attention mechanism model are obtained by repeated training on the positive samples and negative samples. A bert pre-training model and attention mechanism model with strong semantic understanding capability can be obtained by training on a small amount of sample data.
In specific implementation, the trained bert pre-training model is used as an embedding layer to perform token embedding coding on the tokens in each sentence, so that the main knowledge point problem and the sub knowledge point problem each generate a word vector sequence through the bert pre-training model. The BERT parameters are trained during the training process of the bert pre-training model; through fine-tuning, the bert pre-training model can be connected to the subsequent interactive semantic understanding model, and the word vector sequence generated by the bert pre-training model can be further trained through a bidirectional long short-term memory (Bi-LSTM) neural network.
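As a concrete illustration of this embedding step, the sketch below uses the Hugging Face transformers library as a stand-in BERT implementation; the checkpoint name bert-base-chinese and the fixed length of 32 tokens are assumptions, not values given in the patent.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the patent only specifies "a bert pre-training model".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def word_vector_sequence(question: str, max_len: int = 32) -> torch.Tensor:
    """Token-embed a question into a fixed-length word vector sequence."""
    enc = tokenizer(question, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state          # shape: (1, max_len, 768)

main_seq = word_vector_sequence("Excuse me, can I buy insurance?")
sub_seq = word_vector_sequence("I want to buy insurance")
```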
In specific implementation, the trained bert pre-training model is used to extract the context semantic feature vector sequence based on the semantic understanding ability of deep learning, for example, for each knowledge point pair, the main knowledge point problem and the sub knowledge point problem are respectively input into the bert pre-training model, the context semantic feature vector sequence of the main knowledge point problem is output, and the context semantic feature vector sequence of the sub knowledge point problem is output, which includes:
The main knowledge point problem q_i and each sub knowledge point problem q_in (the sub knowledge point problem set can be represented as Q_i = {q_i1, q_i2, ..., q_in}) are respectively input into the bert pre-training model, and the bert pre-training model executes the following steps:

outputting a fixed-length word vector sequence for the main knowledge point problem q_i, and outputting a fixed-length word vector sequence for each sub knowledge point problem q_in;

for each knowledge point pair in which the word vector sequence of the sub knowledge point problem has the same length as the word vector sequence of the main knowledge point problem, inputting the word vector sequence of the main knowledge point problem into a bidirectional long short-term memory network and outputting the context semantic feature vector sequence of the main knowledge point problem, and inputting the word vector sequence of the sub knowledge point problem into the bidirectional long short-term memory network and outputting the context semantic feature vector sequence of the sub knowledge point problem.
In specific implementation, a bidirectional long short-term memory (Bi-LSTM) unit can be adopted in the trained bert pre-training model to extract the context semantic feature vector sequences of the main knowledge point problem and the sub knowledge point problem. Specifically, in the Bi-LSTM neural network, for each time t, the outputs h_fw and h_bw of the two LSTM units, which process the forward text word vector sequence and the reversed text word vector sequence respectively, are spliced into the final feature vector output by the Bi-LSTM neural network at time t, whose dimension is 2 times the dimension of the feature vector output by a single LSTM unit:

h_t = [h_fw, h_bw]

where h_fw represents the output of the LSTM unit processing the text word vector sequence in forward order, h_bw represents the output of the LSTM unit processing the reversed text word vector sequence, and h_t represents the feature vector output by the Bi-LSTM network at time t (i.e., a feature vector in the context semantic feature vector sequence of the main knowledge point problem or the sub knowledge point problem).
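A runnable sketch of this Bi-LSTM encoder follows; PyTorch's bidirectional LSTM already splices h_fw and h_bw internally, so the output dimension is twice the per-direction hidden size, as described above. The hidden size of 128 is an arbitrary assumption.

```python
import torch
import torch.nn as nn

word_dim, hidden_dim, seq_len = 768, 128, 32   # hidden_dim is an assumed value
bilstm = nn.LSTM(input_size=word_dim, hidden_size=hidden_dim,
                 batch_first=True, bidirectional=True)

# Stand-in for a BERT word vector sequence of one question.
word_vectors = torch.randn(1, seq_len, word_dim)
context_vectors, _ = bilstm(word_vectors)

# Each time step t yields h_t = [h_fw, h_bw], so the feature dimension
# is 2 * hidden_dim, as the text describes.
print(context_vectors.shape)                   # torch.Size([1, 32, 256])
```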
In specific implementation, in order to further enhance the semantic understanding ability during the data cleaning process, in this embodiment, a trained attention mechanism model is used to calculate the semantic matching value of a knowledge point pair, for example, for each knowledge point pair, inputting the context semantic feature vector sequence of the main knowledge point problem and the context semantic feature vector sequence of the sub knowledge point problem into the attention mechanism model, and outputting the semantic matching value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem, including:
The context semantic feature vector sequence of the main knowledge point problem q_i and the context semantic feature vector sequence of the sub knowledge point problem q_in are input into the attention mechanism model, and the following steps are executed through the attention mechanism model:

calculating the attention weight a_qt of each feature vector in the context semantic feature vector sequence of the main knowledge point problem, and calculating the attention weight of each feature vector in the context semantic feature vector sequence of the sub knowledge point problem;

taking the attention weights as the feature information, weighting each feature vector in the context semantic feature vector sequence of the main knowledge point problem by its corresponding attention weight a_qt to obtain the feature vector sequence S_q of the main knowledge point problem, and weighting each feature vector in the context semantic feature vector sequence of the sub knowledge point problem by its corresponding attention weight to obtain the feature vector sequence of the sub knowledge point problem. Specifically, the attention weighting of each feature vector in the two context semantic feature vector sequences can be performed through the following formula:

s_t = a_t · h_t

where h_t is the feature vector output by the bidirectional long short-term memory network at each time t (i.e., each feature vector in the context semantic feature vector sequence), a_t (denoting collectively a_qt and the corresponding sub knowledge point weight) is the attention weight of the feature vector output at time t, and s_t is the new weighted feature vector at time t.
And outputting the semantic matching value of the knowledge point pair according to the characteristic vector sequence of the main knowledge point problem and the characteristic vector sequence of the sub knowledge point problem.
In specific implementation, the attention mechanism model executes the following steps to calculate the attention weight of each feature vector in the context semantic feature vector sequence of the main knowledge point problem, and calculate the attention weight of each feature vector in the context semantic feature vector sequence of the sub knowledge point problem:
The feature vector of the last time state in the context semantic feature vector sequence of the main knowledge point problem q_i and the feature vector of the last time state in the context semantic feature vector sequence of the sub knowledge point problem q_in are vector-spliced to obtain the background information; that is, for each knowledge point pair, the two last-time-state feature vectors are spliced into one vector, and the resulting background information encodes the semantic feature vectors of all earlier time states of the main knowledge point problem and the sub knowledge point problem.

The dimension of the background information is then reduced to half, so that it is consistent with the dimension of the context semantic feature vector sequences output by the bidirectional long short-term memory network. This can be realized through a fully connected layer of the attention mechanism model; the background information after dimension reduction is denoted bkg.

Next, the similarity values between the background information and the feature vector at each time in the context semantic feature vector sequence of the main knowledge point problem are calculated; the similarity values corresponding to all feature vectors in the context semantic feature vector sequence of the main knowledge point problem form the similarity vector Sim_q of the main knowledge point problem. Likewise, the similarity values between the background information and the feature vector at each time in the context semantic feature vector sequence of the sub knowledge point problem are calculated, and these form the similarity vector of the sub knowledge point problem. Specifically, the similarity value sim_t between the background information and the feature vector h_t output by the bidirectional long short-term memory network at each time t can be calculated by the following formula: sim_t = bkg · h_t.

The attention weight of each feature vector in the context semantic feature vector sequence of the main knowledge point problem is calculated according to the similarity vector Sim_q of the main knowledge point problem, and the attention weight of each feature vector in the context semantic feature vector sequence of the sub knowledge point problem is calculated according to the similarity vector of the sub knowledge point problem. Specifically, a softmax computation can be introduced into the attention mechanism model to complete this function: the similarity vectors are numerically converted, and while the data is normalized, the intrinsic mechanism of softmax highlights the weight of globally important text feature information. The attention weight of each feature vector in the two context semantic feature vector sequences can be calculated by the following formula:

a_t = exp(sim_t) / Σ_{k=1}^{N} exp(sim_k)

where a_t represents the attention weight of the feature vector at time t, sim_t represents the similarity value corresponding to the feature vector at time t, and N represents the total number of times, which equals the number of tokens of the text.
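The following sketch implements the four steps above (splicing the last-state vectors, halving the dimension with a fully connected layer, dot-product similarity, softmax); all tensor dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

dim, seq_len = 256, 32                                  # assumed dimensions
fc_half = torch.nn.Linear(2 * dim, dim)                 # dimension-halving layer

def attention_weights(main_ctx: torch.Tensor, sub_ctx: torch.Tensor):
    """main_ctx, sub_ctx: (seq_len, dim) context semantic feature vector sequences."""
    # 1. Splice the last-time-state feature vectors of both sequences.
    bkg = torch.cat([main_ctx[-1], sub_ctx[-1]], dim=-1)    # (2*dim,)
    # 2. Reduce the background information to half its dimension.
    bkg = fc_half(bkg)                                      # (dim,)
    # 3. Dot-product similarity sim_t = bkg . h_t for every time t.
    sim_main, sim_sub = main_ctx @ bkg, sub_ctx @ bkg       # (seq_len,) each
    # 4. Softmax turns the similarity vectors into attention weights a_t.
    return F.softmax(sim_main, dim=0), F.softmax(sim_sub, dim=0)

main_ctx, sub_ctx = torch.randn(seq_len, dim), torch.randn(seq_len, dim)
a_main, a_sub = attention_weights(main_ctx, sub_ctx)
```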
In specific implementation, the attention mechanism model outputs the semantic matching value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem by executing the following steps:
calculating the similarity value between each feature vector in the feature vector sequence of the main knowledge point problem and each feature vector in the feature vector sequence of the sub knowledge point problem, wherein the similarity values form a similarity matrix; specifically, the similarity value can be calculated by the following formula:

sim_ij = s_qi · s'_qj

where s_qi represents the i-th feature vector of the feature vector sequence S_q of the main knowledge point problem, s'_qj represents the j-th feature vector of the feature vector sequence of the sub knowledge point problem, and sim_ij represents the similarity value between the feature vectors s_qi and s'_qj; the similarity values form the similarity matrix sim.
According to the descending order of the similarity values, the feature vectors corresponding to a preset number of similarity values are taken to form the semantic matching feature vector of the knowledge point pair. Specifically, the trained attention mechanism model can complete this function through K-max pooling: the feature vectors corresponding to the K largest similarity values in the similarity matrix sim are selected to form the semantic matching feature vector, which represents the knowledge point pair for text semantic matching.
Finally, the semantic matching feature vector is input into a fully connected layer of the attention mechanism model, and the semantic matching value of the knowledge point pair is output. Specifically, the trained attention mechanism model completes this function through the fully connected layer, finally performing binary classification of text semantic matching with a softmax classifier and outputting the semantic matching value of the knowledge point pair, where different semantic matching values represent different judgment results. When the attention mechanism model is trained, a gradient descent method can be used to train the weights against the obtained prediction results (match and mismatch), and the neural network model with the best training effect is stored for the intelligent knowledge base system.
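Continuing the sketches above, a minimal version of this matching head: attention-weighted sequences, pairwise similarity matrix, K-max pooling, and a fully connected layer with softmax for the binary match/mismatch decision. The value K = 4 is an assumption; the patent only requires a preset number.

```python
import torch
import torch.nn.functional as F

K = 4                                            # assumed preset number of similarity values
fc_match = torch.nn.Linear(K, 2)                 # fully connected layer -> match / mismatch

def semantic_matching_value(main_ctx, sub_ctx, a_main, a_sub):
    s_q = a_main.unsqueeze(1) * main_ctx         # s_t = a_t * h_t (main problem sequence S_q)
    s_sub = a_sub.unsqueeze(1) * sub_ctx         # weighted sub problem sequence
    sim = s_q @ s_sub.T                          # similarity matrix, sim_ij = s_qi . s'_qj
    top_k, _ = sim.flatten().topk(K)             # K-max pooling over the similarity matrix
    probs = F.softmax(fc_match(top_k), dim=-1)   # binary text semantic matching
    return probs[1].item()                       # semantic matching degree value M in [0, 1]

M = semantic_matching_value(main_ctx, sub_ctx, a_main, a_sub)
```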
In specific implementation, the semantic matching degree value M output for a knowledge point pair may be a numerical value from 0 to 1. When the semantic matching degree M is greater than a set threshold value, the pair is considered matched; when the semantic matching degree M is smaller than the set threshold value, the pair is considered unmatched and is regarded as dirty data, which can be directly returned to the working pool, and the correct answer matching the dirty data then needs to be further audited and cleaned manually. Therefore, the risk that the accuracy rate of the intelligent question-answering system declines due to non-standard construction and arrangement of the knowledge base can be effectively avoided, so that the intelligent question-answering system can match the corresponding problems and return correct answers.
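A sketch of this decision rule follows; the threshold value of 0.5 is an assumption, as the patent only specifies "a set threshold".

```python
THRESHOLD = 0.5   # assumed value; the patent only specifies "a set threshold"

def classify_pair(matching_degree_m: float, threshold: float = THRESHOLD) -> str:
    """Route a knowledge point pair based on its semantic matching degree M."""
    if matching_degree_m > threshold:
        return "matched"    # keep the pair in the knowledge base
    return "dirty"          # return to the working pool for manual audit and cleaning

print(classify_pair(0.92))  # matched
print(classify_pair(0.13))  # dirty
```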
In specific implementation, the data cleaning method can be used for the question-answer knowledge base of an online intelligent question-answering robot or system. For example, in an insurance application scenario, an online intelligent question-answering robot or system serving a micro-insurance channel faces insurance business consultation from a large number of micro-credit users, and the question-answer knowledge base can be enriched by continuously collecting online business data. As the business volume expands, the question-answer knowledge base accumulates a huge amount of dirty data, and manual sorting alone is inefficient and ineffective. Using the data cleaning method to perform data detection and cleaning on the question-answer knowledge base, with transfer learning on the BERT pre-training model based on a small amount of data and global feature information extracted by the attention mechanism model, greatly improves the text semantic understanding capability, improves the convenience and efficiency of detecting and cleaning the question-answer knowledge base, keeps the performance of the online intelligent question-answering system optimal, and extends conveniently and horizontally within the insurance field.
In this embodiment, a computer device is provided, as shown in fig. 2, and includes a memory 202, a processor 204, and a computer program stored on the memory and executable on the processor, and the processor implements any of the data cleansing methods described above when executing the computer program.
In particular, the computer device may be a computer terminal, a server or a similar computing device.
In the present embodiment, there is provided a computer-readable storage medium storing a computer program for executing any of the data cleansing methods described above.
In particular, computer-readable storage media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer-readable storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable storage media do not include transitory computer readable media such as modulated data signals and carrier waves.
Based on the same inventive concept, the embodiment of the present invention further provides a data cleaning apparatus, as described in the following embodiments. Because the principle of solving the problems of the data cleaning device is similar to that of the data cleaning method, the implementation of the data cleaning device can refer to the implementation of the data cleaning method, and repeated parts are not described again. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 3 is a block diagram of a data cleansing apparatus according to an embodiment of the present invention, and as shown in fig. 3, the apparatus includes:
the data acquisition module 302 is configured to acquire data to be cleaned, and for each knowledge point, form a knowledge point pair from a main knowledge point problem and each sub-knowledge point problem respectively;
the vector extraction module 304 is configured to input the main knowledge point problem and the sub knowledge point problem into a bert pre-training model respectively for each knowledge point pair, output a context semantic feature vector sequence of the main knowledge point problem, and output a context semantic feature vector sequence of the sub knowledge point problem, where the bert pre-training model is obtained by training based on sample data of data to be cleaned;
a semantic matching degree calculation module 306, configured to input, for each knowledge point pair, a context semantic feature vector sequence of the main knowledge point problem and a context semantic feature vector sequence of the sub knowledge point problem into an attention mechanism model, and output a semantic matching degree value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem, where the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem include feature information indicating a global importance degree of a feature vector;
and the data cleaning module 308 is configured to determine whether the knowledge point pair is dirty data according to the semantic matching degree.
In an embodiment, the vector extraction module is specifically configured to, for each knowledge point pair, input a main knowledge point problem and a sub knowledge point problem into a bert pre-training model respectively, and execute the following steps through the bert pre-training model:
respectively inputting the main knowledge point problem and the sub knowledge point problem into a bert pre-training model, outputting a word vector sequence with fixed length of the main knowledge point problem, and outputting a word vector sequence with fixed length of each sub knowledge point problem;
aiming at the knowledge point pairs with the same length between the word vector sequence of the sub-knowledge point problem and the word vector sequence of the main knowledge point problem, the word vector sequence of the main knowledge point problem is input into a bidirectional long-short time memory network, the context semantic feature vector sequence of the main knowledge point problem is output, the word vector sequence of the sub-knowledge point problem is input into the bidirectional long-short time memory network, and the context semantic feature vector sequence of the sub-knowledge point problem is output.
In an embodiment, the semantic matching degree calculation module is specifically configured to input the context semantic feature vector sequence of the main knowledge point problem and the context semantic feature vector sequence of the sub knowledge point problem into an attention mechanism model, and execute the following steps through the attention mechanism model:
calculating the attention weight of each feature vector in the context semantic feature vector sequence of the main knowledge point problem, and calculating the attention weight of each feature vector in the context semantic feature vector sequence of the sub knowledge point problem;
taking the attention weight as the feature information, carrying out corresponding attention weight weighting on each feature vector in the context semantic feature vector sequence of the main knowledge point problem to obtain a feature vector sequence of the main knowledge point problem, and carrying out corresponding attention weight weighting on each feature vector in the context semantic feature vector sequence of the sub knowledge point problem to obtain a feature vector sequence of the sub knowledge point problem;
and outputting the semantic matching value of the knowledge point pair according to the characteristic vector sequence of the main knowledge point problem and the characteristic vector sequence of the sub knowledge point problem.
In one embodiment, the semantic matching degree calculation module calculates the attention weight of each feature vector in the context semantic feature vector sequence of the main knowledge point problem and calculates the attention weight of each feature vector in the context semantic feature vector sequence of the sub knowledge point problem by the attention mechanism model according to the following steps:
performing vector splicing on the feature vector of the last moment state in the context semantic feature vector sequence of the main knowledge point problem and the context semantic feature vector sequence of the sub knowledge point problem to obtain background information;
reducing the dimension of the background information to half;
calculating the similarity value between the background information and the feature vector of each moment in the context semantic feature vector sequence of the main knowledge point problem, wherein the similarity value corresponding to each feature vector in the context semantic feature vector sequence of the main knowledge point problem forms the similarity vector of the main knowledge point problem, calculating the similarity value between the background information and the feature vector of each moment in the context semantic feature vector sequence of the sub knowledge point problem, and the similarity value corresponding to each feature vector in the context semantic feature vector sequence of the sub knowledge point problem forms the similarity vector of the sub knowledge point problem;
and calculating the attention weight of each feature vector in the context semantic feature vector sequence of the main knowledge point problem according to the similarity vector of the main knowledge point problem, and calculating the attention weight of each feature vector in the context semantic feature vector sequence of the sub knowledge point problem according to the similarity vector of the sub knowledge point problem.
In one embodiment, the semantic matching degree calculation module calculates the similarity value between the background information and the feature vector at each moment by using the following formula through the attention mechanism model:
sim_t = bkg · h_t

where bkg represents the background information, h_t represents the feature vector at time t, and sim_t represents the similarity value between the background information and the feature vector h_t at time t.
In one embodiment, the semantic matching degree calculating module calculates the attention weight of the feature vector by the following formula:
a_t = exp(sim_t) / Σ_{k=1}^{N} exp(sim_k)

where a_t represents the attention weight of the feature vector at time t, sim_t represents the similarity value corresponding to the feature vector at time t, and N represents the total number of times.
In one embodiment, the semantic matching degree calculating module is further configured to calculate a similarity value between each feature vector in the feature vector sequence of the main knowledge point problem and each feature vector in the feature vector sequence of the sub knowledge point problem, and each similarity value constitutes a similarity matrix;
according to the descending order of the similarity values, feature vectors corresponding to a plurality of preset similarity values are taken to form semantic matching feature vectors of the knowledge point pair;
and inputting the semantic matching feature vector into a full connection layer of the attention mechanism model, and outputting a semantic matching value of the knowledge point pair.
The embodiment of the invention achieves the following technical effects: a knowledge point pair is formed from the main knowledge point problem and each sub knowledge point problem; the context semantic feature vector sequences of the main knowledge point problem and the sub knowledge point problem in the knowledge point pair are extracted based on a bert pre-training model, realizing extraction of the context semantic feature vector sequences based on the semantic understanding capability of deep learning; an attention mechanism model then processes the two context semantic feature vector sequences and outputs a semantic matching value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem, where both feature vector sequences comprise feature information representing the global importance degree of each feature vector. Compared with the prior art, this data cleaning method detects and cleans data based on the semantic understanding capability of deep learning, which is beneficial to improving the accuracy of data cleaning, improving the processing efficiency of data detection and cleaning, and reducing the input cost of manpower and material resources.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented by a general purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described herein. They may also be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A method for data cleansing, comprising:
acquiring data to be cleaned, and aiming at each knowledge point, respectively forming a knowledge point pair by the main knowledge point problem and each sub knowledge point problem;
aiming at each knowledge point pair, respectively inputting a main knowledge point problem and a sub knowledge point problem into a bert pre-training model, outputting a context semantic feature vector sequence of the main knowledge point problem, and outputting a context semantic feature vector sequence of the sub knowledge point problem, wherein the bert pre-training model is obtained by training based on sample data of data to be cleaned;
aiming at each knowledge point pair, inputting a context semantic feature vector sequence of a main knowledge point problem and a context semantic feature vector sequence of a sub knowledge point problem into an attention mechanism model, and outputting a semantic matching value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem, wherein the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem comprise feature information representing the global importance degree of a feature vector;
and determining whether the knowledge point pair is dirty data or not according to the semantic matching degree.
2. The data cleaning method of claim 1, wherein for each knowledge point pair, inputting a principal knowledge point problem and a sub-knowledge point problem into a bert pre-training model, respectively, outputting a context semantic feature vector sequence of the principal knowledge point problem, and outputting a context semantic feature vector sequence of the sub-knowledge point problem, comprises:
respectively inputting the main knowledge point problem and the sub knowledge point problem into a bert pre-training model, outputting a word vector sequence with fixed length of the main knowledge point problem, and outputting a word vector sequence with fixed length of each sub knowledge point problem;
aiming at the knowledge point pairs with the same length between the word vector sequence of the sub-knowledge point problem and the word vector sequence of the main knowledge point problem, the word vector sequence of the main knowledge point problem is input into a bidirectional long-short time memory network, the context semantic feature vector sequence of the main knowledge point problem is output, the word vector sequence of the sub-knowledge point problem is input into the bidirectional long-short time memory network, and the context semantic feature vector sequence of the sub-knowledge point problem is output.
3. The data cleaning method of claim 1 or 2, wherein for each knowledge point pair, inputting the context semantic feature vector sequence of the main knowledge point problem and the context semantic feature vector sequence of the sub knowledge point problem into an attention mechanism model, and outputting the semantic matching value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem, comprises:
inputting the context semantic feature vector sequence of the main knowledge point problem and the context semantic feature vector sequence of the sub knowledge point problem into an attention mechanism model, and executing the following steps through the attention mechanism model:
calculating the attention weight of each feature vector in the context semantic feature vector sequence of the main knowledge point problem, and calculating the attention weight of each feature vector in the context semantic feature vector sequence of the sub knowledge point problem;
taking the attention weight as the feature information, carrying out corresponding attention weight weighting on each feature vector in the context semantic feature vector sequence of the main knowledge point problem to obtain a feature vector sequence of the main knowledge point problem, and carrying out corresponding attention weight weighting on each feature vector in the context semantic feature vector sequence of the sub knowledge point problem to obtain a feature vector sequence of the sub knowledge point problem;
and outputting the semantic matching value of the knowledge point pair according to the characteristic vector sequence of the main knowledge point problem and the characteristic vector sequence of the sub knowledge point problem.
4. The data cleaning method of claim 3, wherein calculating the attention weight of each feature vector in the context semantic feature vector sequence of the main knowledge point problem and calculating the attention weight of each feature vector in the context semantic feature vector sequence of the sub knowledge point problem comprises:
performing vector splicing on the feature vector of the last moment state in the context semantic feature vector sequence of the main knowledge point problem and the context semantic feature vector sequence of the sub knowledge point problem to obtain background information;
reducing the dimension of the background information to half;
calculating the similarity value between the background information and the feature vector of each moment in the context semantic feature vector sequence of the main knowledge point problem, wherein the similarity value corresponding to each feature vector in the context semantic feature vector sequence of the main knowledge point problem forms the similarity vector of the main knowledge point problem, calculating the similarity value between the background information and the feature vector of each moment in the context semantic feature vector sequence of the sub knowledge point problem, and the similarity value corresponding to each feature vector in the context semantic feature vector sequence of the sub knowledge point problem forms the similarity vector of the sub knowledge point problem;
and calculating the attention weight of each feature vector in the context semantic feature vector sequence of the main knowledge point problem according to the similarity vector of the main knowledge point problem, and calculating the attention weight of each feature vector in the context semantic feature vector sequence of the sub knowledge point problem according to the similarity vector of the sub knowledge point problem.
5. The data cleansing method according to claim 4, wherein the similarity value between the background information and the feature vector at each time is calculated by the following formula:
sim_t = bkg · h_t

wherein bkg represents the background information, h_t represents the feature vector at time t, and sim_t represents the similarity value between the background information and the feature vector h_t at time t.
6. The data cleaning method according to claim 4, wherein, for each feature vector in the context semantic feature vector sequence of the main knowledge point problem and each feature vector in the context semantic feature vector sequence of the sub knowledge point problem, the attention weight of the feature vector is calculated by the following formula:

a_t = exp(sim_t) / Σ_{i=1}^{N} exp(sim_i)

wherein a_t represents the attention weight of the feature vector at moment t, sim_t represents the similarity value corresponding to the feature vector at moment t, and N represents the total number of moments.
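Assuming the weights are indeed the softmax of the similarity values, as reconstructed above (the published text omits the formula image), the computation is a few lines of Python; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the patent specifies:

```python
import numpy as np

def attention_weights(sim):
    """a_t = exp(sim_t) / sum_i exp(sim_i), computed stably."""
    e = np.exp(sim - sim.max())
    return e / e.sum()

sim = np.array([0.8, 2.1, -0.3, 1.4])
a = attention_weights(sim)
assert np.isclose(a.sum(), 1.0)  # the N weights form a distribution
```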
7. The data cleaning method of claim 3, wherein outputting, for each knowledge point pair, the semantic matching value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem comprises:
calculating the similarity value between each feature vector in the feature vector sequence of the main knowledge point problem and each feature vector in the feature vector sequence of the sub knowledge point problem, the similarity values forming a similarity matrix;
taking, in descending order of similarity value, the feature vectors corresponding to a preset number of the largest similarity values to form the semantic matching feature vector of the knowledge point pair;
and inputting the semantic matching feature vector into a full connection layer of the attention mechanism model, and outputting a semantic matching value of the knowledge point pair.
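A sketch of claim 7 under stated assumptions: raw dot products for the pairwise similarities, k for the preset number of similarity values, spliced vector pairs as the semantic matching feature vector, and a randomly initialised fully connected layer with a sigmoid standing in for the trained layer of the attention mechanism model.

```python
import numpy as np

rng = np.random.default_rng(3)
F_main = rng.normal(size=(12, 768))  # weighted feature vector sequence, main
F_sub = rng.normal(size=(9, 768))    # weighted feature vector sequence, sub

S = F_main @ F_sub.T  # similarity matrix, shape (12, 9)

# Take the feature vector pairs behind the k largest similarity values.
k = 4
flat = np.argsort(S, axis=None)[::-1][:k]  # flat indices, descending
rows, cols = np.unravel_index(flat, S.shape)
match_vec = np.concatenate([np.concatenate([F_main[r], F_sub[c]])
                            for r, c in zip(rows, cols)])  # (k * 1536,)

# Full connection layer -> scalar semantic matching value; random weights
# and a sigmoid stand in for the trained layer.
W_fc = rng.normal(size=match_vec.shape[0]) * 0.01
score = 1.0 / (1.0 + np.exp(-(W_fc @ match_vec)))
```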
8. A data cleaning apparatus, comprising:
the data acquisition module is used for acquiring data to be cleaned and respectively forming knowledge point pairs by the main knowledge point problem and each sub knowledge point problem aiming at each knowledge point;
the vector extraction module is used for respectively inputting the main knowledge point problem and the sub knowledge point problem into a bert pre-training model aiming at each knowledge point pair, outputting a context semantic feature vector sequence of the main knowledge point problem and outputting a context semantic feature vector sequence of the sub knowledge point problem, wherein the bert pre-training model is obtained by training based on sample data of data to be cleaned;
the semantic matching degree calculation module is used for inputting the context semantic feature vector sequence of the main knowledge point problem and the context semantic feature vector sequence of the sub knowledge point problem into an attention mechanism model aiming at each knowledge point pair, and outputting a semantic matching degree value of the knowledge point pair according to the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem, wherein the feature vector sequence of the main knowledge point problem and the feature vector sequence of the sub knowledge point problem comprise feature information representing the global importance degree of a feature vector;
and the data cleaning module is used for determining whether the knowledge point pair is dirty data or not according to the semantic matching degree.
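To show how the four modules of claim 8 fit together, here is a toy end-to-end pipeline; every class body is an assumption made for illustration (random vectors instead of bert pre-training model outputs, a sigmoid of a dot product instead of the trained attention mechanism model), not the patent's implementation:

```python
import math
import numpy as np

class VectorExtractionModule:
    """Stand-in for the bert pre-training model: random feature sequences."""
    def extract(self, main_q, sub_q):
        rng = np.random.default_rng(len(main_q) + len(sub_q))
        return rng.normal(size=(12, 768)), rng.normal(size=(9, 768))

class SemanticMatchModule:
    """Toy matching model: sigmoid of a scaled last-state dot product."""
    def match(self, H_main, H_sub):
        sim = float(H_main[-1] @ H_sub[-1]) / H_main.shape[1]
        return 1.0 / (1.0 + math.exp(-sim))

class DataCleaningModule:
    """Flags a knowledge point pair as dirty data below a match threshold."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
    def is_dirty(self, score):
        return score < self.threshold

def clean(knowledge_points):
    extractor = VectorExtractionModule()
    matcher = SemanticMatchModule()
    cleaner = DataCleaningModule()
    dirty = []
    for main_q, sub_qs in knowledge_points.items():
        for sub_q in sub_qs:  # one knowledge point pair per sub question
            H_m, H_s = extractor.extract(main_q, sub_q)
            if cleaner.is_dirty(matcher.match(H_m, H_s)):
                dirty.append((main_q, sub_q))
    return dirty

print(clean({"How do I reset my password?": ["Reset password steps",
                                             "What is the weather today?"]}))
```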
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the data cleaning method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed, performs the data cleaning method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010016777.2A CN111241258A (en) | 2020-01-08 | 2020-01-08 | Data cleaning method and device, computer equipment and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111241258A true CN111241258A (en) | 2020-06-05 |
Family
ID=70864833
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111241258A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH04160536A (en) * | 1990-10-25 | 1992-06-03 | Toshiba Corp | Knowledge correcting device |
CN108846077A (en) * | 2018-06-08 | 2018-11-20 | 泰康保险集团股份有限公司 | Semantic matching method, device, medium and the electronic equipment of question and answer text |
CN109472305A (en) * | 2018-10-31 | 2019-03-15 | 国信优易数据有限公司 | Answer quality determines model training method, answer quality determination method and device |
CN109726396A (en) * | 2018-12-20 | 2019-05-07 | 泰康保险集团股份有限公司 | Semantic matching method, device, medium and the electronic equipment of question and answer text |
CN110532399A (en) * | 2019-08-07 | 2019-12-03 | 广州多益网络股份有限公司 | Knowledge mapping update method, system and the device of object game question answering system |
CN110532400A (en) * | 2019-09-04 | 2019-12-03 | 江苏苏宁银行股份有限公司 | Knowledge base maintenance method and device based on text classification prediction |
Non-Patent Citations (1)
Title |
---|
Zhang Xinhua: "Research on Knowledge Similarity and Conflict Detection Based on Semantic Reasoning" (基于语义推理的知识相似性与冲突检测研究), 《信息科技辑》 (Information Science and Technology Series) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113064887A (en) * | 2021-03-22 | 2021-07-02 | 平安银行股份有限公司 | Data management method, device, equipment and storage medium |
CN113064887B (en) * | 2021-03-22 | 2023-12-08 | 平安银行股份有限公司 | Data management method, device, equipment and storage medium |
CN113138982A (en) * | 2021-05-25 | 2021-07-20 | 黄柱挺 | Big data cleaning method |
CN116303406A (en) * | 2023-05-16 | 2023-06-23 | 河北中废通网络技术有限公司 | Method and device for cleaning junk data, electronic equipment and storage medium |
CN116303406B (en) * | 2023-05-16 | 2023-08-04 | 河北中废通网络技术有限公司 | Method and device for cleaning junk data, electronic equipment and storage medium |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200605 |