CN112541054B

CN112541054B - Knowledge base question and answer management method, device, equipment and storage medium

Info

Publication number: CN112541054B
Application number: CN202011479831.3A
Authority: CN
Inventors: 李骁; 赖众程; 高洪喜; 倪佳; 许海金; 李筱艺; 何凤连; 林志超; 高静; 李会璟; 史文鑫; 张舒婷; 李林毅
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-12-15
Filing date: 2020-12-15
Publication date: 2023-08-29
Anticipated expiration: 2040-12-15
Also published as: CN112541054A

Abstract

The application relates to the technical field of artificial intelligence and discloses a method, a device, equipment and a storage medium for managing standard question-answer pairs of a knowledge base. The method comprises the following steps: performing entity recognition, entity data deduplication and entity data alignment on a plurality of question-answer pairs to be treated to obtain an entity-aligned question-answer pair set; performing similarity judgment on the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set and a dissimilar question-answer pair set, and updating the dissimilar question-answer pair set into a target knowledge base; and performing attribute deduplication and attribute value deduplication on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set, and updating the deduplicated question-answer pair set into the target knowledge base. The quality of the knowledge base is improved, continuous manual participation in the treatment process is not needed, and treatment efficiency is improved.

Description

Knowledge base question and answer management method, device, equipment and storage medium

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a method, a device, equipment and a storage medium for managing standard question-answer pairs of a knowledge base.

Background

Standard questions and standard answers in a knowledge base typically appear in pairs, commonly referred to as standard question-answer pairs. Question-answer pairs are the core structure of a knowledge base's knowledge system, and the number of standard questions directly influences the number of similar questions and thereby the capability of a question-answering robot. Prior-art knowledge bases have the following problems: (1) knowledge is repeated among standard questions; (2) one standard question has multiple intents or no intent; (3) the number of intents of a standard question is not equal to the number of values of the standard answer, so that answers are incomplete or excessive.

Disclosure of Invention

The main purpose of the application is to provide a method, a device, equipment and a storage medium for managing standard question-answer pairs of a knowledge base, so as to solve the technical problems of prior-art knowledge bases: knowledge is repeated among standard questions, one standard question has multiple intents or no intent, and the number of intents of a standard question is not equal to the number of values of the standard answer, so that answers are incomplete or excessive.

In order to achieve the above object, the present application provides a method for managing questions and answers in a knowledge base, the method comprising:

obtaining a plurality of question-answer pairs to be treated, wherein each question-answer pair to be treated comprises: standard-question text data to be treated and standard-answer text data to be treated;

inputting the plurality of question-answer pairs to be treated into an entity recognition model for entity recognition to obtain an entity data set to be deduplicated corresponding to the plurality of question-answer pairs to be treated, wherein the entity recognition model is a model obtained by training based on a pre-training model bert_this and a CRF network;

performing entity data deduplication processing on the entity data set to be deduplicated to obtain entity data sets after deduplication corresponding to the plurality of question-answer pairs to be treated;

performing entity data alignment processing according to the entity data set subjected to duplication removal and the plurality of question and answer pairs to be treated to obtain a question and answer pair set subjected to entity alignment;

performing similarity judgment on the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set and a dissimilar question-answer pair set, and updating the dissimilar question-answer pair set into a target knowledge base;

and performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set, and updating the deduplicated question-answer pair set into the target knowledge base.

Further, before the step of inputting the plurality of question-answer pairs to be treated into the entity recognition model for entity recognition to obtain the entity data set to be deduplicated corresponding to the plurality of question-answer pairs to be treated, the method further includes:

obtaining a plurality of training samples, the training samples comprising: text sample data to be trained and text sample calibration data;

dividing the training samples according to a preset dividing rule to obtain a training set and a verification set;

training a first model to be trained by adopting the training set, determining the first model to be trained after training is finished as a first model to be verified, wherein the first model to be trained is a model obtained based on the pre-training model bert_this and the CRF network;

and verifying the first model to be verified by adopting the verification set, and determining the first model to be verified as the entity identification model when verification is successful.

Further, the step of performing entity data alignment processing according to the entity data set after duplication removal and the plurality of question and answer pairs to be treated to obtain a question and answer pair set after entity alignment includes:

performing pairwise similarity calculation on all entity data in the de-duplicated entity data set by adopting a minimum editing distance calculation method to obtain entity similarity matrixes corresponding to the multiple question and answer pairs to be treated;

extracting entity similarities column by column from the entity similarity matrix corresponding to the plurality of question-answer pairs to be treated to obtain a plurality of entity similarity sets to be optimized;

acquiring a first similarity threshold, and acquiring entity similarity larger than the first similarity threshold from each entity similarity set to be optimized respectively to obtain entity similarity sets to be aligned corresponding to the entity similarity sets to be optimized respectively;

obtaining entity data sets to be screened corresponding to a plurality of entity similarity sets to be aligned according to the entity similarity sets to be aligned corresponding to each entity similarity set to be optimized and the entity data sets after de-duplication respectively;

obtaining, from the entity data set to be screened corresponding to each entity similarity set to be aligned, the entity data with the most characters, to obtain the optimal entity data corresponding to each entity similarity set to be aligned;

and replacing the corresponding entity data in the plurality of question-answer pairs to be treated with the optimal entity data corresponding to each entity similarity set to be aligned, to obtain the entity-aligned question-answer pair set.
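
The alignment steps above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the normalization of edit distance into a similarity score (1 minus distance divided by the longer length) and the function names `edit_distance`, `entity_similarity` and `align_entities` are assumptions introduced here, since the text only names the minimum edit distance method, a first similarity threshold and the rule of keeping the entity with the most characters.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Minimum edit (Levenshtein) distance computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def entity_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def align_entities(entities: List[str], first_threshold: float) -> Dict[str, str]:
    """Map each deduplicated entity to the longest entity similar enough to it."""
    # Pairwise entity similarity matrix over the deduplicated entity set.
    matrix = [[entity_similarity(a, b) for b in entities] for a in entities]
    mapping: Dict[str, str] = {}
    for col, entity in enumerate(entities):
        # Read one column, keep entries above the threshold, pick the longest entity.
        candidates = [
            entities[row]
            for row in range(len(entities))
            if matrix[row][col] > first_threshold
        ]
        mapping[entity] = max(candidates, key=len) if candidates else entity
    return mapping
```

The resulting mapping would then be used to rewrite entity mentions in the question and answer texts; that replacement step is left out here because the text does not say how overlapping mentions are handled.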

Further, the step of performing similarity judgment on the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set and a dissimilar question-answer pair set includes:

dividing the entity-aligned question-answer pair set by text category and entity data to obtain a plurality of same-entity subsets of standard-question text data and a plurality of same-entity subsets of standard-answer text data;

performing pairwise similarity calculation, using a cosine similarity calculation method, on the entity-aligned standard-question text data in each same-entity subset of standard-question text data, to obtain a question similarity matrix corresponding to each same-entity subset of standard-question text data;

performing pairwise similarity calculation, using a cosine similarity calculation method, on the entity-aligned standard-answer text data in each same-entity subset of standard-answer text data, to obtain an answer similarity matrix corresponding to each same-entity subset of standard-answer text data;

and performing similarity judgment according to the entity-aligned question-answer pair set, the question similarity matrix and the answer similarity matrix to obtain the suspected similar question-answer pair set, the incompletely similar question-answer pair set and the dissimilar question-answer pair set.
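
As a rough illustration of the pairwise cosine similarity step, the sketch below builds a similarity matrix for the texts in one same-entity subset. The text does not say how texts are turned into vectors, so the character-count vectors used here are purely an assumption, as are the names `cosine_similarity` and `similarity_matrix`; a real system might use embeddings instead.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between character-count vectors (assumed vectorization)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(count * vb[ch] for ch, count in va.items())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is treated as similar to nothing.
    return dot / (norm_a * norm_b)


def similarity_matrix(texts: List[str]) -> List[List[float]]:
    """Pairwise similarity matrix for the texts of one same-entity subset."""
    return [[cosine_similarity(a, b) for b in texts] for a in texts]
```

The same function would be applied once to the question texts and once to the answer texts of each subset, giving the question similarity matrix and the answer similarity matrix.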

Further, the step of performing similarity judgment according to the entity-aligned question-answer pair set, the question similarity matrix and the answer similarity matrix to obtain the suspected similar question-answer pair set, the incompletely similar question-answer pair set and the dissimilar question-answer pair set includes:

extracting question similarities column by column from the question similarity matrix to obtain a plurality of question similarity sets to be optimized;

extracting answer similarities column by column from the answer similarity matrix to obtain a plurality of answer similarity sets to be optimized;

acquiring a second similarity threshold;

extracting the question similarity to be optimized and the answer similarity to be optimized corresponding to each entity-aligned question answer pair of the entity-aligned question answer pair set from the multiple question similarity sets to be optimized and the multiple answer similarity sets to be optimized respectively;

when the question similarity to be optimized and the answer similarity to be optimized corresponding to an entity-aligned question-answer pair in the entity-aligned question-answer pair set are both greater than the second similarity threshold, determining the suspected similar question-answer pair set according to the entity-aligned question-answer pair;

when either the question similarity to be optimized or the answer similarity to be optimized corresponding to an entity-aligned question-answer pair in the entity-aligned question-answer pair set is less than or equal to the second similarity threshold, determining a question-answer pair set to be distinguished according to the entity-aligned question-answer pair;

when, for an entity-aligned question-answer pair in the question-answer pair set to be distinguished, the number of attributes of the standard-question text data and the number of attributes of the standard-answer text data are both equal to 1, determining the dissimilar question-answer pair set according to the entity-aligned question-answer pair;

and when, for an entity-aligned question-answer pair in the question-answer pair set to be distinguished, either the number of attributes of the standard-question text data or the number of attributes of the standard-answer text data is not equal to 1, determining the incompletely similar question-answer pair set according to the entity-aligned question-answer pair.
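
The branching described above can be summarized as a small decision function. This is a sketch under assumptions: the `PairStats` container, the `Bucket` enum and the function name `classify` are hypothetical, and the similarity values and attribute counts are taken as already computed for each entity-aligned pair.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    SUSPECTED_SIMILAR = "suspected similar"
    INCOMPLETELY_SIMILAR = "incompletely similar"
    DISSIMILAR = "dissimilar"


@dataclass
class PairStats:
    """Precomputed values for one entity-aligned question-answer pair (hypothetical container)."""
    question_similarity: float
    answer_similarity: float
    question_attribute_count: int
    answer_attribute_count: int


def classify(stats: PairStats, second_threshold: float) -> Bucket:
    """Assign a pair to one of the three sets using the second similarity threshold."""
    if (stats.question_similarity > second_threshold
            and stats.answer_similarity > second_threshold):
        return Bucket.SUSPECTED_SIMILAR
    # Otherwise the pair is "to be distinguished": attribute counts decide the set.
    if stats.question_attribute_count == 1 and stats.answer_attribute_count == 1:
        return Bucket.DISSIMILAR
    return Bucket.INCOMPLETELY_SIMILAR
```

Only pairs classified as dissimilar would go straight into the target knowledge base; the other two sets continue to attribute and attribute value deduplication.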

Further, the step of performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set includes:

searching and deleting the question text data with the attribute number equal to 0 for the suspected similar question answer pair set and the incomplete similar question answer pair set respectively to obtain an optimized suspected similar question answer pair set and an optimized incomplete similar question answer pair set;

sequentially extracting a question and answer pair from the optimized suspected similar question and answer pair set and the optimized incompletely similar question and answer pair set respectively to obtain a question and answer pair to be de-duplicated;

when the number of attributes of the standard-question text data in the question-answer pair to be deduplicated is equal to 1, taking the question-answer pair to be deduplicated as a deduplicated question-answer pair;

when the number of attributes of the standard-question text data in the question-answer pair to be deduplicated is greater than 1, or the number of attribute values of the standard-answer text data in the question-answer pair to be deduplicated is greater than 1, performing attribute separation and attribute value separation on the question-answer pair to be deduplicated to obtain a plurality of single-attribute-value question text data and a plurality of single-attribute-value answer text data;

the principle of retaining the longest characters is adopted to carry out identical attribute duplication elimination processing and identical attribute value duplication elimination processing on the multiple single-attribute value question text data and the multiple single-attribute value answer text data, so as to obtain a single-attribute value question text data set after duplication elimination processing and a single-attribute value answer text data set after duplication elimination processing;

pairing each single-attribute-value question text data in the deduplicated single-attribute-value question text data set with each single-attribute-value answer text data in the deduplicated single-attribute-value answer text data set to obtain a plurality of deduplicated question-answer pairs;

and determining the deduplicated question-answer pair set according to all the deduplicated question-answer pairs.
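
A minimal sketch of the "retain the longest characters" principle follows. It assumes each single-attribute-value text has already been separated and tagged with a key (the attribute it concerns); how attributes are extracted is not specified in the text. The helper names `keep_longest` and `rebuild_pairs`, and the rule of pairing a question with the answer that shares its key, are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Tagged = Tuple[str, str]  # (attribute key, single-attribute-value text)


def keep_longest(items: List[Tagged]) -> Dict[str, str]:
    """Among texts sharing a key, keep only the one with the most characters."""
    kept: Dict[str, str] = {}
    for key, text in items:
        if key not in kept or len(text) > len(kept[key]):
            kept[key] = text
    return kept


def rebuild_pairs(questions: List[Tagged], answers: List[Tagged]) -> List[Tuple[str, str]]:
    """Deduplicate both sides, then pair question and answer on a shared key (assumed rule)."""
    kept_questions = keep_longest(questions)
    kept_answers = keep_longest(answers)
    return [
        (kept_questions[key], kept_answers[key])
        for key in kept_questions
        if key in kept_answers
    ]
```

For example, two question texts tagged with the same attribute collapse into the longer one, and each surviving question is emitted with the surviving answer for that attribute.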

Further, the step of searching and deleting the question text data with the attribute number equal to 0 for the suspected similar question answer pair set and the incomplete similar question answer pair set to obtain an optimized suspected similar question answer pair set and an optimized incomplete similar question answer pair set respectively includes:

finding out the question text data with the attribute number equal to 0 from all the question text data in the suspected similar question answer pair set to obtain first question text data to be deleted;

deleting the first to-be-deleted question text data from the suspected similar question-answer pair set to obtain the optimized suspected similar question-answer pair set;

finding the standard-answer text data with the number of attributes equal to 0 from all the standard-answer text data in the incompletely similar question-answer pair set to obtain first answer text data to be deleted;

and deleting the first answer text data to be deleted from the incompletely similar question-answer pair set to obtain the optimized incompletely similar question-answer pair set.
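
The zero-attribute clean-up can be pictured as a simple filter. In this sketch a pair is a `(question, answer)` tuple and `count_attributes` is a hypothetical caller-supplied function, because the text does not specify how the number of attributes of a text is determined. Dropping the whole pair when the inspected text has no attributes is also an assumption.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (standard-question text, standard-answer text)
QUESTION, ANSWER = 0, 1


def drop_zero_attribute(
    pairs: List[Pair],
    count_attributes: Callable[[str], int],
    field: int,
) -> List[Pair]:
    """Keep only pairs whose question (field=0) or answer (field=1) has at least one attribute."""
    return [pair for pair in pairs if count_attributes(pair[field]) != 0]
```

Following the steps above, the suspected similar set would be filtered on the question side (`field=QUESTION`) and the incompletely similar set on the answer side (`field=ANSWER`).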

The application also provides a treatment device for the question and answer of the knowledge base, which comprises:

the data acquisition module is used for obtaining a plurality of question-answer pairs to be treated, wherein each question-answer pair to be treated comprises: standard-question text data to be treated and standard-answer text data to be treated;

the entity recognition module is used for inputting the plurality of question-answer pairs to be treated into an entity recognition model for entity recognition to obtain an entity data set to be deduplicated corresponding to the plurality of question-answer pairs to be treated, wherein the entity recognition model is a model obtained by training based on the pre-training model bert_this and a CRF network;

the entity data deduplication processing module is used for performing entity data deduplication processing on the entity data set to be deduplicated to obtain a deduplicated entity data set corresponding to the plurality of question-answer pairs to be treated;

the entity data alignment processing module is used for carrying out entity data alignment processing according to the entity data set subjected to duplication removal and the plurality of question and answer pairs to be treated to obtain a question and answer pair set subjected to entity alignment;

the similarity judging module is used for performing similarity judgment on the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set and a dissimilar question-answer pair set, and updating the dissimilar question-answer pair set into a target knowledge base;

and the attribute deduplication and attribute value deduplication processing module is used for performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set, and updating the deduplicated question-answer pair set into the target knowledge base.

The application also proposes a computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.

The application also proposes a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method of any of the above.

According to the method, device, equipment and storage medium for managing standard question-answer pairs of a knowledge base, the plurality of question-answer pairs to be treated are input into the entity recognition model for entity recognition to obtain the entity data set to be deduplicated corresponding to the plurality of question-answer pairs to be treated; entity data deduplication processing is performed on the entity data set to be deduplicated to obtain the deduplicated entity data set; entity data alignment processing is performed according to the deduplicated entity data set and the plurality of question-answer pairs to be treated to obtain the entity-aligned question-answer pair set; similarity judgment is performed on the entity-aligned question-answer pair set to obtain the suspected similar question-answer pair set, the incompletely similar question-answer pair set and the dissimilar question-answer pair set, and the dissimilar question-answer pair set is updated into the target knowledge base; and attribute deduplication processing and attribute value deduplication processing are performed on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain the deduplicated question-answer pair set, which is updated into the target knowledge base. The target knowledge base thereby avoids the prior-art problems of repeated knowledge among standard questions, a standard question having multiple intents or no intent, and incomplete or excessive answers caused by the number of intents of a standard question not being equal to the number of values of the standard answer. The quality of the knowledge base is improved, continuous manual participation in the treatment process is not needed, and treatment efficiency is improved.

Drawings

FIG. 1 is a flow chart of a method for managing questions and answers in a knowledge base according to an embodiment of the application;

FIG. 2 is a schematic block diagram of a treatment device for question and answer in a knowledge base according to an embodiment of the present application;

FIG. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.

The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

Prior-art knowledge bases suffer from repeated knowledge among standard questions, standard questions having multiple intents or no intent, and incomplete or excessive answers caused by the number of intents of a standard question not being equal to the number of values of the standard answer. To solve these technical problems, the application provides a method for managing standard question-answer pairs of a knowledge base, applied in the technical field of artificial intelligence. In the method, entity recognition, entity data deduplication and entity data alignment are performed first, and similarity judgment is then performed to obtain suspected similar, incompletely similar and dissimilar question-answer pairs. Attribute deduplication and attribute value deduplication are performed on the suspected similar and incompletely similar question-answer pairs, and both the dissimilar question-answer pairs and the question-answer pairs obtained after attribute deduplication and attribute value deduplication are updated into the target knowledge base. The target knowledge base thereby avoids the prior-art problems listed above, the quality of the knowledge base is improved, continuous manual participation in the treatment process is not needed, and treatment efficiency is improved.

Referring to FIG. 1, an embodiment of the present application provides a method for managing standard question-answer pairs of a knowledge base, where the method includes:

S1: obtaining a plurality of question-answer pairs to be treated, wherein each question-answer pair to be treated comprises: standard-question text data to be treated and standard-answer text data to be treated;

S2: inputting the plurality of question-answer pairs to be treated into an entity recognition model for entity recognition to obtain an entity data set to be deduplicated corresponding to the plurality of question-answer pairs to be treated, wherein the entity recognition model is a model obtained by training based on a pre-training model bert_this and a CRF network;

S3: performing entity data deduplication processing on the entity data set to be deduplicated to obtain a deduplicated entity data set corresponding to the plurality of question-answer pairs to be treated;

S4: performing entity data alignment processing according to the deduplicated entity data set and the plurality of question-answer pairs to be treated to obtain an entity-aligned question-answer pair set;

S5: performing similarity judgment on the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set and a dissimilar question-answer pair set, and updating the dissimilar question-answer pair set into a target knowledge base;

S6: performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set, and updating the deduplicated question-answer pair set into the target knowledge base.

In this embodiment, the plurality of question-answer pairs to be treated are input into the entity recognition model for entity recognition to obtain the entity data set to be deduplicated; entity data deduplication processing is performed on the entity data set to be deduplicated to obtain the deduplicated entity data set; entity data alignment processing is performed according to the deduplicated entity data set and the plurality of question-answer pairs to be treated to obtain the entity-aligned question-answer pair set; similarity judgment is performed on the entity-aligned question-answer pair set to obtain the suspected similar question-answer pair set, the incompletely similar question-answer pair set and the dissimilar question-answer pair set, and the dissimilar question-answer pair set is updated into the target knowledge base; and attribute deduplication processing and attribute value deduplication processing are performed on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain the deduplicated question-answer pair set, which is updated into the target knowledge base. The target knowledge base thereby avoids the prior-art problems of repeated knowledge among standard questions, a standard question having multiple intents or no intent, and incomplete or excessive answers caused by the number of intents of a standard question not being equal to the number of values of the standard answer. The quality of the knowledge base is improved, continuous manual participation in the treatment process is not needed, and treatment efficiency is improved.

For S1, the plurality of question-answer pairs to be treated may be obtained from a database, input by a user, or sent by a third-party application system.

The question-answer pairs to be treated are the standard question-answer pairs that need treatment in order to avoid repeated knowledge among standard questions, a standard question having multiple intents or no intent, and incomplete or excessive answers caused by the number of intents of a standard question not being equal to the number of values of the standard answer.

The standard-question text data to be treated is the standard-question text data that needs such treatment. Standard-question text data is text data describing a question.

The standard-answer text data to be treated is the standard-answer text data that needs such treatment. Standard-answer text data is text data describing an answer.

Within the same question-answer pair to be treated, the standard-answer text data to be treated is the answer to the standard-question text data to be treated.

For S2, the standard-question text data to be treated and the standard-answer text data to be treated of each of the plurality of question-answer pairs to be treated are input into the entity recognition model for entity recognition, and all entity data output by the entity recognition model are taken as the entity data set to be deduplicated.

The pre-training model bert_this is a model obtained by training based on the network of the bert-base-Chinese BERT pre-training model. The method of obtaining the pre-training model bert_this by training based on the bert-base-Chinese network may be selected from the prior art and is not described in detail here.

The method of training the entity recognition model based on the pre-training model bert_this and a CRF (conditional random field) network may likewise be selected from the prior art and is not described here.

For S3, deduplication processing is performed on all entity data in the entity data set to be deduplicated, and the entity data set after deduplication processing is taken as the deduplicated entity data set corresponding to the plurality of question-answer pairs to be treated. That is, across the plurality of question-answer pairs to be treated, each entity data item is unique within the corresponding deduplicated entity data set.
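
One simple way to obtain this uniqueness is an order-preserving deduplication, sketched below. The function name `deduplicate_entities` is hypothetical, and exact string equality is assumed as the duplicate test since the text does not define one.

```python
from typing import Iterable, List


def deduplicate_entities(entities: Iterable[str]) -> List[str]:
    """Remove exact duplicates while keeping the first occurrence order."""
    # dict keys are unique and keep insertion order in Python 3.7+.
    return list(dict.fromkeys(entities))
```

Near-duplicates such as abbreviations are not merged at this point; they are handled by the later entity alignment step.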

Entity data refers to entities in triples (an entity is an abstraction of an objective individual, and a person, a movie, a sentence can all be considered as one entity).

For S4, optimal entity data is determined according to the entity similarity among the entity data in the deduplicated entity data set; the corresponding entity data in the plurality of question-answer pairs to be treated is replaced with the optimal entity data, and the plurality of question-answer pairs after replacement are taken as the entity-aligned question-answer pair set.

For S5, similarity judgment is performed according to the entity data of the standard-question text data and the standard-answer text data of the entity-aligned question-answer pairs, and the entity-aligned question-answer pair set is divided into the suspected similar question-answer pair set, the incompletely similar question-answer pair set and the dissimilar question-answer pair set according to the similarity judgment result. The dissimilar question-answer pair set is free of the prior-art problems of repeated knowledge among standard questions, a standard question having multiple intents or no intent, and incomplete or excessive answers caused by the number of intents of a standard question not being equal to the number of values of the standard answer, so it can be updated directly into the target knowledge base. The target knowledge base is a knowledge base newly built after treatment, so the knowledge in the target knowledge base is free of those same problems.

For S6, attribute deduplication processing and attribute value deduplication processing are performed on the suspected similar question-answer pair set and the incompletely similar question-answer pair set, and the deduplicated question-answer pair set is determined from the two sets after this processing. The deduplicated question-answer pair set is free of the prior-art problems of repeated knowledge among standard questions, a standard question having multiple intents or no intent, and incomplete or excessive answers caused by the number of intents of a standard question not being equal to the number of values of the standard answer, so it can be updated directly into the target knowledge base.

It can be understood that the target knowledge base is initially empty, and question-answer pairs are added to it through step S5 and step S6.

In one embodiment, before the step of inputting the plurality of to-be-treated question-answer pairs into the entity recognition model for entity recognition to obtain the entity data set to be de-duplicated corresponding to the plurality of to-be-treated question-answer pairs, the method further includes:

S021: obtaining a plurality of training samples, the training samples comprising: text sample data to be trained and text sample calibration data;

S022: dividing the training samples according to a preset dividing rule to obtain a training set and a verification set;

S023: training a first model to be trained with the training set, and determining the first model to be trained after training is finished as a first model to be verified, wherein the first model to be trained is a model obtained based on the pre-trained model bert_this and the CRF network;

S024: verifying the first model to be verified with the verification set, and determining the first model to be verified as the entity recognition model when verification is successful.

This embodiment obtains the entity recognition model by training based on the pre-trained model bert_this and the CRF network, which provides a basis for entity recognition on the plurality of to-be-treated question-answer pairs.

Corresponding to S021, the plurality of training samples may be obtained from a database, input by a user, or sent by a third-party application system.

Optionally, the training samples are training samples obtained based on an untreated knowledge base.

Each training sample includes a text sample data to be trained and a text sample calibration data. The text sample data to be trained can be the text data of the questions in the untreated knowledge base, and can also be the text data of the answers in the untreated knowledge base.

In the same training sample, the text sample calibration data is a calibration result for entity data identification of the text sample data to be trained.

Corresponding to S022, each of the plurality of training samples is assigned to either the training set or the verification set according to the preset division rule, that is, no training sample appears in both the training set and the verification set.

Optionally, the preset division rule is that 80% is divided into the training set and 20% is divided into the verification set. It is to be understood that the preset dividing rule may be other rules, which are not specifically limited herein.
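As a minimal sketch of the optional 80%/20% rule, the following Python divides samples into two disjoint sets. The function name `split_samples`, the shuffling step, and the fixed seed are illustrative assumptions; the text fixes only the proportions.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into two disjoint lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded only for reproducibility
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical samples: (text sample data to be trained, calibration data)
samples = [(f"text {i}", f"labels {i}") for i in range(10)]
train_set, verify_set = split_samples(samples)
```

Because each sample lands on exactly one side of the cut, the two sets cannot share a sample.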

Corresponding to S023, the first model to be trained is trained with the training samples in the training set, using a cross-entropy loss function and an Adam optimizer whose learning rate is set to 1e-5, with precision as the evaluation metric; the first model to be trained after training is finished is determined as the first model to be verified.

The first model to be trained comprises a bert-base-Chinese module and a CRF module. The bert-base-Chinese module is trained to obtain the bert_this module.

The batch size of the bert_this module is 64, the learning rate is 3e-5, the number of training steps is 50000, and the number of steps of the learning rate optimizing method (i.e. warmup) is 5000.
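The warmup setting can be illustrated with a short schedule function. This is a sketch only: the linear ramp and the constant rate after warmup are assumptions, since the text gives the step counts and the learning rate but not the shape of the schedule, and `warmup_lr` is a hypothetical name.

```python
def warmup_lr(step, base_lr=3e-5, warmup_steps=5000):
    """Learning rate at a given step: linear ramp (assumed), then constant (assumed)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


# Learning rate at a few points of the 50000-step run
schedule = {step: warmup_lr(step) for step in (0, 2500, 5000, 50000)}
```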

Corresponding to S024, the method for verifying the first model to be verified by using the verification set may be selected from the prior art, which is not described herein in detail.

When the verification is successful, the first model to be verified is determined as the entity recognition model; otherwise, steps S023 to S024 are executed again until the verification is successful.

In one embodiment, the step of performing entity data alignment processing according to the de-duplicated entity data set and the plurality of to-be-treated question-answer pairs to obtain an entity-aligned question-answer pair set includes:

S41: performing pairwise similarity calculation on all entity data in the de-duplicated entity data set using a minimum edit distance calculation method to obtain the entity similarity matrix corresponding to the plurality of to-be-treated question-answer pairs;

S42: extracting entity similarities by columns from the entity similarity matrix corresponding to the plurality of to-be-treated question-answer pairs to obtain a plurality of entity similarity sets to be optimized;

S43: acquiring a first similarity threshold, and acquiring, from each entity similarity set to be optimized, the entity similarities greater than the first similarity threshold to obtain the entity similarity set to be aligned corresponding to each entity similarity set to be optimized;

S44: obtaining the entity data set to be screened corresponding to each entity similarity set to be aligned according to the entity similarity sets to be aligned and the de-duplicated entity data set;

S45: obtaining, from the entity data set to be screened corresponding to each entity similarity set to be aligned, the entity data with the most characters as the optimal entity data corresponding to that entity similarity set to be aligned;

S46: replacing entity data in the plurality of to-be-treated question-answer pairs with the optimal entity data corresponding to each entity similarity set to be aligned to obtain the entity-aligned question-answer pair set.

In this embodiment, entity data alignment is performed according to the entity similarity between the entity data in the de-duplicated entity data set, which improves the normalization of entity data in the entity-aligned question-answer pair set and helps improve the quality of question-answer treatment.

Corresponding to S41, the calculation formula of the entity similarity is as follows:

wherein similarity is the entity similarity in the entity similarity matrix corresponding to the plurality of to-be-treated question-answer pairs, ED_AB is the minimum edit distance between entity data A and entity data B, L_A is the character length of entity data A, L_B is the character length of entity data B, and max() is a function that returns the maximum value.
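A form consistent with these symbol definitions is similarity = 1 - ED_AB / max(L_A, L_B). This form is an assumption inferred from the definitions, not a quotation. The sketch below computes it in Python with a standard Levenshtein distance; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def entity_similarity(a, b):
    """Assumed form: 1 - ED_AB / max(L_A, L_B)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest
```

Under this form, identical strings score 1 and strings with no characters in common score 0, so a larger value means closer entity data, which matches the later comparison against the first similarity threshold.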

If the number of entities in the de-duplicated entity data set is EN, the entity similarity matrix corresponding to the plurality of to-be-treated question-answer pairs has EN rows and EN columns, and each element of the matrix represents the similarity between the entity data corresponding to its row number and the entity data corresponding to its column number. For example, the element in row 3, column 5 of the entity similarity matrix is the similarity between the entity data corresponding to row 3 and the entity data corresponding to column 5; this example is not limiting.

Optionally, a row number and a column number with the same value in the entity similarity matrix correspond to the same entity data. For example, row 3 and column 3 of the entity similarity matrix correspond to the same entity data; this example is not limiting.

Corresponding to S42, the entity similarities are extracted by columns from the entity similarity matrix corresponding to the plurality of to-be-treated question-answer pairs, that is, the elements of each column of the entity similarity matrix are taken as one entity similarity set to be optimized.

It can be understood that the number of entities in the entity data set after de-duplication is EN, and the number of entity similarity sets to be optimized in the plurality of entity similarity sets to be optimized is also EN.

It may be understood that in another embodiment, the entity similarity may be extracted from the entity similarity matrix corresponding to the plurality of to-be-treated question-and-answer pairs by rows, so as to obtain a plurality of to-be-optimized entity similarity sets, and the effect of the method for extracting the entity similarity by rows and the method for extracting the entity similarity by columns is the same.
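The matrix construction and the column-wise extraction can be sketched as follows. The similarity function `char_jaccard` is a toy stand-in chosen only to keep the example self-contained; it is not the minimum edit distance method of S41, and all names here are hypothetical.

```python
def char_jaccard(a, b):
    """Toy symmetric similarity: overlap of the two character sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def similarity_matrix(entities, sim):
    """EN x EN matrix; element [r][c] compares entity r with entity c."""
    return [[sim(row_entity, col_entity) for col_entity in entities]
            for row_entity in entities]


def extract_columns(matrix):
    """One entity similarity set to be optimized per column."""
    return [list(column) for column in zip(*matrix)]


def extract_rows(matrix):
    """Alternative embodiment: one set per row."""
    return [list(row) for row in matrix]


entities = ["credit card", "credit-card", "loan"]
matrix = similarity_matrix(entities, char_jaccard)
column_sets = extract_columns(matrix)
row_sets = extract_rows(matrix)
```

With a symmetric similarity measure the matrix equals its transpose, which is why extraction by rows and extraction by columns give the same sets.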

Corresponding to S43, the first similarity threshold may be obtained from the database, or may be a first similarity threshold input by the user, or may be a first similarity threshold sent by the third party application system. It will be appreciated that the first similarity threshold may also be written in a program file embodying the application.

One entity similarity set to be optimized is taken from the plurality of entity similarity sets to be optimized as the target entity similarity set to be optimized; each entity similarity in the target entity similarity set to be optimized is compared with the first similarity threshold, every entity similarity greater than the first similarity threshold is taken as an entity similarity to be aligned, and all such entity similarities together form the entity similarity set to be aligned corresponding to the target entity similarity set to be optimized; this step is repeated until the entity similarity sets to be aligned corresponding to all of the plurality of entity similarity sets to be optimized have been determined.

Corresponding to S44, one entity similarity set to be aligned is taken from the entity similarity sets to be aligned corresponding to the plurality of entity similarity sets to be optimized as the target entity similarity set to be aligned; the de-duplicated entity data set is searched using each entity similarity to be aligned in the target entity similarity set to be aligned, the entity data found is taken as entity data to be screened, and all such entity data together form the entity data set to be screened corresponding to the target entity similarity set to be aligned; this step is repeated until the entity data sets to be screened corresponding to all of the entity similarity sets to be aligned have been determined.

For example, when step S42 extracts by columns and the target entity similarity set to be aligned contains the entity similarity at row 21, column 32, the entity data in the de-duplicated entity data set corresponding to row 21 and to column 32 are taken as the entity data to be screened for that entity similarity; this example is not limiting.

Corresponding to S45, one entity similarity set to be aligned is taken from the plurality of entity similarity sets to be aligned as the target entity similarity set to be aligned; the entity data with the most characters is obtained from the entity data set to be screened corresponding to the target entity similarity set to be aligned and is taken as the optimal entity data corresponding to the target entity similarity set to be aligned; this step is repeated until the optimal entity data corresponding to each of the entity similarity sets to be aligned has been determined.

Corresponding to S46, one entity similarity set to be aligned is taken from the plurality of entity similarity sets to be aligned as the target entity similarity set to be aligned; the entity data corresponding to the target entity similarity set to be aligned is taken as the entity data set to be replaced corresponding to the target entity similarity set to be aligned; the plurality of to-be-treated question-answer pairs are replaced using the entity data set to be replaced and the optimal entity data corresponding to the target entity similarity set to be aligned; this step is repeated until the replacement has been completed for the optimal entity data corresponding to every entity similarity set to be aligned.

Replacing the plurality of to-be-treated question-answer pairs using the entity data set to be replaced and the optimal entity data means that, wherever entity data in the plurality of to-be-treated question-answer pairs is the same as entity data in the entity data set to be replaced corresponding to the target entity similarity set to be aligned, it is replaced with the optimal entity data corresponding to the target entity similarity set to be aligned.
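Steps S43 to S46 can be sketched together in a few lines of Python. This is an illustrative reading, not the patented implementation: the function names, the hand-written similarity values, and the plain string replacement (which assumes that no entity is a substring of another) are all assumptions.

```python
def align_entities(entities, matrix, threshold):
    """Map each entity to the longest entity whose similarity to it exceeds the threshold."""
    mapping = {}
    for col, entity in enumerate(entities):
        # S43/S44: rows of this column above the threshold give the candidates
        candidates = [entities[row] for row in range(len(entities))
                      if matrix[row][col] > threshold]
        # S45: keep the candidate with the most characters
        mapping[entity] = max(candidates, key=len) if candidates else entity
    return mapping


def replace_entities(text, mapping):
    """S46: rewrite non-optimal entity mentions to their optimal form."""
    for old, new in mapping.items():
        if old != new:
            text = text.replace(old, new)
    return text


entities = ["PingAn Bank", "Ping An Bank", "credit card"]
matrix = [[1.00, 0.92, 0.10],   # hypothetical similarity values
          [0.92, 1.00, 0.10],
          [0.10, 0.10, 1.00]]
mapping = align_entities(entities, matrix, threshold=0.8)
aligned = replace_entities("Open a PingAn Bank credit card", mapping)
```

Here the two spellings of the bank name fall into the same candidate group, and the longer spelling is kept as the optimal entity data.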

In one embodiment, the step of performing similarity judgment according to the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set, and a dissimilar question-answer pair set includes:

S51: dividing the entity-aligned question-answer pair set by text category and entity data to obtain a plurality of same-entity question text data subsets and a plurality of same-entity answer text data subsets;

S52: performing pairwise similarity calculation on the entity-aligned question text data in each same-entity question text data subset using a cosine similarity calculation method to obtain the question similarity matrix corresponding to each of the plurality of same-entity question text data subsets;

S53: performing pairwise similarity calculation on the entity-aligned answer text data in each same-entity answer text data subset using a cosine similarity calculation method to obtain the answer similarity matrix corresponding to each of the plurality of same-entity answer text data subsets;

S54: performing similarity judgment according to the entity-aligned question-answer pair set, the question similarity matrix, and the answer similarity matrix to obtain the suspected similar question-answer pair set, the incompletely similar question-answer pair set, and the dissimilar question-answer pair set.

This embodiment performs similarity judgment on the entity data of the question text data and the answer text data in the entity-aligned question-answer pair set, which makes it possible to classify the entity-aligned question-answer pairs according to the similarity judgment result and provides a basis for solving the technical problems of knowledge repetition among standard questions, standard questions with multiple intentions or no intention, and incomplete or excessive answers caused by a mismatch between the number of intentions of a standard question and the number of values of its standard answer.

Corresponding to S51, the entity-aligned question text data in the entity-aligned question-answer pair set is divided according to entity data to obtain a plurality of same-entity question text data subsets, wherein the question text data within one subset share the same entity data. The entity-aligned answer text data in the entity-aligned question-answer pair set is divided according to entity data to obtain a plurality of same-entity answer text data subsets, wherein the answer text data within one subset share the same entity data.
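The division into same-entity subsets amounts to grouping texts by their entity data. The sketch below assumes, for simplicity, that each text carries exactly one entity; the record layout and the name `group_by_entity` are hypothetical.

```python
from collections import defaultdict


def group_by_entity(records):
    """records: iterable of (entity data, text). Returns {entity data: [texts]}."""
    groups = defaultdict(list)
    for entity, text in records:
        groups[entity].append(text)
    return dict(groups)


# Hypothetical entity-aligned question text data
questions = [
    ("credit card", "How do I apply for a credit card?"),
    ("loan", "What is the loan interest rate?"),
    ("credit card", "How to apply for a credit card"),
]
question_subsets = group_by_entity(questions)
```

The same function would be applied a second time to the answer text data, so that questions and answers are grouped separately.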

Corresponding to S52, any one of the plurality of same-entity question text data subsets is taken as the target same-entity question text data subset; each piece of entity-aligned question text data in the target subset is input into the pre-trained language model BERT_this for vector prediction of the marker bit, yielding a target first vector for each piece of entity-aligned question text data in the target subset; the question similarity between any two of these target first vectors is calculated using the cosine similarity calculation method, yielding the question similarity matrix corresponding to the target same-entity question text data subset.

The similarity cos(θ) in the cosine similarity calculation method is calculated as follows:

$$\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

wherein a represents the first of any two of the target first vectors and b represents the second of any two of the target first vectors; a · b is the dot product of vector a and vector b, ‖a‖ is the modulus of vector a, and ‖b‖ is the modulus of vector b.
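For illustration, the cosine similarity of two vectors can be computed in plain Python as follows. The vectors here are short made-up lists rather than model outputs, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their moduli."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: similarity with a zero vector is taken as 0
    return dot / (norm_a * norm_b)


def pairwise_matrix(vectors):
    """Similarity matrix over all pairs of vectors in one subset."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
sim_matrix = pairwise_matrix(vectors)
```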

Corresponding to S53, any one of the plurality of same-entity answer text data subsets is taken as the target same-entity answer text data subset; each piece of entity-aligned answer text data in the target subset is input into the pre-trained language model BERT_this for vector prediction of the marker bit, yielding a target second vector for each piece of entity-aligned answer text data in the target subset; the answer similarity between any two of these target second vectors is calculated using the cosine similarity calculation method, yielding the answer similarity matrix corresponding to the target same-entity answer text data subset.

Corresponding to S54, similarity judgment is performed according to the question similarity matrix and the answer similarity matrix, and the question-answer pairs in the entity-aligned question-answer pair set are classified according to the similarity judgment result to obtain the suspected similar question-answer pair set, the incompletely similar question-answer pair set, and the dissimilar question-answer pair set.

In one embodiment, the step of performing similarity judgment according to the entity-aligned question-answer pair set, the question similarity matrix, and the answer similarity matrix to obtain the suspected similar question-answer pair set, the incompletely similar question-answer pair set, and the dissimilar question-answer pair set includes:

S541: extracting question similarities by columns from the question similarity matrix to obtain a plurality of question similarity sets to be optimized;

S542: extracting answer similarities by columns from the answer similarity matrix to obtain a plurality of answer similarity sets to be optimized;

S543: acquiring a second similarity threshold;

S544: extracting, from the plurality of question similarity sets to be optimized and the plurality of answer similarity sets to be optimized respectively, the question similarity to be optimized and the answer similarity to be optimized corresponding to each entity-aligned question-answer pair in the entity-aligned question-answer pair set;

S545: when the question similarity to be optimized and the answer similarity to be optimized corresponding to an entity-aligned question-answer pair in the entity-aligned question-answer pair set are both greater than the second similarity threshold, determining the suspected similar question-answer pair set according to that entity-aligned question-answer pair;

S546: when either the question similarity to be optimized or the answer similarity to be optimized corresponding to an entity-aligned question-answer pair in the entity-aligned question-answer pair set is less than or equal to the second similarity threshold, determining a question-answer pair set to be distinguished according to that entity-aligned question-answer pair;

S547: when the number of attributes of the question text data and the number of attributes of the answer text data of an entity-aligned question-answer pair in the question-answer pair set to be distinguished are both equal to 1, determining the dissimilar question-answer pair set according to that entity-aligned question-answer pair;

S548: when either the number of attributes of the question text data or the number of attributes of the answer text data of an entity-aligned question-answer pair in the question-answer pair set to be distinguished is not equal to 1, determining the incompletely similar question-answer pair set according to that entity-aligned question-answer pair.

This embodiment performs similarity judgment according to the entity-aligned question-answer pair set, the question similarity matrix, and the answer similarity matrix, which makes it possible to classify the entity-aligned question-answer pairs according to similarity and provides a basis for solving the technical problems of knowledge repetition among standard questions, standard questions with multiple intentions or no intention, and incomplete or excessive answers caused by a mismatch between the number of intentions of a standard question and the number of values of its standard answer.

Corresponding to S541, the question similarities are extracted by columns from the question similarity matrix, that is, the elements of each column of the question similarity matrix are taken as one question similarity set to be optimized.

It can be understood that in another embodiment, the question similarities may be extracted by rows from the question similarity matrix to obtain the plurality of question similarity sets to be optimized; extracting the question similarities by rows has the same effect as extracting them by columns.

Corresponding to S542, the answer similarities are extracted by columns from the answer similarity matrix, that is, the elements of each column of the answer similarity matrix are taken as one answer similarity set to be optimized.

It may be appreciated that in another embodiment, the answer similarity may be extracted from the answer similarity matrix by rows, so as to obtain a plurality of answer similarity sets to be optimized, and the method of extracting the answer similarity by rows and the method of extracting the answer similarity by columns have the same effect.

Corresponding to S543, the second similarity threshold may be obtained from the database, or the second similarity threshold input by the user, or the second similarity threshold sent by the third party application system. It will be appreciated that the second similarity threshold may also be written in a program file embodying the application.

Corresponding to S544, any entity-aligned question-answer pair in the entity-aligned question-answer pair set is taken as the target entity-aligned question-answer pair; the question similarity to be optimized corresponding to the target entity-aligned question-answer pair is extracted from the plurality of question similarity sets to be optimized, and the answer similarity to be optimized corresponding to the target entity-aligned question-answer pair is extracted from the plurality of answer similarity sets to be optimized.

Corresponding to S545, any entity-aligned question-answer pair in the entity-aligned question-answer pair set is taken as the target entity-aligned question-answer pair; when the question similarity to be optimized and the answer similarity to be optimized corresponding to the target entity-aligned question-answer pair are both greater than the second similarity threshold, the target entity-aligned question-answer pair has a high similarity to other data in the entity-aligned question-answer pair set, and it is classified into the suspected similar question-answer pair set.

Corresponding to S546, any entity-aligned question-answer pair in the entity-aligned question-answer pair set is taken as the target entity-aligned question-answer pair; when either the question similarity to be optimized or the answer similarity to be optimized corresponding to the target entity-aligned question-answer pair is less than or equal to the second similarity threshold, the target entity-aligned question-answer pair may still be similar to other data in the entity-aligned question-answer pair set, and it is classified into the question-answer pair set to be distinguished.

Corresponding to S547, any entity-aligned question-answer pair in the question-answer pair set to be distinguished is taken as the entity-aligned question-answer pair to be distinguished; when the number of attributes of its question text data and the number of attributes of its answer text data are both equal to 1, the attributes in the entity-aligned question-answer pair to be distinguished are single, no attribute de-duplication processing is needed, and the pair is classified into the dissimilar question-answer pair set.

Corresponding to S548, any entity-aligned question-answer pair in the question-answer pair set to be distinguished is taken as the entity-aligned question-answer pair to be distinguished; when either the number of attributes of its question text data or the number of attributes of its answer text data is not equal to 1, the attributes in the entity-aligned question-answer pair to be distinguished are not single, attribute de-duplication processing is needed, and the pair is classified into the incompletely similar question-answer pair set.
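The branching in S545 to S548 can be summarized as a small decision function. In this sketch each pair is assumed to carry a single question similarity and a single answer similarity (for example, its highest similarity to any other pair); that reduction, the function name, and the label strings are assumptions made for illustration.

```python
def classify_pair(question_sim, answer_sim,
                  question_attr_count, answer_attr_count, threshold):
    """Return the set that an entity-aligned question-answer pair falls into."""
    # S545: both similarities above the second similarity threshold
    if question_sim > threshold and answer_sim > threshold:
        return "suspected_similar"
    # S546: otherwise the pair is "to be distinguished" and is split by attribute counts
    if question_attr_count == 1 and answer_attr_count == 1:
        return "dissimilar"            # S547
    return "incompletely_similar"      # S548


labels = [
    classify_pair(0.90, 0.95, 1, 1, threshold=0.8),
    classify_pair(0.90, 0.50, 1, 1, threshold=0.8),
    classify_pair(0.50, 0.90, 2, 1, threshold=0.8),
]
```

Only the pairs labeled dissimilar go to the target knowledge base directly; the other two groups continue to the de-duplication steps.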

In one embodiment, the step of performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incomplete similar question-answer pair set to obtain a deduplicated question-answer pair set includes:

S61: searching for and deleting the question text data whose number of attributes is equal to 0 from the suspected similar question-answer pair set and the incompletely similar question-answer pair set respectively, to obtain an optimized suspected similar question-answer pair set and an optimized incompletely similar question-answer pair set;

S62: sequentially extracting one question-answer pair from the optimized suspected similar question-answer pair set and the optimized incompletely similar question-answer pair set respectively, to obtain a question-answer pair to be de-duplicated;

S63: when the number of attributes of the question text data in the question-answer pair to be de-duplicated is equal to 1, taking the question-answer pair to be de-duplicated as a de-duplicated question-answer pair;

S64: when the number of attributes of the question text data in the question-answer pair to be de-duplicated is greater than 1, or the number of attribute values of the answer text data in the question-answer pair to be de-duplicated is greater than 1, performing attribute separation and attribute value separation on the question-answer pair to be de-duplicated to obtain a plurality of single-attribute-value question text data and a plurality of single-attribute-value answer text data;

S65: performing same-attribute de-duplication processing and same-attribute-value de-duplication processing on the plurality of single-attribute-value question text data and the plurality of single-attribute-value answer text data on the principle of retaining the longest characters, to obtain a de-duplicated single-attribute-value question text data set and a de-duplicated single-attribute-value answer text data set;

S66: pairing each single-attribute-value question text data in the de-duplicated single-attribute-value question text data set with each single-attribute-value answer text data in the de-duplicated single-attribute-value answer text data set to obtain a plurality of de-duplicated question-answer pairs;

S67: determining the de-duplicated question-answer pair set according to all the de-duplicated question-answer pairs.

This embodiment implements attribute de-duplication and attribute value de-duplication, which improves the quality of the treated knowledge base, removes the need for continuous manual participation in the treatment process, and improves treatment efficiency.

Corresponding to S61, the question text data whose number of attributes is equal to 0 is found in the suspected similar question-answer pair set and deleted from it, and the suspected similar question-answer pair set after deletion is taken as the optimized suspected similar question-answer pair set; the question text data whose number of attributes is equal to 0 is found in the incompletely similar question-answer pair set and deleted from it, and the incompletely similar question-answer pair set after deletion is taken as the optimized incompletely similar question-answer pair set. This solves the problem of standard questions having no intention.
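As a brief sketch of this filtering step, the following Python drops every pair whose question has no attribute. Removing the whole pair, the dictionary layout, and the field name `question_attribute_count` are assumptions for illustration; how attributes are counted is outside the scope of the sketch.

```python
def drop_intentless(pairs):
    """Keep only pairs whose question text data has at least one attribute."""
    return [pair for pair in pairs if pair["question_attribute_count"] != 0]


# Hypothetical suspected similar question-answer pair set
suspected_similar = [
    {"question": "Hello there", "answer": "Hi", "question_attribute_count": 0},
    {"question": "What is the annual fee of the credit card?",
     "answer": "The annual fee is waived in the first year.",
     "question_attribute_count": 1},
]
optimized_suspected_similar = drop_intentless(suspected_similar)
```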

Attributes in the number of attributes refer to attributes in a triplet (i.e., an abstraction of a relationship between entities).

Corresponding to S62, one question-answer pair is extracted from the optimized suspected similar question-answer pair set, one question-answer pair is extracted from the optimized incompletely similar question-answer pair set, and each extracted question-answer pair is taken as a question-answer pair to be de-duplicated.

Corresponding to S63, when the number of attributes of the question text data in the question-answer pair to be de-duplicated is equal to 1, the attribute of the question-answer pair to be de-duplicated is single, no attribute de-duplication processing is needed, and the question-answer pair to be de-duplicated is taken as a de-duplicated question-answer pair.

Corresponding to S64, when the attribute number of the question text data in the question-answer pair to be deduplicated is greater than 1, or its number of attribute values is greater than 1, the attributes of the pair are not single and attribute deduplication is required. The question text data in the pair is split by single attribute to obtain a plurality of single-attribute-value question text data, and the answer text data in the pair is split by single attribute to obtain a plurality of single-attribute-value answer text data. In each single-attribute-value question text data, and likewise in each single-attribute-value answer text data, the attribute is unique and the attribute value is unique.

Attribute values are the values of the attributes in triples (a value describes an entity and may be of text type or numeric type).
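The split described for S63 and S64 can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: each piece of text data is assumed to be already reduced to a list of (attribute, attribute value) tuples, and the names `Unit`, `needs_splitting` and `split_single_attribute_value` are hypothetical. How the underlying text itself is segmented is not specified in the description and is not modelled here.

```python
from typing import List, Tuple

Unit = Tuple[str, str]  # (attribute, attribute value); assumed representation


def needs_splitting(units: List[Unit]) -> bool:
    """S64 condition: more than one attribute, or more than one attribute value."""
    attributes = {attribute for attribute, _ in units}
    values = {value for _, value in units}
    return len(attributes) > 1 or len(values) > 1


def split_single_attribute_value(units: List[Unit]) -> List[List[Unit]]:
    """Return one group per (attribute, value) when splitting is needed."""
    if not needs_splitting(units):
        return [list(units)]          # S63: already single, keep as is
    return [[unit] for unit in units]  # S64: one unit per group


print(split_single_attribute_value([("annual fee", "200 yuan")]))
print(split_single_attribute_value([("annual fee", "200 yuan"),
                                    ("credit limit", "50,000 yuan")]))
```

After the split, every group holds exactly one attribute and one attribute value, which is the uniqueness property the description requires of single-attribute-value text data.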

Corresponding to S65, same-attribute deduplication and same-attribute-value deduplication are performed on the plurality of single-attribute-value question text data, following the principle of retaining the attribute with the most characters and the principle of retaining the attribute value with the most characters, to obtain the deduplicated single-attribute-value question text data set; the same two kinds of deduplication, under the same two principles, are performed on the plurality of single-attribute-value answer text data to obtain the deduplicated single-attribute-value answer text data set.
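One possible reading of the longest-character principles is sketched below. The description does not say how two attributes (or two attribute values) are judged to be "the same", so the substring test `contains` is purely an assumption, as are the names `Unit`, `keep_longest` and `deduplicate_units` and the sample data.

```python
from typing import Callable, List, Tuple

Unit = Tuple[str, str]  # (attribute, attribute value); assumed representation


def contains(a: str, b: str) -> bool:
    """Assumed sameness test: one string contains the other."""
    return a in b or b in a


def keep_longest(units: List[Unit], field: int,
                 same: Callable[[str, str], bool]) -> List[Unit]:
    """Among units whose `field` is judged the same, keep the longest one."""
    kept: List[Unit] = []
    # Visit the longest strings first so each group is represented by its longest member.
    for unit in sorted(units, key=lambda u: len(u[field]), reverse=True):
        if not any(same(unit[field], k[field]) for k in kept):
            kept.append(unit)
    return kept


def deduplicate_units(units: List[Unit],
                      same: Callable[[str, str], bool] = contains) -> List[Unit]:
    after_attributes = keep_longest(units, 0, same)  # same-attribute pass
    return keep_longest(after_attributes, 1, same)   # same-attribute-value pass


units = [("rate", "4.5%"), ("loan rate", "4.5% per year"), ("term", "12 months")]
print(deduplicate_units(units))
# [('loan rate', '4.5% per year'), ('term', '12 months')]
```

Passing a different `same` predicate (for example exact equality after normalization) changes which units are merged without changing the retention rule.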

Corresponding to S66, the single-attribute-value question text data in the deduplicated single-attribute-value question text data set are matched one by one with the single-attribute-value answer text data in the deduplicated single-attribute-value answer text data set to obtain a plurality of deduplicated question-answer pairs.

Steps S62 to S66 are repeated until all question-answer pairs in the optimized suspected similar question-answer pair set and the optimized incompletely similar question-answer pair set have been processed.

Corresponding to S67, all the deduplicated question-answer pairs are taken together as the deduplicated question-answer pair set.
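The loop over S62 to S67 might be organized as in the sketch below. It assumes each pair is already represented as (question units, answer units), takes the per-side deduplication of S63 to S65 as a caller-supplied function, and interprets "matched one by one" as positional pairing with `zip`, which silently drops unmatched extras; all of these are assumptions, and `deduplicate_sets` is a hypothetical name.

```python
from itertools import chain
from typing import Callable, Iterable, List, Tuple

Unit = Tuple[str, str]                 # (attribute, attribute value); assumed
Pair = Tuple[List[Unit], List[Unit]]   # (question units, answer units); assumed


def deduplicate_sets(suspected: Iterable[Pair],
                     incomplete: Iterable[Pair],
                     dedupe: Callable[[List[Unit]], List[Unit]]
                     ) -> List[Tuple[Unit, Unit]]:
    result: List[Tuple[Unit, Unit]] = []
    # S62: take pairs from both optimized sets in turn.
    for question_units, answer_units in chain(suspected, incomplete):
        kept_questions = dedupe(question_units)   # S63 to S65, question side
        kept_answers = dedupe(answer_units)       # S63 to S65, answer side
        # S66: positional one-to-one matching (an assumption).
        result.extend(zip(kept_questions, kept_answers))
    return result                                 # S67: the deduplicated set


def drop_exact_duplicates(units: List[Unit]) -> List[Unit]:
    """Stand-in deduplication that only removes exact repeats, keeping order."""
    return list(dict.fromkeys(units))


suspected = [([("rate", "4%"), ("rate", "4%")], [("rate", "4%")])]
incomplete = [([("term", "12m")], [("term", "12m")])]
print(deduplicate_sets(suspected, incomplete, drop_exact_duplicates))
```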

In one embodiment, the step of respectively searching for and deleting the question text data with an attribute number equal to 0 in the suspected similar question-answer pair set and the incompletely similar question-answer pair set, to obtain an optimized suspected similar question-answer pair set and an optimized incompletely similar question-answer pair set, includes:

S611: finding the question text data with an attribute number equal to 0 among all the question text data in the suspected similar question-answer pair set, to obtain first question text data to be deleted;

S612: deleting the first question text data to be deleted from the suspected similar question-answer pair set, to obtain the optimized suspected similar question-answer pair set;

S613: finding the answer text data with an attribute number equal to 0 among all the answer text data in the incompletely similar question-answer pair set, to obtain first answer text data to be deleted;

S614: deleting the first answer text data to be deleted from the incompletely similar question-answer pair set, to obtain the optimized incompletely similar question-answer pair set.

This embodiment deletes the question text data whose attribute number is equal to 0, thereby solving the problem of standard questions that carry no intent.

Corresponding to S611, the question text data with an attribute number equal to 0 is found among all the question text data in the suspected similar question-answer pair set, and the found question text data is taken as the first question text data to be deleted.

Corresponding to S612, the first question text data to be deleted is deleted from the suspected similar question-answer pair set, and the set after deletion is taken as the optimized suspected similar question-answer pair set.

Corresponding to S613, the answer text data with an attribute number equal to 0 is found among all the answer text data in the incompletely similar question-answer pair set, and the found answer text data is taken as the first answer text data to be deleted.

Corresponding to S614, the first answer text data to be deleted is deleted from the incompletely similar question-answer pair set, and the set after deletion is taken as the optimized incompletely similar question-answer pair set.
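The find-then-delete pattern of S611 to S614 can be sketched as a single helper applied once to each set. The dictionary layout of a pair, the helper name `delete_zero_attribute` and the sample data are assumptions made for illustration; which text data of the pair is inspected is left to the caller-supplied `attribute_number` function.

```python
from typing import Callable, List, Tuple, TypeVar

P = TypeVar("P")


def delete_zero_attribute(pairs: List[P],
                          attribute_number: Callable[[P], int]
                          ) -> Tuple[List[P], List[P]]:
    """Return (optimized set, items to be deleted) without mutating `pairs`."""
    # S611 / S613: find the items whose attribute number equals 0.
    to_delete = [p for p in pairs if attribute_number(p) == 0]
    # S612 / S614: the optimized set is everything else, in the original order.
    optimized = [p for p in pairs if attribute_number(p) != 0]
    return optimized, to_delete


# Made-up sample data; the "attributes" key is a hypothetical layout.
suspected = [
    {"question": "What is the annual fee?", "attributes": ["annual fee"]},
    {"question": "Hello?", "attributes": []},
]
optimized, deleted = delete_zero_attribute(suspected, lambda p: len(p["attributes"]))
print(len(optimized), len(deleted))  # 1 1
```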

Referring to fig. 2, the application further provides a device for managing the question and answer of the knowledge base, which comprises:

the data acquisition module 100 is configured to acquire a plurality of question-answer pairs to be treated, where each question-answer pair to be treated includes: question text data to be treated and answer text data to be treated;

the entity recognition module 200 is configured to input the plurality of question-answer pairs to be treated into an entity recognition model for entity recognition, so as to obtain an entity data set to be deduplicated corresponding to the plurality of question-answer pairs to be treated, where the entity recognition model is a model obtained by training based on a pre-training model bert_this and a CRF network;

the entity data deduplication processing module 300 is configured to perform entity data deduplication processing on the entity data set to be deduplicated, so as to obtain a deduplicated entity data set corresponding to the plurality of question-answer pairs to be treated;

the entity data alignment processing module 400 is configured to perform entity data alignment processing according to the deduplicated entity data set and the plurality of question-answer pairs to be treated, so as to obtain an entity-aligned question-answer pair set;

the similarity judging module 500 is configured to perform similarity judgment according to the entity-aligned question-answer pair set, so as to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set and a dissimilar question-answer pair set, and to update the dissimilar question-answer pair set into a target knowledge base;

and the attribute deduplication processing and attribute value deduplication processing module 600 is configured to perform attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set, so as to obtain a deduplicated question-answer pair set, and to update the deduplicated question-answer pair set into the target knowledge base.
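The data flow between modules 100 to 600 can be pictured with the following sketch, in which every module is reduced to a caller-supplied function. The class name `TreatmentDevice`, its field names and the trivial stand-in functions are hypothetical; in particular, no real entity recognition model is involved here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class TreatmentDevice:
    recognize: Callable[[List[Any]], List[Any]]             # module 200
    dedupe_entities: Callable[[List[Any]], List[Any]]       # module 300
    align: Callable[[List[Any], List[Any]], List[Any]]      # module 400
    judge: Callable[[List[Any]], tuple]                     # module 500
    dedupe_attributes: Callable[[List[Any], List[Any]], List[Any]]  # module 600
    knowledge_base: List[Any] = field(default_factory=list)

    def run(self, pairs: List[Any]) -> List[Any]:
        """`pairs` plays the role of the output of module 100."""
        entities = self.recognize(pairs)
        unique_entities = self.dedupe_entities(entities)
        aligned = self.align(unique_entities, pairs)
        suspected, incomplete, dissimilar = self.judge(aligned)
        self.knowledge_base.extend(dissimilar)  # updated directly
        self.knowledge_base.extend(self.dedupe_attributes(suspected, incomplete))
        return self.knowledge_base


# Stand-in functions that only demonstrate the wiring.
device = TreatmentDevice(
    recognize=lambda pairs: [w for q, a in pairs for w in (q + " " + a).split()],
    dedupe_entities=lambda entities: sorted(set(entities)),
    align=lambda entities, pairs: pairs,
    judge=lambda aligned: (aligned[:1], [], aligned[1:]),
    dedupe_attributes=lambda suspected, incomplete: suspected + incomplete,
)
print(device.run([("q1", "a1"), ("q2", "a2")]))
```

Keeping each module behind a plain callable mirrors the device claims, where the modules are defined by what they are configured to do rather than by a particular implementation.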

Referring to fig. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data used by the method for managing the questions and answers of the knowledge base. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for managing the questions and answers of the knowledge base. The method comprises the following steps: obtaining a plurality of question-answer pairs to be treated, where each question-answer pair to be treated includes question text data to be treated and answer text data to be treated; inputting the plurality of question-answer pairs to be treated into an entity recognition model for entity recognition to obtain an entity data set to be deduplicated corresponding to the plurality of question-answer pairs to be treated, where the entity recognition model is a model obtained by training based on a pre-training model bert_this and a CRF network; performing entity data deduplication processing on the entity data set to be deduplicated to obtain a deduplicated entity data set corresponding to the plurality of question-answer pairs to be treated; performing entity data alignment processing according to the deduplicated entity data set and the plurality of question-answer pairs to be treated to obtain an entity-aligned question-answer pair set; performing similarity judgment according to the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set and a dissimilar question-answer pair set, and updating the dissimilar question-answer pair set into a target knowledge base; and performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set, and updating the deduplicated question-answer pair set into the target knowledge base.

An embodiment of the present application further provides a computer readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing a method for managing the questions and answers of a knowledge base, including the steps of: obtaining a plurality of question-answer pairs to be treated, where each question-answer pair to be treated includes question text data to be treated and answer text data to be treated; inputting the plurality of question-answer pairs to be treated into an entity recognition model for entity recognition to obtain an entity data set to be deduplicated corresponding to the plurality of question-answer pairs to be treated, where the entity recognition model is a model obtained by training based on a pre-training model bert_this and a CRF network; performing entity data deduplication processing on the entity data set to be deduplicated to obtain a deduplicated entity data set corresponding to the plurality of question-answer pairs to be treated; performing entity data alignment processing according to the deduplicated entity data set and the plurality of question-answer pairs to be treated to obtain an entity-aligned question-answer pair set; performing similarity judgment according to the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set and a dissimilar question-answer pair set, and updating the dissimilar question-answer pair set into a target knowledge base; and performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set, and updating the deduplicated question-answer pair set into the target knowledge base.

According to the method for managing the questions and answers of the knowledge base, the plurality of question-answer pairs to be treated are input into the entity recognition model for entity recognition to obtain the entity data set to be deduplicated; entity data deduplication processing is performed on the entity data set to be deduplicated to obtain the deduplicated entity data set corresponding to the plurality of question-answer pairs to be treated; entity data alignment processing is performed according to the deduplicated entity data set and the plurality of question-answer pairs to be treated to obtain the entity-aligned question-answer pair set; similarity judgment is performed according to the entity-aligned question-answer pair set to obtain the suspected similar question-answer pair set, the incompletely similar question-answer pair set and the dissimilar question-answer pair set, and the dissimilar question-answer pair set is updated into the target knowledge base; and attribute deduplication processing and attribute value deduplication processing are performed on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain the deduplicated question-answer pair set, which is updated into the target knowledge base. The quality of the knowledge base is thereby improved, continuous manual participation in the treatment process is not needed, and treatment efficiency is improved.

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided by the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article, or method that comprises the element.

The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application or directly or indirectly applied to other related technical fields are included in the scope of the application.

Claims

1. A method for managing questions and answers in a knowledge base, the method comprising:

performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set, and updating the deduplicated question-answer pair set into the target knowledge base;

and the step of performing similarity judgment according to the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set and a dissimilar question-answer pair set comprises the following steps:

2. The method for managing the questions and answers in the knowledge base according to claim 1, wherein before the step of inputting the question-answer pairs to be treated into the entity recognition model for entity recognition to obtain the entity data set to be deduplicated corresponding to the question-answer pairs to be treated, the method further comprises:

3. The method for managing the questions and answers in the knowledge base according to claim 1, wherein the step of performing the entity data alignment processing according to the de-duplicated entity data set and the plurality of questions and answers to be managed to obtain the entity aligned question and answer pair set comprises the steps of:

4. The method for managing the questions and answers of the knowledge base according to claim 1, wherein the step of performing similarity judgment according to the question and answer pair set, the question similarity matrix, and the answer similarity matrix after the entity alignment to obtain the suspected similar question and answer pair set, the incomplete similar question and answer pair set, and the dissimilar question and answer pair set comprises the steps of:

acquiring a second similarity threshold;

5. The method for managing the questions and answers in the knowledge base according to claim 1, wherein the steps of performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question and answer pair set and the incomplete similar question and answer pair set to obtain the deduplicated question and answer pair set comprise:

6. The method for managing the questions and answers in the knowledge base according to claim 5, wherein the steps of searching and deleting the question text data with the attribute number equal to 0 for the suspected similar question and answer pair set and the incomplete similar question and answer pair set respectively, and obtaining the optimized suspected similar question and answer pair set and the optimized incomplete similar question and answer pair set comprise:

7. A device for managing questions and answers in a knowledge base, the device comprising:

the similarity judging module is configured to divide the entity-aligned question-answer pair set by text category and entity data to obtain a plurality of subsets of question text data and entities and a plurality of subsets of answer text data and entities;

performing similarity judgment according to the entity-aligned question-answer pair set, the question similarity matrix and the answer similarity matrix to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set and a dissimilar question-answer pair set, and updating the dissimilar question-answer pair set into a target knowledge base;

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.

9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.