CN112541054A - Method, device, equipment and storage medium for governing questions and answers of knowledge base

Info

Publication number: CN112541054A
Application number: CN202011479831.3A
Authority: CN
Inventors: 李骁; 赖众程; 高洪喜; 倪佳; 许海金; 李筱艺; 何凤连; 林志超; 高静; 李会璟; 史文鑫; 张舒婷; 李林毅
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-12-15
Filing date: 2020-12-15
Publication date: 2021-03-23
Anticipated expiration: 2040-12-15
Also published as: CN112541054B

Abstract

The application relates to the technical field of artificial intelligence and discloses a method, device, equipment and storage medium for governing the standard questions and standard answers of a knowledge base. The method comprises the following steps: performing entity recognition, entity data deduplication and entity data alignment on a plurality of standard question-answer pairs to be governed to obtain an entity-aligned standard question-answer pair set; performing similarity judgment on the entity-aligned standard question-answer pair set to obtain a suspected-similar standard question-answer pair set, an incompletely similar standard question-answer pair set and a dissimilar standard question-answer pair set, and updating the dissimilar standard question-answer pair set into a target knowledge base; and performing attribute deduplication and attribute value deduplication on the suspected-similar standard question-answer pair set and the incompletely similar standard question-answer pair set to obtain a deduplicated standard question-answer pair set, and updating the deduplicated standard question-answer pair set into the target knowledge base. The quality of the knowledge base is thereby improved, continuous manual participation in the governance process is not needed, and governance efficiency is improved.

Description

Method, device, equipment and storage medium for governing questions and answers of knowledge base

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a method, device, equipment and storage medium for governing the standard questions and standard answers of a knowledge base.

Background

The standard questions and standard answers in a knowledge base generally appear in pairs, which are generally called question-answer pairs. The question-answer pair is the core structure of the knowledge system of a knowledge base; the number of standard questions directly affects the number of similar questions and, in turn, the capability of the question-answering robot. The standard questions and standard answers of knowledge bases in the prior art have the following problems: (1) knowledge is duplicated among standard questions; (2) a standard question has multiple intentions or no intention; (3) the number of intentions of a standard question is not equal to the number of values of its standard answer, resulting in incomplete or excessive answers.

Disclosure of Invention

The main purpose of the application is to provide a method, device, equipment and storage medium for governing the standard questions and standard answers of a knowledge base, so as to solve the technical problems that, in prior-art knowledge bases, knowledge is duplicated among standard questions, a standard question has multiple intentions or no intention, and the number of intentions of a standard question is not equal to the number of values of its standard answer, so that answers are incomplete or excessive.

In order to achieve the above object, the present application provides a method for governing the standard questions and standard answers of a knowledge base, the method comprising:

obtaining a plurality of standard question-answer pairs to be governed, wherein each standard question-answer pair to be governed comprises: standard-question text data to be governed and standard-answer text data to be governed;

inputting the plurality of standard question-answer pairs to be governed into an entity recognition model for entity recognition to obtain an entity data set to be deduplicated corresponding to the plurality of standard question-answer pairs to be governed, wherein the entity recognition model is a model trained on the basis of a pre-training model bert _ this and a CRF network;

performing entity data deduplication on the entity data set to be deduplicated to obtain a deduplicated entity data set corresponding to the plurality of standard question-answer pairs to be governed;

performing entity data alignment according to the deduplicated entity data set and the plurality of standard question-answer pairs to be governed to obtain an entity-aligned standard question-answer pair set;

performing similarity judgment on the entity-aligned standard question-answer pair set to obtain a suspected-similar standard question-answer pair set, an incompletely similar standard question-answer pair set and a dissimilar standard question-answer pair set, and updating the dissimilar standard question-answer pair set into a target knowledge base;

and performing attribute deduplication and attribute value deduplication on the suspected-similar standard question-answer pair set and the incompletely similar standard question-answer pair set to obtain a deduplicated standard question-answer pair set, and updating the deduplicated standard question-answer pair set into the target knowledge base.

Further, before the step of inputting the plurality of standard question-answer pairs to be governed into an entity recognition model for entity recognition to obtain the entity data set to be deduplicated corresponding to the plurality of standard question-answer pairs to be governed, the method further comprises:

obtaining a plurality of training samples, each training sample comprising: text sample data to be trained and text sample calibration data;

dividing the plurality of training samples according to a preset division rule to obtain a training set and a verification set;

training a first model to be trained with the training set, and taking the trained first model as a first model to be verified, wherein the first model to be trained is a model built on the basis of the pre-training model bert _ this and the CRF network;

and verifying the first model to be verified with the verification set, and, when the verification succeeds, determining the first model to be verified as the entity recognition model.
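By way of non-limiting illustration, the division of training samples into a training set and a verification set may be sketched as follows. The application does not specify the preset division rule; the shuffled 80/20 split, the function name `split_samples` and its parameters are assumptions made only for this sketch.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(
    samples: Sequence[T], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Split samples into (training set, verification set).

    Assumed division rule: shuffle with a fixed seed, then cut at train_ratio.
    """
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Each sample would typically be a pair of text sample data and its calibration (label) data; the sketch treats samples as opaque items.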

Further, the step of performing entity data alignment according to the deduplicated entity data set and the plurality of standard question-answer pairs to be governed to obtain an entity-aligned standard question-answer pair set comprises:

performing pairwise similarity calculation on all entity data in the deduplicated entity data set by a minimum edit distance calculation method to obtain an entity similarity matrix corresponding to the plurality of standard question-answer pairs to be governed;

extracting entity similarities column by column from the entity similarity matrix to obtain a plurality of entity similarity sets to be optimized;

acquiring a first similarity threshold, and selecting from each entity similarity set to be optimized the entity similarities greater than the first similarity threshold, to obtain an entity similarity set to be aligned corresponding to each entity similarity set to be optimized;

obtaining, according to each entity similarity set to be aligned and the deduplicated entity data set, an entity data set to be screened corresponding to each entity similarity set to be aligned;

selecting, from the entity data set to be screened corresponding to each entity similarity set to be aligned, the entity data with the most characters, to obtain optimal entity data corresponding to each entity similarity set to be aligned;

and replacing the entity data in the plurality of standard question-answer pairs to be governed with the optimal entity data corresponding to each entity similarity set to be aligned, to obtain the entity-aligned standard question-answer pair set.
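The alignment steps above may be sketched as follows, purely as an illustration. The application names the minimum edit distance but does not state how a distance is turned into a similarity; the normalisation `1 - distance / max(len)` used here is an assumption, as are the function names and the single-pass regular-expression replacement.

```python
import re
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Minimum edit (Levenshtein) distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation of edit distance into a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def optimal_entities(entities: List[str], threshold: float) -> Dict[str, str]:
    """Map each deduplicated entity to the longest entity in its similar group."""
    matrix = [[similarity(a, b) for b in entities] for a in entities]
    mapping: Dict[str, str] = {}
    for col, entity in enumerate(entities):
        # Column-wise extraction: keep entities whose similarity exceeds the threshold.
        candidates = [
            entities[row] for row in range(len(entities)) if matrix[row][col] > threshold
        ]
        # The entity with the most characters is taken as the optimal entity.
        mapping[entity] = max(candidates or [entity], key=len)
    return mapping


def align_text(text: str, mapping: Dict[str, str]) -> str:
    """Replace every known entity in text with its optimal entity in one pass."""
    if not mapping:
        return text
    # Longest alternatives first, so a short entity never matches inside a longer one.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in alternatives))
    return pattern.sub(lambda m: mapping[m.group(0)], text)
```

The single-pass replacement is a design choice of the sketch: replacing entities one after another could rewrite text that an earlier replacement had already produced.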

Further, the step of performing similarity judgment on the entity-aligned standard question-answer pair set to obtain a suspected-similar standard question-answer pair set, an incompletely similar standard question-answer pair set and a dissimilar standard question-answer pair set comprises:

dividing the entity-aligned standard question-answer pair set by text category and entity data to obtain a plurality of standard-question text data and entity subsets and a plurality of standard-answer text data and entity subsets;

performing, by a cosine similarity calculation method, pairwise similarity calculation on the entity-aligned standard-question text data in each standard-question text data and entity subset, to obtain a standard-question similarity matrix corresponding to each standard-question text data and entity subset;

performing, by a cosine similarity calculation method, pairwise similarity calculation on the entity-aligned standard-answer text data in each standard-answer text data and entity subset, to obtain a standard-answer similarity matrix corresponding to each standard-answer text data and entity subset;

and performing similarity judgment according to the entity-aligned standard question-answer pair set, the standard-question similarity matrices and the standard-answer similarity matrices to obtain the suspected-similar standard question-answer pair set, the incompletely similar standard question-answer pair set and the dissimilar standard question-answer pair set.
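As an illustration of the pairwise cosine similarity calculation within one subset, a minimal sketch follows. The application does not state how text is vectorised; the character-count vectors used here (a simple choice that also works for unsegmented Chinese text) and the function names are assumptions of the sketch.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between character-count vectors of two texts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def similarity_matrix(texts: List[str]) -> List[List[float]]:
    """Pairwise similarity matrix for the texts of one entity subset."""
    return [[cosine_similarity(a, b) for b in texts] for a in texts]
```

The same function can be applied once to the standard-question texts and once to the standard-answer texts of a subset, yielding the two matrices used in the subsequent judgment.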

Further, the step of performing similarity judgment according to the entity-aligned standard question-answer pair set, the standard-question similarity matrices and the standard-answer similarity matrices to obtain the suspected-similar standard question-answer pair set, the incompletely similar standard question-answer pair set and the dissimilar standard question-answer pair set comprises:

extracting standard-question similarities column by column from the standard-question similarity matrices to obtain a plurality of standard-question similarity sets to be optimized;

extracting standard-answer similarities column by column from the standard-answer similarity matrices to obtain a plurality of standard-answer similarity sets to be optimized;

acquiring a second similarity threshold;

extracting, from the standard-question similarity sets to be optimized and the standard-answer similarity sets to be optimized respectively, the standard-question similarity to be optimized and the standard-answer similarity to be optimized corresponding to each entity-aligned standard question-answer pair in the entity-aligned standard question-answer pair set;

when the standard-question similarity to be optimized and the standard-answer similarity to be optimized corresponding to an entity-aligned standard question-answer pair are both greater than the second similarity threshold, determining the suspected-similar standard question-answer pair set according to that entity-aligned standard question-answer pair;

when either of the standard-question similarity to be optimized and the standard-answer similarity to be optimized corresponding to an entity-aligned standard question-answer pair is smaller than or equal to the second similarity threshold, determining a standard question-answer pair set to be distinguished according to that entity-aligned standard question-answer pair;

when, for an entity-aligned standard question-answer pair in the standard question-answer pair set to be distinguished, the attribute quantity of the standard-question text data and the attribute quantity of the standard-answer text data are both equal to 1, determining the dissimilar standard question-answer pair set according to that entity-aligned standard question-answer pair;

and when, for an entity-aligned standard question-answer pair in the standard question-answer pair set to be distinguished, either of the attribute quantity of the standard-question text data and the attribute quantity of the standard-answer text data is not equal to 1, determining the incompletely similar standard question-answer pair set according to that entity-aligned standard question-answer pair.
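The three-way judgment above may be sketched as a small decision function. The names `PairScores`, `Category`, `classify` and `best_match` are hypothetical. How the per-pair similarity is picked from a matrix column is not spelled out in the application; taking the largest off-diagonal value of the column is an assumption of the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Category(Enum):
    SUSPECTED_SIMILAR = "suspected similar"
    INCOMPLETELY_SIMILAR = "incompletely similar"
    DISSIMILAR = "dissimilar"


@dataclass
class PairScores:
    question_similarity: float  # standard-question similarity to be optimized
    answer_similarity: float  # standard-answer similarity to be optimized
    question_attribute_count: int
    answer_attribute_count: int


def best_match(matrix: List[List[float]], col: int) -> float:
    """Assumed selection rule: highest similarity to any other item in the column."""
    others = [row[col] for i, row in enumerate(matrix) if i != col]
    return max(others, default=0.0)


def classify(scores: PairScores, threshold: float) -> Category:
    """Apply the second similarity threshold, then the attribute-count test."""
    if scores.question_similarity > threshold and scores.answer_similarity > threshold:
        return Category.SUSPECTED_SIMILAR
    # The pair is "to be distinguished": at least one similarity is <= threshold.
    if scores.question_attribute_count == 1 and scores.answer_attribute_count == 1:
        return Category.DISSIMILAR
    return Category.INCOMPLETELY_SIMILAR
```

Pairs classified as dissimilar would be written to the target knowledge base directly, while the other two categories proceed to attribute deduplication.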

Further, the step of performing attribute deduplication and attribute value deduplication on the suspected-similar standard question-answer pair set and the incompletely similar standard question-answer pair set to obtain a deduplicated standard question-answer pair set comprises:

searching for and deleting, in the suspected-similar standard question-answer pair set and the incompletely similar standard question-answer pair set respectively, the text data whose attribute quantity is equal to 0, to obtain an optimized suspected-similar standard question-answer pair set and an optimized incompletely similar standard question-answer pair set;

extracting standard question-answer pairs in turn from the optimized suspected-similar standard question-answer pair set and the optimized incompletely similar standard question-answer pair set to obtain standard question-answer pairs to be deduplicated;

when the attribute quantity of the standard-question text data in a standard question-answer pair to be deduplicated is equal to 1, taking that standard question-answer pair to be deduplicated as a deduplicated standard question-answer pair;

when the attribute quantity of the standard-question text data in a standard question-answer pair to be deduplicated is greater than 1, or the attribute value quantity of the standard-answer text data in that pair is greater than 1, performing attribute separation and attribute value separation on that pair to obtain a plurality of single-attribute-value standard-question text data and a plurality of single-attribute-value standard-answer text data;

performing same-attribute deduplication and same-attribute-value deduplication on the single-attribute-value standard-question text data and the single-attribute-value standard-answer text data on the principle of retaining the text with the most characters, to obtain a deduplicated single-attribute-value standard-question text data set and a deduplicated single-attribute-value standard-answer text data set;

matching each single-attribute-value standard-question text data in the deduplicated single-attribute-value standard-question text data set with each single-attribute-value standard-answer text data in the deduplicated single-attribute-value standard-answer text data set to obtain a plurality of deduplicated standard question-answer pairs;

and determining the deduplicated standard question-answer pair set according to all the deduplicated standard question-answer pairs.
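The longest-text principle and the re-pairing step may be illustrated as follows. The application does not describe how attributes are separated from a text or how questions and answers are matched; the sketch therefore assumes that separation has already produced `(attribute, text)` fragments and that a question is matched to the answer carrying the same attribute. Both assumptions, and the names used, are hypothetical.

```python
from typing import Dict, List, Tuple

Fragment = Tuple[str, str]  # (attribute name, single-attribute-value text)


def keep_longest(fragments: List[Fragment]) -> Dict[str, str]:
    """For each attribute, retain only the text with the most characters."""
    best: Dict[str, str] = {}
    for attribute, text in fragments:
        if attribute not in best or len(text) > len(best[attribute]):
            best[attribute] = text
    return best


def rebuild_pairs(
    question_fragments: List[Fragment], answer_fragments: List[Fragment]
) -> List[Tuple[str, str]]:
    """Deduplicate both sides, then pair question and answer by shared attribute."""
    questions = keep_longest(question_fragments)
    answers = keep_longest(answer_fragments)
    return [(questions[a], answers[a]) for a in questions if a in answers]
```

Each rebuilt pair then carries exactly one attribute on the question side and one value on the answer side, which is the property the deduplication is meant to establish.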

Further, the step of searching for and deleting, in the suspected-similar standard question-answer pair set and the incompletely similar standard question-answer pair set respectively, the text data whose attribute quantity is equal to 0, to obtain an optimized suspected-similar standard question-answer pair set and an optimized incompletely similar standard question-answer pair set comprises:

finding, among all the standard-question text data in the suspected-similar standard question-answer pair set, the standard-question text data whose attribute quantity is equal to 0, to obtain first standard-question text data to be deleted;

deleting the first standard-question text data to be deleted from the suspected-similar standard question-answer pair set to obtain the optimized suspected-similar standard question-answer pair set;

finding, among all the standard-answer text data in the incompletely similar standard question-answer pair set, the standard-answer text data whose attribute quantity is equal to 0, to obtain first standard-answer text data to be deleted;

and deleting the first standard-answer text data to be deleted from the incompletely similar standard question-answer pair set to obtain the optimized incompletely similar standard question-answer pair set.
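A minimal sketch of the search-and-delete step follows. It assumes that attribute quantities have already been computed for each pair and that deleting a zero-attribute text removes the whole pair that contains it; the application does not state either point, and the type and function names are hypothetical.

```python
from typing import List, NamedTuple


class GovernedPair(NamedTuple):
    question: str
    answer: str
    question_attributes: int  # attribute quantity of the standard-question text
    answer_attributes: int  # attribute quantity of the standard-answer text


def drop_empty_questions(pairs: List[GovernedPair]) -> List[GovernedPair]:
    """Applied to the suspected-similar set: remove zero-attribute questions."""
    return [p for p in pairs if p.question_attributes != 0]


def drop_empty_answers(pairs: List[GovernedPair]) -> List[GovernedPair]:
    """Applied to the incompletely similar set: remove zero-attribute answers."""
    return [p for p in pairs if p.answer_attributes != 0]
```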

The present application further provides a device for governing the standard questions and standard answers of a knowledge base, the device comprising:

a data acquisition module, configured to obtain a plurality of standard question-answer pairs to be governed, wherein each standard question-answer pair to be governed comprises: standard-question text data to be governed and standard-answer text data to be governed;

an entity recognition module, configured to input the plurality of standard question-answer pairs to be governed into an entity recognition model for entity recognition to obtain an entity data set to be deduplicated corresponding to the plurality of standard question-answer pairs to be governed, wherein the entity recognition model is a model trained on the basis of a pre-training model bert _ this and a CRF network;

an entity data deduplication module, configured to perform entity data deduplication on the entity data set to be deduplicated to obtain a deduplicated entity data set corresponding to the plurality of standard question-answer pairs to be governed;

an entity data alignment module, configured to perform entity data alignment according to the deduplicated entity data set and the plurality of standard question-answer pairs to be governed to obtain an entity-aligned standard question-answer pair set;

a similarity judgment module, configured to perform similarity judgment on the entity-aligned standard question-answer pair set to obtain a suspected-similar standard question-answer pair set, an incompletely similar standard question-answer pair set and a dissimilar standard question-answer pair set, and to update the dissimilar standard question-answer pair set into a target knowledge base;

and an attribute deduplication and attribute value deduplication module, configured to perform attribute deduplication and attribute value deduplication on the suspected-similar standard question-answer pair set and the incompletely similar standard question-answer pair set to obtain a deduplicated standard question-answer pair set, and to update the deduplicated standard question-answer pair set into the target knowledge base.

The present application further proposes a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.

The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.

According to the method, device, equipment and storage medium for governing the standard questions and standard answers of a knowledge base, a plurality of standard question-answer pairs to be governed are input into an entity recognition model for entity recognition to obtain the entity data set to be deduplicated; entity data deduplication is performed on that set to obtain the deduplicated entity data set; entity data alignment is performed according to the deduplicated entity data set and the plurality of standard question-answer pairs to be governed to obtain the entity-aligned standard question-answer pair set; similarity judgment is performed on the entity-aligned standard question-answer pair set to obtain the suspected-similar, incompletely similar and dissimilar standard question-answer pair sets, and the dissimilar standard question-answer pair set is updated into the target knowledge base; and attribute deduplication and attribute value deduplication are performed on the suspected-similar and incompletely similar standard question-answer pair sets to obtain the deduplicated standard question-answer pair set, which is updated into the target knowledge base. As a result, the problems of prior-art knowledge bases do not occur in the target knowledge base, namely duplicated knowledge among standard questions, a standard question having multiple intentions or no intention, and incomplete or excessive answers caused by the number of intentions of a standard question not being equal to the number of values of its standard answer. The quality of the knowledge base is improved, continuous manual participation in the governance process is not needed, and governance efficiency is improved.

Drawings

FIG. 1 is a schematic flow chart of a method for governing the standard questions and standard answers of a knowledge base according to an embodiment of the present application;

FIG. 2 is a schematic block diagram of the structure of a device for governing the standard questions and standard answers of a knowledge base according to an embodiment of the present application;

FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.

The objectives, features, and advantages of the present application will be further described with reference to the accompanying drawings.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In order to solve the technical problems that, in prior-art knowledge bases, knowledge is duplicated among standard questions, a standard question has multiple intentions or no intention, and the number of intentions of a standard question is not equal to the number of values of its standard answer, so that answers are incomplete or excessive, the present application provides a method for governing the standard questions and standard answers of a knowledge base, applied in the technical field of artificial intelligence. The method first performs entity recognition, entity data deduplication and entity data alignment, and then performs similarity judgment on the text data to obtain suspected-similar, incompletely similar and dissimilar standard question-answer pairs. Attribute deduplication and attribute value deduplication are performed on the suspected-similar and incompletely similar standard question-answer pairs, and both the dissimilar standard question-answer pairs and the pairs that have undergone attribute deduplication and attribute value deduplication are updated into the target knowledge base. The above problems of the prior art are thereby avoided in the target knowledge base, the quality of the knowledge base is improved, continuous manual participation in the governance process is not needed, and governance efficiency is improved.

Referring to FIG. 1, an embodiment of the present application provides a method for governing the standard questions and standard answers of a knowledge base, the method comprising:

S1: obtaining a plurality of standard question-answer pairs to be governed, wherein each standard question-answer pair to be governed comprises: standard-question text data to be governed and standard-answer text data to be governed;

S2: inputting the plurality of standard question-answer pairs to be governed into an entity recognition model for entity recognition to obtain an entity data set to be deduplicated corresponding to the plurality of standard question-answer pairs to be governed, wherein the entity recognition model is a model trained on the basis of a pre-training model bert _ this and a CRF network;

S3: performing entity data deduplication on the entity data set to be deduplicated to obtain a deduplicated entity data set corresponding to the plurality of standard question-answer pairs to be governed;

S4: performing entity data alignment according to the deduplicated entity data set and the plurality of standard question-answer pairs to be governed to obtain an entity-aligned standard question-answer pair set;

S5: performing similarity judgment on the entity-aligned standard question-answer pair set to obtain a suspected-similar standard question-answer pair set, an incompletely similar standard question-answer pair set and a dissimilar standard question-answer pair set, and updating the dissimilar standard question-answer pair set into a target knowledge base;

S6: performing attribute deduplication and attribute value deduplication on the suspected-similar standard question-answer pair set and the incompletely similar standard question-answer pair set to obtain a deduplicated standard question-answer pair set, and updating the deduplicated standard question-answer pair set into the target knowledge base.

In this embodiment, a plurality of standard question-answer pairs to be governed are input into an entity recognition model for entity recognition to obtain the entity data set to be deduplicated; entity data deduplication is performed on that set to obtain the deduplicated entity data set; entity data alignment is performed according to the deduplicated entity data set and the plurality of standard question-answer pairs to be governed to obtain the entity-aligned standard question-answer pair set; similarity judgment is performed on the entity-aligned set to obtain the suspected-similar, incompletely similar and dissimilar standard question-answer pair sets, and the dissimilar set is updated into the target knowledge base; and attribute deduplication and attribute value deduplication are performed on the suspected-similar and incompletely similar sets to obtain the deduplicated standard question-answer pair set, which is updated into the target knowledge base. As a result, the problems of prior-art knowledge bases do not occur in the target knowledge base, namely duplicated knowledge among standard questions, a standard question having multiple intentions or no intention, and incomplete or excessive answers caused by the number of intentions of a standard question not being equal to the number of values of its standard answer. The quality of the knowledge base is improved, continuous manual participation in the governance process is not needed, and governance efficiency is improved.
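The overall flow of S1 to S6 may be sketched as a single orchestration function, given only as an illustration. The individual stages are passed in as callables because the application leaves their implementation to the embodiments; the function name `govern` and the parameter names `recognize`, `align`, `judge` and `deduplicate` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (standard-question text, standard-answer text)
Judged = Tuple[List[Pair], List[Pair], List[Pair]]  # suspected, incomplete, dissimilar


def govern(
    pairs: Sequence[Pair],  # S1: pairs to be governed, already obtained
    recognize: Callable[[str], List[str]],
    align: Callable[[Sequence[Pair], List[str]], List[Pair]],
    judge: Callable[[List[Pair]], Judged],
    deduplicate: Callable[[List[Pair]], List[Pair]],
) -> List[Pair]:
    """Return the pairs to be written into the target knowledge base."""
    # S2: entity recognition over both the question and the answer text.
    entities = [e for q, a in pairs for e in recognize(q) + recognize(a)]
    # S3: order-preserving entity deduplication.
    unique_entities = list(dict.fromkeys(entities))
    # S4: entity alignment of the pairs.
    aligned = align(pairs, unique_entities)
    # S5: similarity judgment; dissimilar pairs go to the target directly.
    suspected, incomplete, dissimilar = judge(aligned)
    target = list(dissimilar)
    # S6: attribute and attribute value deduplication of the remaining pairs.
    target.extend(deduplicate(suspected + incomplete))
    return target
```

Keeping the stages as injected callables lets each one be replaced or tested independently, for example by substituting a rule-based recognizer for the trained model during testing.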

Corresponding to S1, the plurality of standard question-answer pairs to be governed may be obtained from a database, may be input by a user, or may be sent by a third-party application system.

A standard question-answer pair to be governed is a standard question-answer pair that needs to be governed in order to avoid duplicated knowledge among standard questions, a standard question having multiple intentions or no intention, and incomplete or excessive answers caused by the number of intentions of a standard question not being equal to the number of values of its standard answer.

The standard-question text data to be governed is standard-question text data that needs to be governed in order to avoid the same problems. Standard-question text data is text data describing a standard question.

The standard-answer text data to be governed is standard-answer text data that needs to be governed in order to avoid the same problems. Standard-answer text data is text data describing a standard answer.

Within the same standard question-answer pair to be governed, the standard-answer text data to be governed is the text of the answer to the standard-question text data to be governed.

Corresponding to S2, the standard-question text data to be governed and the standard-answer text data to be governed of each of the plurality of standard question-answer pairs to be governed are input into the entity recognition model for entity recognition, and all the entity data output by the entity recognition model are taken as the entity data set to be deduplicated.

The pre-training model bert _ this is a model obtained by training based on the bert-base-Chinese network. The method of obtaining the pre-training model bert _ this by training based on the bert-base-Chinese network can be selected from the prior art and is not described herein again.

The method of training the entity recognition model based on the pre-training model bert _ this and a CRF (conditional random field) network can be selected from the prior art and is not described herein again.

Corresponding to S3, deduplication is performed on all the entity data in the entity data set to be deduplicated, and the resulting set is taken as the deduplicated entity data set corresponding to the plurality of standard question-answer pairs to be governed. That is, each entity data item in the deduplicated entity data set is unique.

Entity data refers to entities in a triplet (an entity is an abstraction of an objective individual, and a person, a movie, or a sentence can be considered as an entity).
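The entity deduplication of S3 may be illustrated by the following minimal sketch. It assumes entity data are plain strings and that exact string equality defines a duplicate; near-duplicates are left to the alignment of S4. The function name is hypothetical.

```python
from typing import Iterable, List


def deduplicate_entities(entities: Iterable[str]) -> List[str]:
    """Remove exact duplicates while keeping first-occurrence order."""
    # dict preserves insertion order, so this keeps the first copy of each entity.
    return list(dict.fromkeys(entities))
```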

Corresponding to S4, determining the optimal entity data according to the entity similarity among the entity data in the entity data set after the duplication removal; and replacing the plurality of question mark-and-answer pairs to be treated by adopting optimal entity data, and taking the replaced plurality of question mark-and-answer pairs to be treated as the question mark-and-answer pair set after the entities are aligned.

Corresponding to step S5, similarity judgment is performed according to the entity data of the standard question text data and the entity data of the standard answer text data in the entity-aligned standard question-answer pair set, and the entity-aligned set is divided, according to the similarity judgment result, into a suspected-similar standard question-answer pair set, an incompletely similar standard question-answer pair set and a dissimilar standard question-answer pair set. The dissimilar set is free of the problems found in prior-art knowledge bases, namely knowledge repeated across standard questions, a standard question carrying several intentions or none, and a mismatch between the number of intentions of a standard question and the number of values of its standard answer, which leads to incomplete or excessive answers. The dissimilar set can therefore be updated into the target knowledge base directly. The target knowledge base is a knowledge base newly established after treatment, so the data in it is free of these same problems.

Corresponding to step S6, attribute deduplication and attribute value deduplication are performed on the standard question text data and the standard answer text data of the suspected-similar standard question-answer pair set and the incompletely similar standard question-answer pair set, and the deduplicated standard question-answer pair set is determined from the two sets after this processing. The deduplicated set is likewise free of knowledge repeated across standard questions, of standard questions carrying several intentions or none, and of mismatches between the number of intentions of a standard question and the number of values of its standard answer, so it can be updated into the target knowledge base directly.

It can be understood that the target knowledge base is initially empty, and standard question-answer pairs are added to it through steps S5 and S6.

In an embodiment, before the step of inputting the plurality of standard question-answer pairs to be treated into the entity recognition model for entity recognition to obtain the entity data set to be deduplicated corresponding to the plurality of standard question-answer pairs to be treated, the method further includes:

S021: obtaining a plurality of training samples, each training sample comprising text sample data to be trained and text sample calibration data;

S022: dividing the plurality of training samples according to a preset division rule to obtain a training set and a verification set;

S023: training a first model to be trained with the training set, and determining the trained first model as a first model to be verified, wherein the first model to be trained is a model obtained on the basis of the pre-trained model bert_this and the CRF network;

S024: verifying the first model to be verified with the verification set, and determining the first model to be verified as the entity recognition model when the verification succeeds.

This embodiment obtains the entity recognition model by training on the basis of the pre-trained model bert_this and the CRF network, which provides the basis for entity recognition on the plurality of standard question-answer pairs to be treated.

Corresponding to S021, the plurality of training samples may be obtained from a database, input by a user, or sent by a third-party application system.

Optionally, the plurality of training samples are obtained from an untreated knowledge base.

Each training sample comprises one item of text sample data to be trained and one item of text sample calibration data. The text sample data to be trained may be standard question text data or standard answer text data in the untreated knowledge base.

Within the same training sample, the text sample calibration data is the calibrated (labeled) result of entity data recognition for the text sample data to be trained.

Corresponding to step S022, each of the plurality of training samples is assigned to either the training set or the verification set according to the preset division rule; that is, no training sample appears in both the training set and the verification set.

Optionally, the preset division rule assigns 80% of the training samples to the training set and 20% to the verification set. It can be understood that the preset division rule may also be another rule, which is not specifically limited here.
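As an illustration of the optional 80%/20% division rule, the following is a minimal sketch. The function name `split_samples`, the fixed random seed and the shuffling step are assumptions made for the example; the description does not state how samples are ordered before division.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples and split them into two disjoint sets.

    Shuffling and the seed are illustrative assumptions; only the 80/20
    proportion comes from the description.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical samples: (text sample data to be trained, calibration data).
samples = [(f"text {i}", f"labels {i}") for i in range(10)]
train_set, verify_set = split_samples(samples)
```

Because the two slices come from one shuffled list, no sample can land in both sets, which matches the requirement that the training set and the verification set do not overlap.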

Corresponding to S023, the first model to be trained is trained with the training samples in the training set, using a cross-entropy loss function and an Adam optimizer with a learning rate of 1e-5, with accuracy as the evaluation metric; the first model to be trained, once trained, is determined as the first model to be verified.

The first model to be trained comprises a bert-base-chinese module and a CRF module. The bert-base-chinese module is trained to obtain the bert_this module.

For the bert_this module, the batch size is 64, the learning rate is 3e-5, the number of training steps is 50000, and the number of learning-rate warmup steps is 5000.

Corresponding to S024, the method for verifying the first model to be verified using the verification set may be selected from the prior art, and is not described herein again.

When the verification succeeds, the first model to be verified is determined as the entity recognition model; otherwise, steps S023 to S024 are executed again until the verification succeeds.

In an embodiment, the step of performing entity data alignment according to the deduplicated entity data set and the plurality of standard question-answer pairs to be treated to obtain the entity-aligned standard question-answer pair set includes:

S41: performing pairwise similarity calculation on all entity data in the deduplicated entity data set using a minimum edit distance calculation method to obtain the entity similarity matrix corresponding to the plurality of standard question-answer pairs to be treated;

S42: extracting entity similarities column by column from the entity similarity matrix corresponding to the plurality of standard question-answer pairs to be treated to obtain a plurality of entity similarity sets to be optimized;

S43: acquiring a first similarity threshold, and obtaining from each entity similarity set to be optimized the entity similarities greater than the first similarity threshold, to obtain the entity similarity set to be aligned corresponding to each entity similarity set to be optimized;

S44: obtaining the entity data set to be screened corresponding to each entity similarity set to be aligned, according to that entity similarity set to be aligned and the deduplicated entity data set;

S45: obtaining, from the entity data set to be screened corresponding to each entity similarity set to be aligned, the entity data with the most characters, to obtain the optimal entity data corresponding to each entity similarity set to be aligned;

S46: replacing entity data in the plurality of standard question-answer pairs to be treated with the optimal entity data corresponding to each entity similarity set to be aligned, to obtain the entity-aligned standard question-answer pair set.

This embodiment performs entity data alignment according to the entity similarity among the entity data in the deduplicated entity data set, which helps normalize the entity data of the entity-aligned standard question-answer pair set and thereby improves the quality of the standard question-answer treatment.

Corresponding to S41, the entity similarity is calculated as:

$$\mathrm{similarity} = 1 - \frac{ED_{AB}}{\max(L_A, L_B)}$$

where similarity is the entity similarity in the entity similarity matrix corresponding to the plurality of standard question-answer pairs to be treated, $ED_{AB}$ is the minimum edit distance between entity data A and entity data B, $L_A$ is the character length of entity data A, $L_B$ is the character length of entity data B, and max() is a function that returns the maximum value.

If the number of entities in the deduplicated entity data set is EN, the entity similarity matrix corresponding to the plurality of standard question-answer pairs to be treated has EN rows and EN columns, and each element of the matrix represents the similarity between the entity data corresponding to its row number and the entity data corresponding to its column number. For example, the element in row 3, column 5 of the entity similarity matrix is the similarity between the entity data corresponding to row 3 and the entity data corresponding to column 5; this example is not limiting.

Optionally, a row number and a column number with the same value in the entity similarity matrix correspond to the same entity data. For example, for the element in row 3, column 3 of the entity similarity matrix, the entity data corresponding to row 3 is the same as the entity data corresponding to column 3; this example is not limiting.
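The pairwise calculation of S41 can be sketched as follows. This is an illustrative sketch only: it assumes the similarity is normalized as 1 - ED_AB / max(L_A, L_B), which is consistent with the symbol definitions given for S41, and the function names are hypothetical. Returning 1.0 for two empty strings is also an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum edit (Levenshtein) distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def entity_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - ED_AB / max(L_A, L_B)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - edit_distance(a, b) / longest


def similarity_matrix(entities):
    """EN x EN matrix; row i and column i refer to the same entity."""
    return [[entity_similarity(r, c) for c in entities] for r in entities]


matrix = similarity_matrix(["email address", "e-mail address", "loan"])
```

The matrix is symmetric with ones on the diagonal, which is why extracting by rows or by columns gives the same sets.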

Corresponding to S42, entity similarities are extracted column by column from the entity similarity matrix corresponding to the plurality of standard question-answer pairs to be treated; that is, each column of elements in the entity similarity matrix is taken as one entity similarity set to be optimized.

It can be understood that the number of entities in the deduplicated entity data set is EN, and the number of entity similarity sets to be optimized in the plurality of entity similarity sets to be optimized is also EN.

It can be understood that, in another embodiment, entity similarities may instead be extracted row by row from the entity similarity matrix corresponding to the plurality of standard question-answer pairs to be treated to obtain the plurality of entity similarity sets to be optimized; extraction by rows has the same effect as extraction by columns.

Corresponding to S43, the first similarity threshold may be obtained from the database, or may be the first similarity threshold input by the user, or may be the first similarity threshold sent by the third-party application system. It will be appreciated that the first similarity threshold may also be written in a program file implementing the present application.

One entity similarity set to be optimized is taken from the plurality of entity similarity sets to be optimized as the target entity similarity set to be optimized. Each entity similarity in the target set is compared with the first similarity threshold; every entity similarity greater than the first similarity threshold is taken as an entity similarity to be aligned, and all such entity similarities together form the entity similarity set to be aligned corresponding to the target entity similarity set to be optimized. This step is repeated until the entity similarity set to be aligned has been determined for every entity similarity set to be optimized.

Corresponding to S44, one entity similarity set to be aligned is extracted from the entity similarity sets to be aligned as the target entity similarity set to be aligned. The entity data corresponding to each entity similarity in the target set is looked up in the deduplicated entity data set; each item of entity data found is taken as entity data to be screened, and all such items together form the entity data set to be screened corresponding to the target entity similarity set to be aligned. This step is repeated until the entity data set to be screened has been determined for every entity similarity set to be aligned.

For example, when step S42 extracts by column and the target entity similarity set to be aligned contains the entity similarity at row 21, column 32, the entity data in the deduplicated entity data set that corresponds to the entity similarity at row 21, column 32 is taken as the entity data to be screened for that entity similarity; this example is not limiting.

Corresponding to S45, one entity similarity set to be aligned is extracted as the target entity similarity set to be aligned. The entity data with the most characters is obtained from the entity data set to be screened corresponding to the target set and is taken as the optimal entity data corresponding to the target set. This step is repeated until the optimal entity data has been determined for every entity similarity set to be aligned.

Corresponding to S46, one entity similarity set to be aligned is extracted as the target entity similarity set to be aligned. The entity data corresponding to the target set is taken as the entity data set to be replaced, and the plurality of standard question-answer pairs to be treated are updated using the entity data set to be replaced and the optimal entity data corresponding to the target set. This step is repeated until the optimal entity data of every entity similarity set to be aligned has been applied to the plurality of standard question-answer pairs to be treated.

Updating the pairs in this way means that, wherever entity data in the plurality of standard question-answer pairs to be treated matches an item in the entity data set to be replaced corresponding to an entity similarity set to be aligned, it is replaced with the optimal entity data corresponding to that entity similarity set to be aligned.
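Steps S42 to S46 can be sketched roughly as below, taking an already computed entity similarity matrix as input. The function names, the example threshold of 0.8 and the hand-written matrix values are hypothetical. Two details are assumptions not stated in the description: ties in character count are resolved in favor of the entity that appears first, and replacement is done in a single regular-expression pass so that already replaced text is not matched again.

```python
import re


def choose_optimal_entities(entities, matrix, threshold):
    """Map every entity to the longest entity in its column-wise similar group."""
    optimal = {}
    for col, entity in enumerate(entities):
        # S42/S43: scan one column and keep rows whose similarity exceeds the threshold.
        # S44: look the corresponding entities up in the deduplicated list.
        candidates = [entities[row] for row in range(len(entities))
                      if matrix[row][col] > threshold]
        # S45: keep the candidate with the most characters.
        if candidates:
            optimal[entity] = max(candidates, key=len)
    return optimal


def replace_entities(text, optimal):
    """S46: replace each known entity mention with its optimal form."""
    if not optimal:
        return text
    # Longer alternatives come first so that a longer mention wins over its prefix.
    names = sorted(optimal, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda m: optimal[m.group(0)], text)


entities = ["email address", "e-mail address", "loan"]
matrix = [[1.0, 0.93, 0.1],
          [0.93, 1.0, 0.1],
          [0.1, 0.1, 1.0]]
optimal = choose_optimal_entities(entities, matrix, threshold=0.8)
aligned = replace_entities("How do I change my email address?", optimal)
```

Here both spellings map to "e-mail address" because it has more characters, while "loan" maps to itself.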

In an embodiment, the step of performing similarity judgment on the entity-aligned standard question-answer pair set to obtain the suspected-similar standard question-answer pair set, the incompletely similar standard question-answer pair set and the dissimilar standard question-answer pair set includes:

S51: dividing the entity-aligned standard question-answer pair set by text category and entity data to obtain a plurality of standard question text data same-entity subsets and a plurality of standard answer text data same-entity subsets;

S52: performing pairwise similarity calculation, using a cosine similarity calculation method, on the entity-aligned standard question text data in each standard question text data same-entity subset, to obtain the standard question similarity matrix corresponding to each standard question text data same-entity subset;

S53: performing pairwise similarity calculation, using the cosine similarity calculation method, on the entity-aligned standard answer text data in each standard answer text data same-entity subset, to obtain the standard answer similarity matrix corresponding to each standard answer text data same-entity subset;

S54: performing similarity judgment according to the entity-aligned standard question-answer pair set, the standard question similarity matrices and the standard answer similarity matrices to obtain the suspected-similar standard question-answer pair set, the incompletely similar standard question-answer pair set and the dissimilar standard question-answer pair set.

This embodiment performs similarity judgment according to the entity data of the standard question text data and the standard answer text data in the entity-aligned standard question-answer pair set, so that the pairs can be classified according to the similarity judgment result. This provides the basis for solving the technical problems of knowledge repeated across standard questions, of a standard question carrying several intentions or none, and of the number of intentions of a standard question differing from the number of values of its standard answer, which leads to incomplete or excessive answers.

Corresponding to S51, the entity-aligned standard question text data in the entity-aligned standard question-answer pair set is divided according to entity data to obtain a plurality of standard question text data same-entity subsets, where all items in one standard question text data same-entity subset share the same entity data. Likewise, the entity-aligned standard answer text data in the entity-aligned standard question-answer pair set is divided according to entity data to obtain a plurality of standard answer text data same-entity subsets, where all items in one standard answer text data same-entity subset share the same entity data.

Corresponding to step S52, any one of the plurality of standard question text data same-entity subsets is taken as the target standard question text data same-entity subset. Each item of entity-aligned standard question text data in the target subset is input into the pre-trained language model bert_this for flag-bit vector prediction, giving the target first vectors corresponding to all entity-aligned standard question text data in the target subset. The standard question similarity between every two target first vectors is then calculated using the cosine similarity calculation method, giving the standard question similarity matrix corresponding to the target standard question text data same-entity subset.

The cosine similarity cos(θ) used in the cosine similarity calculation method is calculated as:

$$\cos(\theta) = \frac{a \cdot b}{\|a\|\,\|b\|}$$

where a is the first and b is the second of any two target first vectors, a · b is the dot product of vector a and vector b, and ‖a‖ is the modulus of vector a (likewise ‖b‖ for vector b).
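The cosine similarity calculation, and the pairwise matrix built from it, can be sketched as follows. The vectors would in practice come from the pre-trained language model; the short lists used here are made-up stand-ins, the function names are hypothetical, and returning 0.0 for a zero vector is an assumption of this sketch.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def cosine_matrix(vectors):
    """Pairwise similarity matrix for the vectors of one same-entity subset."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Made-up 2-dimensional stand-ins for model output vectors.
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
sims = cosine_matrix(vectors)
```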

Corresponding to step S53, any one of the plurality of standard answer text data same-entity subsets is taken as the target standard answer text data same-entity subset. Each item of entity-aligned standard answer text data in the target subset is input into the pre-trained language model bert_this for flag-bit vector prediction, giving the target second vectors corresponding to all entity-aligned standard answer text data in the target subset. The standard answer similarity between every two target second vectors is then calculated using the cosine similarity calculation method, giving the standard answer similarity matrix corresponding to the target standard answer text data same-entity subset.

Corresponding to S54, similarity judgment is performed according to the standard question similarity matrices and the standard answer similarity matrices, and the standard question-answer pairs in the entity-aligned standard question-answer pair set are classified according to the similarity judgment result; once classification is complete, the suspected-similar standard question-answer pair set, the incompletely similar standard question-answer pair set and the dissimilar standard question-answer pair set are obtained.

In an embodiment, the step of performing similarity judgment according to the entity-aligned standard question-answer pair set, the standard question similarity matrix and the standard answer similarity matrix to obtain the suspected-similar standard question-answer pair set, the incompletely similar standard question-answer pair set and the dissimilar standard question-answer pair set includes:

S541: extracting standard question similarities column by column from the standard question similarity matrix to obtain a plurality of standard question similarity sets to be optimized;

S542: extracting standard answer similarities column by column from the standard answer similarity matrix to obtain a plurality of standard answer similarity sets to be optimized;

S543: acquiring a second similarity threshold;

S544: extracting, from the plurality of standard question similarity sets to be optimized and the plurality of standard answer similarity sets to be optimized respectively, the standard question similarity to be optimized and the standard answer similarity to be optimized corresponding to each entity-aligned standard question-answer pair in the entity-aligned standard question-answer pair set;

S545: when the standard question similarity to be optimized and the standard answer similarity to be optimized corresponding to an entity-aligned standard question-answer pair in the entity-aligned standard question-answer pair set are both greater than the second similarity threshold, determining the suspected-similar standard question-answer pair set according to that entity-aligned standard question-answer pair;

S546: when either the standard question similarity to be optimized or the standard answer similarity to be optimized corresponding to an entity-aligned standard question-answer pair in the entity-aligned standard question-answer pair set is less than or equal to the second similarity threshold, determining a standard question-answer pair set to be distinguished according to that entity-aligned standard question-answer pair;

S547: when the attribute quantity of the standard question text data and the attribute quantity of the standard answer text data of an entity-aligned standard question-answer pair in the standard question-answer pair set to be distinguished are both equal to 1, determining the dissimilar standard question-answer pair set according to that entity-aligned standard question-answer pair;

S548: when either the attribute quantity of the standard question text data or the attribute quantity of the standard answer text data of an entity-aligned standard question-answer pair in the standard question-answer pair set to be distinguished is not equal to 1, determining the incompletely similar standard question-answer pair set according to that entity-aligned standard question-answer pair.

This embodiment performs similarity judgment according to the entity-aligned standard question-answer pair set, the standard question similarity matrix and the standard answer similarity matrix, so that the pairs in the entity-aligned standard question-answer pair set can be classified according to similarity. This provides the basis for solving the technical problems of knowledge repeated across standard questions, of a standard question carrying several intentions or none, and of the number of intentions of a standard question differing from the number of values of its standard answer, which leads to incomplete or excessive answers.

Corresponding to S541, standard question similarities are extracted column by column from the standard question similarity matrix; that is, each column of elements in the standard question similarity matrix is taken as one standard question similarity set to be optimized.

It can be understood that, in another embodiment, standard question similarities may instead be extracted row by row from the standard question similarity matrix to obtain the plurality of standard question similarity sets to be optimized; extraction by rows has the same effect as extraction by columns.

Corresponding to S542, standard answer similarities are extracted column by column from the standard answer similarity matrix; that is, each column of elements in the standard answer similarity matrix is taken as one standard answer similarity set to be optimized.

It can be understood that, in another embodiment, standard answer similarities may instead be extracted row by row from the standard answer similarity matrix to obtain the plurality of standard answer similarity sets to be optimized; extraction by rows has the same effect as extraction by columns.

Corresponding to S543, the second similarity threshold may be obtained from the database, or may be a second similarity threshold input by the user, or may be a second similarity threshold sent by the third-party application system. It will be appreciated that the second similarity threshold may also be written in a program file implementing the present application.

Corresponding to S544, any entity-aligned standard question-answer pair in the entity-aligned standard question-answer pair set is taken as the target entity-aligned standard question-answer pair. The standard question similarity corresponding to the target pair is extracted from the plurality of standard question similarity sets to be optimized as the standard question similarity to be optimized of the target pair, and the standard answer similarity corresponding to the target pair is extracted from the plurality of standard answer similarity sets to be optimized as the standard answer similarity to be optimized of the target pair.

Corresponding to S545, any entity-aligned standard question-answer pair in the entity-aligned standard question-answer pair set is taken as the target entity-aligned standard question-answer pair. When the standard question similarity to be optimized and the standard answer similarity to be optimized corresponding to the target pair are both greater than the second similarity threshold, the target pair is highly similar to other data in the entity-aligned standard question-answer pair set, and the target pair is placed in the suspected-similar standard question-answer pair set.

Corresponding to S546, any entity-aligned standard question-answer pair in the entity-aligned standard question-answer pair set is taken as the target entity-aligned standard question-answer pair. When either the standard question similarity to be optimized or the standard answer similarity to be optimized corresponding to the target pair is less than or equal to the second similarity threshold, the target pair may be similar to other data in the entity-aligned standard question-answer pair set, and the target pair is placed in the standard question-answer pair set to be distinguished.

Corresponding to step S547, any entity-aligned standard question-answer pair in the standard question-answer pair set to be distinguished is taken as the entity-aligned standard question-answer pair to be distinguished. When the attribute quantity of its standard question text data and the attribute quantity of its standard answer text data are both equal to 1, the pair has a single attribute and needs no attribute deduplication, and the pair is placed in the dissimilar standard question-answer pair set.

Corresponding to S548, any entity-aligned standard question-answer pair in the standard question-answer pair set to be distinguished is taken as the entity-aligned standard question-answer pair to be distinguished. When either the attribute quantity of its standard question text data or the attribute quantity of its standard answer text data is not equal to 1, the pair does not have a single attribute and needs attribute deduplication, and the pair is placed in the incompletely similar standard question-answer pair set.
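The decision logic of S545 to S548 can be summarized in a small sketch. It assumes that the per-pair standard question similarity and standard answer similarity have already been extracted (S544) and that the attribute counts are known; how those values are derived is not covered here. The function name and the string labels are hypothetical.

```python
def classify_pair(question_sim, answer_sim, question_attrs, answer_attrs, threshold):
    """Return the set a pair belongs to, following S545 to S548."""
    # S545: both similarities exceed the second similarity threshold.
    if question_sim > threshold and answer_sim > threshold:
        return "suspected_similar"
    # S546: otherwise the pair is "to be distinguished" and is split by attribute count.
    # S547: a single attribute on both sides needs no attribute deduplication.
    if question_attrs == 1 and answer_attrs == 1:
        return "dissimilar"
    # S548: at least one side does not have exactly one attribute.
    return "incompletely_similar"


labels = [
    classify_pair(0.95, 0.91, 1, 1, threshold=0.9),
    classify_pair(0.95, 0.60, 1, 1, threshold=0.9),
    classify_pair(0.40, 0.30, 2, 1, threshold=0.9),
]
```

Pairs labeled dissimilar go straight into the target knowledge base, while the other two groups continue to the deduplication step.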

In an embodiment, the step of performing attribute deduplication and attribute value deduplication on the suspected-similar standard question-answer pair set and the incompletely similar standard question-answer pair set to obtain the deduplicated standard question-answer pair set includes:

S61: searching the suspected-similar standard question-answer pair set and the incompletely similar standard question-answer pair set respectively for standard question text data whose attribute quantity is equal to 0 and deleting it, to obtain an optimized suspected-similar standard question-answer pair set and an optimized incompletely similar standard question-answer pair set;

S62: extracting standard question-answer pairs in turn from the optimized suspected-similar standard question-answer pair set and the optimized incompletely similar standard question-answer pair set respectively, to obtain a standard question-answer pair to be deduplicated;

S63: when the attribute quantity of the standard question text data in the standard question-answer pair to be deduplicated is equal to 1, taking the standard question-answer pair to be deduplicated as a deduplicated standard question-answer pair;

S64: when the attribute quantity of the standard question text data in the standard question-answer pair to be deduplicated is greater than 1, or the attribute value quantity of the standard answer text data in the standard question-answer pair to be deduplicated is greater than 1, performing attribute separation and attribute value separation on the standard question-answer pair to be deduplicated to obtain a plurality of single-attribute-value standard question text data and a plurality of single-attribute-value standard answer text data;

S65: performing same-attribute deduplication and same-attribute-value deduplication on the plurality of single-attribute-value standard question text data and the plurality of single-attribute-value standard answer text data, retaining in each case the item with the most characters, to obtain a deduplicated single-attribute-value standard question text data set and a deduplicated single-attribute-value standard answer text data set;

S66: matching each item of single-attribute-value standard question text data in the deduplicated single-attribute-value standard question text data set with each item of single-attribute-value standard answer text data in the deduplicated single-attribute-value standard answer text data set, to obtain a plurality of deduplicated standard question-answer pairs;

S67: determining the deduplicated standard question-answer pair set according to all the deduplicated standard question-answer pairs.

This embodiment implements attribute deduplication processing and attribute value deduplication processing, which improves the quality of the governed knowledge base. Because no continuous manual participation in the governance process is needed, governance efficiency is also improved.

Corresponding to S61, the question text data whose attribute quantity is equal to 0 is found in the suspected similar question-answer pair set and deleted from that set, and the set after deletion is taken as the optimized suspected similar question-answer pair set. Likewise, the question text data whose attribute quantity is equal to 0 is found in the incompletely similar question-answer pair set and deleted from that set, and the set after deletion is taken as the optimized incompletely similar question-answer pair set. This resolves the problem of a standard question that has no intention.

The attributes counted by the attribute quantity are the attributes in triples (i.e., abstractions of entities and relationships between entities).

Corresponding to step S62, question-answer pairs are extracted one at a time from the optimized suspected similar question-answer pair set and from the optimized incompletely similar question-answer pair set, and each extracted pair is taken as a question-answer pair to be deduplicated.

Corresponding to S63, when the attribute quantity of the question text data in the question-answer pair to be deduplicated is equal to 1, the pair has a single attribute and needs no attribute deduplication processing, so the question-answer pair to be deduplicated is taken directly as a deduplicated question-answer pair.

Corresponding to S64, when the attribute quantity of the question text data in the question-answer pair to be deduplicated is greater than 1, or the number of attribute values of the question text data in the question-answer pair to be deduplicated is greater than 1, the pair does not have a single attribute and attribute deduplication processing is needed. The question text data of the pair is separated by single attribute to obtain a plurality of single-attribute-value question text data, and the answer text data of the pair is separated by single attribute to obtain a plurality of single-attribute-value answer text data. That is, each single-attribute-value question text data contains exactly one attribute and exactly one attribute value, and each single-attribute-value answer text data likewise contains exactly one attribute and exactly one attribute value.

Attribute values are the values of attributes in triples (values describe entities and can be classified as textual or numeric).
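The attribute separation and attribute value separation described for S64 can be sketched as follows. The application does not specify how attributes and attribute values are stored, so the mapping from each attribute to a list of its values, and the function name, are assumptions made only for illustration.

```python
def separate_by_attribute(annotations):
    """Split an {attribute: [value, ...]} mapping into single (attribute, value) records.

    Each returned record carries exactly one attribute and one attribute value,
    mirroring the "single-attribute-value" text data described for S64.
    The input layout is a hypothetical representation, not the application's own.
    """
    records = []
    for attribute, values in annotations.items():
        for value in values:
            records.append((attribute, value))
    return records
```

For example, a text annotated with one attribute that has two values and a second attribute that has one value yields three single-attribute-value records.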

Corresponding to step S65, same-attribute deduplication processing and same-attribute-value deduplication processing are performed on the single-attribute-value question text data, following the principle of retaining the attribute with the longest characters and the principle of retaining the attribute value with the longest characters, to obtain the deduplicated single-attribute-value question text data set. The same two kinds of deduplication processing, following the same two principles, are performed on the single-attribute-value answer text data to obtain the deduplicated single-attribute-value answer text data set.
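One possible reading of the longest-character principle in S65 is sketched below. It assumes that each single-attribute-value record is a hypothetical `(attribute, value, text)` tuple, that records sharing the same attribute and attribute value count as duplicates, and that the record with the longest text is the one retained. The application does not spell out these details, so this is an illustration rather than the described implementation.

```python
def dedupe_keep_longest(records):
    """Keep, for each (attribute, value) key, the record whose text is longest.

    records: iterable of (attribute, value, text) tuples (assumed layout).
    On a tie in length, the record seen first is kept.
    """
    best = {}
    for attribute, value, text in records:
        key = (attribute, value)
        if key not in best or len(text) > len(best[key][2]):
            best[key] = (attribute, value, text)
    # dict preserves the order in which each key was first seen
    return list(best.values())
```

The same function can be applied separately to the question records and to the answer records, matching the two deduplicated sets that S65 produces.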

Corresponding to S66, the single-attribute-value question text data in the deduplicated single-attribute-value question text data set and the single-attribute-value answer text data in the deduplicated single-attribute-value answer text data set are paired one by one to obtain a plurality of deduplicated question-answer pairs.
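The pairing in S66 can be sketched under one explicit assumption: the application does not state the matching criterion, so the sketch below pairs a question record with an answer record when both carry the same attribute. The `(attribute, text)` record layout and the function name are hypothetical.

```python
def pair_by_attribute(questions, answers):
    """Pair question and answer records that share an attribute.

    questions, answers: lists of (attribute, text) tuples (assumed layout).
    Returns a list of (question_text, answer_text) tuples.
    Matching on a shared attribute is an assumption, not stated in the source.
    """
    answers_by_attr = {}
    for attribute, text in answers:
        answers_by_attr.setdefault(attribute, []).append(text)

    pairs = []
    for attribute, q_text in questions:
        for a_text in answers_by_attr.get(attribute, []):
            pairs.append((q_text, a_text))
    return pairs
```

A question record whose attribute has no counterpart among the answer records produces no pair in this sketch; whether such records should be kept or discarded is not addressed by the source.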

Steps S62 to S66 are repeated until every question-answer pair in the optimized suspected similar question-answer pair set and in the optimized incompletely similar question-answer pair set has been processed.

Corresponding to step S67, all the deduplicated question-answer pairs are taken together as the deduplicated question-answer pair set.

In an embodiment, the step of searching for and deleting the question text data whose attribute quantity is equal to 0 in the suspected similar question-answer pair set and in the incompletely similar question-answer pair set, respectively, to obtain the optimized suspected similar question-answer pair set and the optimized incompletely similar question-answer pair set includes:

S611: finding the question text data whose attribute quantity is equal to 0 among all the question text data in the suspected similar question-answer pair set to obtain first question text data to be deleted;

S612: deleting the first question text data to be deleted from the suspected similar question-answer pair set to obtain the optimized suspected similar question-answer pair set;

S613: finding the answer text data whose attribute quantity is equal to 0 among all the answer text data in the incompletely similar question-answer pair set to obtain first answer text data to be deleted;

S614: deleting the first answer text data to be deleted from the incompletely similar question-answer pair set to obtain the optimized incompletely similar question-answer pair set.

This embodiment deletes the text data whose attribute quantity is equal to 0, which resolves the problem of a standard question that has no intention.

Corresponding to S611, the question text data whose attribute quantity is equal to 0 is found among all the question text data in the suspected similar question-answer pair set, and the question text data found is taken as the first question text data to be deleted.

Corresponding to S612, the first question text data to be deleted is deleted from the suspected similar question-answer pair set, and the set after deletion is taken as the optimized suspected similar question-answer pair set.

Corresponding to S613, the answer text data whose attribute quantity is equal to 0 is found among all the answer text data in the incompletely similar question-answer pair set, and the answer text data found is taken as the first answer text data to be deleted.

Corresponding to S614, the first answer text data to be deleted is deleted from the incompletely similar question-answer pair set, and the set after deletion is taken as the optimized incompletely similar question-answer pair set.
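Steps S611 to S614 amount to filtering out entries that carry no attribute. A minimal sketch follows, assuming that each pair is a dictionary with hypothetical `question_attrs` and `answer_attrs` lists, and that deleting text data with zero attributes means removing the whole pair from the set. Both assumptions go beyond what the application states.

```python
def drop_zero_attribute(pairs, side):
    """Remove pairs whose question or answer text carries no attribute.

    pairs: list of dicts with 'question_attrs' and 'answer_attrs' lists (assumed layout).
    side:  'question' (as in S611/S612) or 'answer' (as in S613/S614).
    """
    if side not in ("question", "answer"):
        raise ValueError("side must be 'question' or 'answer'")
    key = f"{side}_attrs"
    return [pair for pair in pairs if len(pair[key]) > 0]


# Usage: the suspected similar set is filtered on the question side,
# the incompletely similar set on the answer side.
suspected = [
    {"question_attrs": ["fee"], "answer_attrs": ["fee"]},
    {"question_attrs": [], "answer_attrs": ["fee"]},
]
optimized_suspected = drop_zero_attribute(suspected, "question")
```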

Referring to fig. 2, the present application further provides a device for governing the questions and answers of a knowledge base, the device comprising:

the data acquisition module 100, configured to acquire a plurality of question-answer pairs to be governed, where each question-answer pair to be governed includes question text data to be governed and answer text data to be governed;

the entity recognition module 200, configured to input the plurality of question-answer pairs to be governed into an entity recognition model for entity recognition to obtain entity data sets to be deduplicated corresponding to the plurality of question-answer pairs to be governed, where the entity recognition model is a model trained on the basis of a pre-training model bert _ this and a CRF network;

the entity data deduplication processing module 300, configured to perform entity data deduplication processing on the entity data sets to be deduplicated to obtain deduplicated entity data sets corresponding to the plurality of question-answer pairs to be governed;

the entity data alignment processing module 400, configured to perform entity data alignment processing according to the deduplicated entity data sets and the plurality of question-answer pairs to be governed to obtain an entity-aligned question-answer pair set;

the similarity judging module 500, configured to perform similarity judgment on the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set, and a dissimilar question-answer pair set, and to update the dissimilar question-answer pair set into a target knowledge base;

the attribute deduplication processing and attribute value deduplication processing module 600, configured to perform attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set, and to update the deduplicated question-answer pair set into the target knowledge base.

Referring to fig. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data used by the method for governing the questions and answers of the knowledge base. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for governing the questions and answers of the knowledge base.
The method for governing the questions and answers of the knowledge base comprises the following steps: obtaining a plurality of question-answer pairs to be governed, wherein each question-answer pair to be governed includes question text data to be governed and answer text data to be governed; inputting the plurality of question-answer pairs to be governed into an entity recognition model for entity recognition to obtain entity data sets to be deduplicated corresponding to the plurality of question-answer pairs to be governed, wherein the entity recognition model is a model trained on the basis of a pre-training model bert _ this and a CRF network; performing entity data deduplication processing on the entity data sets to be deduplicated to obtain deduplicated entity data sets corresponding to the plurality of question-answer pairs to be governed; performing entity data alignment processing according to the deduplicated entity data sets and the plurality of question-answer pairs to be governed to obtain an entity-aligned question-answer pair set; performing similarity judgment on the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set, and a dissimilar question-answer pair set, and updating the dissimilar question-answer pair set into a target knowledge base; and performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set, and updating the deduplicated question-answer pair set into the target knowledge base.

An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the method for governing the questions and answers of the knowledge base, including the steps of: obtaining a plurality of question-answer pairs to be governed, wherein each question-answer pair to be governed includes question text data to be governed and answer text data to be governed; inputting the plurality of question-answer pairs to be governed into an entity recognition model for entity recognition to obtain entity data sets to be deduplicated corresponding to the plurality of question-answer pairs to be governed, wherein the entity recognition model is a model trained on the basis of a pre-training model bert _ this and a CRF network; performing entity data deduplication processing on the entity data sets to be deduplicated to obtain deduplicated entity data sets corresponding to the plurality of question-answer pairs to be governed; performing entity data alignment processing according to the deduplicated entity data sets and the plurality of question-answer pairs to be governed to obtain an entity-aligned question-answer pair set; performing similarity judgment on the entity-aligned question-answer pair set to obtain a suspected similar question-answer pair set, an incompletely similar question-answer pair set, and a dissimilar question-answer pair set, and updating the dissimilar question-answer pair set into a target knowledge base; and performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set, and updating the deduplicated question-answer pair set into the target knowledge base.

In the method for governing the questions and answers of the knowledge base executed above, the plurality of question-answer pairs to be governed are input into the entity recognition model for entity recognition to obtain the entity data sets to be deduplicated corresponding to the plurality of question-answer pairs to be governed; entity data deduplication processing is performed on the entity data sets to be deduplicated to obtain the deduplicated entity data sets corresponding to the plurality of question-answer pairs to be governed; entity data alignment processing is performed according to the deduplicated entity data sets and the plurality of question-answer pairs to be governed to obtain the entity-aligned question-answer pair set; similarity judgment is performed on the entity-aligned question-answer pair set to obtain the suspected similar question-answer pair set, the incompletely similar question-answer pair set, and the dissimilar question-answer pair set, and the dissimilar question-answer pair set is updated into the target knowledge base; and attribute deduplication processing and attribute value deduplication processing are performed on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain the deduplicated question-answer pair set, which is updated into the target knowledge base. As a result, the technical problems of prior-art knowledge bases do not occur in the target knowledge base: repeated knowledge among standard questions, one standard question having several intentions or none, the number of intentions of a standard question not matching the number of values of the standard answer, and answers that are incomplete or excessive. The quality of the knowledge base is improved, and because no continuous manual participation in the governance process is needed, governance efficiency is improved.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, apparatus, article, or method that includes the element.

The above description is only a preferred embodiment of the present application and is not intended to limit its scope. All equivalent structures or equivalent process transformations made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the present application.

Claims

1. A method for governing the questions and answers of a knowledge base, characterized by comprising the following steps:

2. The method for governing the questions and answers of a knowledge base according to claim 1, wherein before the step of inputting the plurality of question-answer pairs to be governed into an entity recognition model for entity recognition to obtain the entity data sets to be deduplicated corresponding to the plurality of question-answer pairs to be governed, the method further comprises:

3. The method for governing the questions and answers of a knowledge base according to claim 1, wherein the step of performing entity data alignment processing according to the deduplicated entity data sets and the plurality of question-answer pairs to be governed to obtain an entity-aligned question-answer pair set comprises:

4. The method for governing the questions and answers of a knowledge base according to claim 1, wherein the step of performing similarity judgment on the entity-aligned question-answer pair set to obtain the suspected similar question-answer pair set, the incompletely similar question-answer pair set, and the dissimilar question-answer pair set comprises:

5. The method for governing the questions and answers of a knowledge base according to claim 4, wherein the step of performing similarity judgment according to the entity-aligned question-answer pair set, the question similarity matrix, and the answer similarity matrix to obtain the suspected similar question-answer pair set, the incompletely similar question-answer pair set, and the dissimilar question-answer pair set comprises:

acquiring a second similarity threshold;

6. The method for governing the questions and answers of a knowledge base according to claim 1, wherein the step of performing attribute deduplication processing and attribute value deduplication processing on the suspected similar question-answer pair set and the incompletely similar question-answer pair set to obtain a deduplicated question-answer pair set comprises:

7. The method for governing the questions and answers of a knowledge base according to claim 6, wherein the step of searching for and deleting the question text data whose attribute quantity is equal to 0 in the suspected similar question-answer pair set and in the incompletely similar question-answer pair set, respectively, to obtain an optimized suspected similar question-answer pair set and an optimized incompletely similar question-answer pair set comprises:

8. A device for governing the questions and answers of a knowledge base, the device comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.