CN116127044A - System evaluation method and device, electronic equipment and storage medium - Google Patents

System evaluation method and device, electronic equipment and storage medium

Info

Publication number
CN116127044A
Authority
CN
China
Prior art keywords
question
answer
text
score
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310205760.5A
Other languages
Chinese (zh)
Inventor
刘禾子
刘坤
黄子晴
刘凯
丁鑫哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310205760.5A
Publication of CN116127044A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a system evaluation method and device, electronic equipment and a storage medium, relates to the field of computer technology, and in particular to the field of NLP technology. The specific implementation scheme is as follows: acquiring mining results output by a question-answer mining system for input data; determining a first question-answer pair subset meeting a recall condition according to the text similarity between the question-answer pair set and a target question-answer result corresponding to the input data, and determining a recall rate evaluation result corresponding to the question-answer mining system according to the first question-answer pair subset; acquiring, by a question-answer system based on information retrieval, a second score corresponding to any question-answer pair in a second question-answer pair subset; and determining a target score corresponding to any question-answer pair according to the first score and the second score corresponding to that question-answer pair, and determining an accuracy evaluation result corresponding to the question-answer mining system according to the target score of any question-answer pair and the first question-answer pair subset. The present disclosure can thereby improve the accuracy of system evaluation.

Description

System evaluation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to the field of natural language processing (Natural Language Processing, NLP) technology, and more particularly, to a system evaluation method and apparatus, an electronic device, and a storage medium.
Background
With the development of science and technology, deep learning technology has also advanced. Data scarcity has gradually become a bottleneck for the capability of large models, and mining systems have therefore emerged. For example, a mining system may obtain question-answer pairs so that downstream applications can use them. Because recall evaluation of a mining system is mainly performed manually, the evaluation accuracy of the mining system is low.
Disclosure of Invention
The disclosure provides a system evaluation method and device, electronic equipment and a storage medium, and aims to improve accuracy of system evaluation.
According to an aspect of the present disclosure, there is provided a system evaluation method including:
acquiring mining results output by a question-answer mining system for input data, wherein the mining results comprise a question-answer pair set and a first score corresponding to any question-answer pair in the question-answer pair set;
determining a first question-answer pair subset meeting recall conditions according to the text similarity between the question-answer pair set and target question-answer results corresponding to the input data, and determining recall rate evaluation results corresponding to the question-answer mining system according to the first question-answer pair subset;
acquiring a second score corresponding to any question-answer pair in a second question-answer pair subset by adopting a question-answer system based on information retrieval, wherein the second question-answer pair subset is the set of question-answer pairs in the question-answer pair set other than the first question-answer pair subset;
and determining a target score corresponding to any question-answer pair according to the first score corresponding to any question-answer pair and the second score corresponding to any question-answer pair, and determining an accuracy evaluation result corresponding to the question-answer mining system according to the target score corresponding to any question-answer pair and the first question-answer pair subset.
According to another aspect of the present disclosure, there is provided a system evaluation apparatus including:
a result acquisition unit, configured to acquire mining results output by a question-answer mining system for input data, wherein the mining results comprise a question-answer pair set and a first score corresponding to any question-answer pair in the question-answer pair set;
the result determining unit is used for determining a first question-answer pair subset meeting recall conditions according to the text similarity between the question-answer pair set and target question-answer results corresponding to the input data, and determining recall rate evaluation results corresponding to the question-answer mining system according to the first question-answer pair subset;
a score obtaining unit, configured to obtain a second score corresponding to any question-answer pair in a second question-answer pair subset by using a question-answer system based on information retrieval, wherein the second question-answer pair subset is the set of question-answer pairs in the question-answer pair set other than the first question-answer pair subset;
the result determining unit is further configured to determine a target score corresponding to the any question-answer pair according to the first score corresponding to the any question-answer pair and the second score corresponding to the any question-answer pair, and determine an accuracy evaluation result corresponding to the question-answer mining system according to the target score corresponding to the any question-answer pair and the first question-answer pair subset.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the preceding aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of any one of the preceding aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any one of the preceding aspects.
In one or more embodiments of the present disclosure, mining results output by a question-answer mining system for input data are obtained; a first question-answer pair subset meeting a recall condition is determined according to the text similarity between the question-answer pair set and a target question-answer result corresponding to the input data, and a recall rate evaluation result corresponding to the question-answer mining system is determined according to the first question-answer pair subset; a second score corresponding to any question-answer pair in a second question-answer pair subset is acquired by using a question-answer system based on information retrieval; and a target score corresponding to any question-answer pair is determined according to the first score and the second score corresponding to that question-answer pair, and an accuracy evaluation result corresponding to the question-answer mining system is determined according to the target score of any question-answer pair and the first question-answer pair subset. Therefore, the recall rate and the accuracy rate of the question-answer mining system can be evaluated in different ways without manual evaluation, which reduces the inaccuracy of system evaluation caused by manual evaluation and lowers the evaluation cost while improving the accuracy and efficiency of system evaluation.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a system evaluation method according to a first embodiment of the present disclosure;
FIG. 2 is a flow diagram of a system evaluation method according to a second embodiment of the present disclosure;
FIG. 3 is a flow diagram of a similarity determination method according to one embodiment of the present disclosure;
FIG. 4 is a flow diagram of a recall method according to one embodiment of the present disclosure;
FIG. 5 is a flow diagram of ES database creation according to one embodiment of the present disclosure;
FIG. 6 is a flow diagram of a recall method according to one embodiment of the present disclosure;
FIG. 7 (a) is a schematic structural view of a first system evaluation device for implementing the system evaluation method of the embodiment of the present disclosure;
FIG. 7 (b) is a schematic structural diagram of a second system evaluation device for implementing the system evaluation method of the embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device for implementing a system evaluation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure is described in detail below with reference to specific examples.
In a first embodiment, as shown in fig. 1, which is a flow chart of a system evaluation method according to the first embodiment of the present disclosure, the method may be implemented by a computer program and may run on a device that performs system evaluation. The computer program may be integrated into an application or may run as a stand-alone tool application.
The system evaluation device may be an electronic device with image processing capabilities, including but not limited to: an autonomous vehicle, a wearable device, a handheld device, a personal computer, a tablet computer, an in-vehicle device, a smart phone, a computing device, or another processing device connected to a wireless modem, etc. Terminals may be called by different names in different networks, for example: user equipment, an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote terminal, a mobile device, a user terminal, a wireless communication device, a user agent or user equipment, a cellular telephone, a cordless telephone, a personal digital assistant (personal digital assistant, PDA), an electronic device in a fifth generation mobile communication technology (5th Generation Mobile Communication Technology, 5G) network, a fourth generation mobile communication technology (4th Generation Mobile Communication Technology, 4G) network, a third generation mobile communication technology (3rd-Generation, 3G) network, or a future evolution network, and the like.
Specifically, the system evaluation method comprises the following steps:
s101, acquiring mining results output by a question-answer mining system aiming at input data;
according to some embodiments, the question and answer mining system is a system capable of mining input data to obtain question and answer pair parameters corresponding to the input data. The question-answer mining system is not particularly limited to a fixed mining system. For example, when the application scene corresponding to the input data changes, the question-answer mining system may also change accordingly. For example, when the mining parameters corresponding to the question-and-answer mining system change, the question-and-answer mining system may also change accordingly. Wherein the input data may comprise, for example, an input question.
It is easy to understand that the mining result refers to a result corresponding to the input data mined by the question-answer mining system. The mining result comprises a question-answer pair set and a first score corresponding to any question-answer pair in the question-answer pair set. The mining result is not specific to a certain fixed result. For example, when input data changes, the mining result may also change accordingly. For example, when the question-answer mining system changes, the mining results may also change accordingly.
According to some embodiments, the question-answer pair set refers to a collective body formed by gathering at least one question-answer pair corresponding to input data. The question-answer pair set does not refer specifically to a fixed set. For example, when the number of question-answer pairs included in the set of question-answer pairs changes, the set of question-answer pairs may also change accordingly. For example, when a particular question-answer pair included in the set of question-answer pairs changes, the set of question-answer pairs may also change accordingly.
In some embodiments, the first score refers to a score output by the question and answer mining system for question and answer pairs when mining the input data. Wherein a question and answer corresponds to a first score. The first of the first scores is used only to distinguish from the second score. The first score is not specific to a certain fixed score. For example, when the question-answer pair changes, the first score may also change accordingly. For example, when a specific score value corresponding to the first score changes, the first score may also change accordingly.
According to some embodiments, when the system evaluation method is performed, the mining result output by the question-answer mining system for mining the input data may be acquired.
S102, determining a first question-answer pair subset meeting recall conditions according to text similarity between a question-answer pair set and target question-answer results corresponding to input data, and determining recall rate evaluation results corresponding to a question-answer mining system according to the first question-answer pair subset;
According to some embodiments, the target question-answer result may be, for example, a manually labeled result, or a result obtained in advance for the input data. The target question-answer result corresponds to the input data, that is, different input data corresponds to different target question-answer results. Or when the manual annotation changes, the target question-answering result can also change correspondingly.
In some embodiments, text similarity is used to indicate similarity between the set of question-answer pairs and the target question-answer result. For example, the similarity between any question in the question-answer pair set and the target question-answer result can be obtained. The text similarity is not particularly limited to a certain fixed similarity. For example, when a set of question-answer pairs or a target question-answer result changes, the text similarity may also change accordingly. For example, when the similarity obtaining manner changes, the text similarity may also change accordingly.
It is readily understood that recall conditions refer to conditions used to determine whether to recall a question-answer pair. The recall condition is not specific to a particular fixed condition. For example, when a modification instruction for a recall condition is received, the recall condition may also change accordingly. The recall condition may be, for example, that the text similarity is greater than a similarity threshold.
Optionally, the first question-answer pair subset refers to a collection of at least one question-answer pair that has been recalled. The first question-answer pair subset does not refer to a certain fixed subset. For example, when the number of question-answer pairs corresponding to the first question-answer pair subset changes, the first question-answer pair subset may also change accordingly. For example, when a particular question-answer pair included in the first question-answer pair subset changes, the first question-answer pair subset may also change accordingly.
In some embodiments, the recall evaluation result refers to the result of a recall evaluation performed on the question and answer mining system. The recall evaluation result may be determined from the first question-answer pair subset. Specifically, recall rate evaluation results are determined according to the first question-answer pair subset and the question-answer pair set.
Optionally, when the question-answer pair set is obtained, a first question-answer pair subset meeting recall conditions can be determined according to the text similarity between the question-answer pair set and target question-answer results corresponding to the input data, and recall rate evaluation results corresponding to the question-answer mining system can be determined according to the first question-answer pair subset.
S103, acquiring a second score corresponding to any question-answer pair in a second question-answer pair subset by adopting a question-answer system based on information retrieval;
In some embodiments, the question-answer system based on information retrieval (Information Retrieval Question Answering, IRQA) refers to a system that, given a collection of collated question-answer pairs (QA pairs), understands the user's question, finds the question-answer pair whose question is semantically equivalent or close to the user's question, and returns the answer A of that question-answer pair as the answer to the user's question.
Optionally, the second question-answer pair subset is the set of question-answer pairs in the question-answer pair set other than the first question-answer pair subset. The second question-answer pair subset may be, for example, a collection of at least one question-answer pair that is not recalled. The second question-answer pair subset does not refer to a certain fixed subset. For example, when the number of question-answer pairs corresponding to the second question-answer pair subset changes, the second question-answer pair subset may also change accordingly. For example, when a particular question-answer pair included in the second question-answer pair subset changes, the second question-answer pair subset may also change accordingly.
In one embodiment, the second score refers to the score output by the IRQA system when retrieving any question-answer pair. When any question-answer pair changes, the second score may also change accordingly.
It is readily understood that a question-answer system based on information retrieval may be employed to obtain a second score for any question-answer pair in the second question-answer pair subset.
S104, determining a target score corresponding to any question-answer pair according to the first score corresponding to any question-answer pair and the second score corresponding to any question-answer pair, and determining an accuracy evaluation result corresponding to the question-answer mining system according to the target score corresponding to any question-answer pair and the first question-answer pair subset.
According to some embodiments, the target score refers to a score corresponding to any question-answer pair in the second question-answer pair subset. The target score may be determined from the first score and the second score. For example, when the first score or the second score changes, the target score may also change accordingly.
It is easy to understand that the target score corresponding to any question-answer pair can be determined according to the first score corresponding to any question-answer pair and the second score corresponding to any question-answer pair, and the accuracy evaluation result corresponding to the question-answer mining system can be determined according to the target score corresponding to any question-answer pair and the first question-answer pair subset.
In one or more embodiments of the present disclosure, mining results output by a question-answer mining system for input data are obtained; a first question-answer pair subset meeting the recall condition is determined according to the text similarity between the question-answer pair set and the target question-answer result corresponding to the input data, and a recall rate evaluation result corresponding to the question-answer mining system is determined according to the first question-answer pair subset; a second score corresponding to any question-answer pair in a second question-answer pair subset is acquired by using a question-answer system based on information retrieval; and a target score corresponding to any question-answer pair is determined according to the first score and the second score corresponding to that question-answer pair, and an accuracy evaluation result corresponding to the question-answer mining system is determined according to the target score of any question-answer pair and the first question-answer pair subset. Therefore, the recall rate and the accuracy rate of the question-answer mining system can be evaluated in different ways without manual evaluation, which reduces the inaccuracy of system evaluation caused by manual evaluation and lowers the evaluation cost while improving the accuracy and efficiency of system evaluation. Moreover, when the accuracy rate of the system is evaluated, the cost and time of manual evaluation are reduced, which facilitates iterative optimization of the model system and indirectly accelerates the deployment of the question-answer mining strategy.
Referring to fig. 2, fig. 2 is a flowchart of a system evaluation method according to a second embodiment of the disclosure. In particular, the method comprises the steps of,
s201, acquiring mining results output by a question-answer mining system aiming at input data;
the mining result comprises a question-answer pair set and a first score corresponding to any question-answer pair in the question-answer pair set.
The specific process is as described above, and will not be described here again.
S202, obtaining answer similarity between any question-answer pair in a question-answer pair set and a target question-answer result corresponding to input data;
the specific process is as described above, and will not be described here again.
It is easy to understand that the target question-answer result may be, for example, a target question-answer pair.
In some embodiments, answer similarity is used to indicate the similarity between answers. Different question-answer pairs may have different answer similarities with the target question-answer result, and the manner in which the answer similarity is obtained may also vary accordingly.
According to some embodiments, obtaining the answer similarity between any question-answer pair in the question-answer pair set and the target question-answer result corresponding to the input data includes: cutting the answer text in any question-answer pair and the answer text in the target question-answer result corresponding to the input data respectively to obtain a first answer text and a second answer text, wherein the text length of the first answer text is smaller than that of the second answer text; determining the number of overlapping text fragments of the first answer text on the second answer text; and determining the answer similarity between any question-answer pair in the question-answer pair set and the target question-answer result according to the overlap count and the text length of the second answer text. Determining the answer similarity from the number of overlapping text fragments can improve the accuracy of the answer similarity, which in turn improves the accuracy of the first question-answer pair subset and therefore the accuracy of the recall rate evaluation. In addition, when the answer texts are long, this avoids the time cost of determining the answer similarity with a deep neural network and improves the efficiency of the determination.
In some embodiments, the first answer text and the second answer text do not refer to specific fixed answer texts. In the embodiments of the disclosure, after the answer text in any question-answer pair and the answer text in the target question-answer result corresponding to the input data are cut, the shorter of the two cut answer texts is taken as the first answer text and the longer one as the second answer text.
It is easy to understand that after the answer text in any question-answer pair and the answer text in the target question-answer result corresponding to the input data are cut, the cut answer text corresponding to any question-answer pair and the cut answer text corresponding to the target question-answer result are obtained. For example, when the text length of the cut answer text corresponding to any question-answer pair is smaller than that of the cut answer text corresponding to the target question-answer result, the first answer text is the cut answer text corresponding to any question-answer pair, and the second answer text is the cut answer text corresponding to the target question-answer result.
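To make the fragment-overlap idea concrete, here is a minimal Python sketch; the fixed fragment length, the character-level windows, and the normalization by the length of the second answer text are illustrative assumptions rather than details fixed by the disclosure.

```python
def answer_similarity(answer_a: str, answer_b: str, frag_len: int = 4) -> float:
    """Estimate answer similarity from overlapping text fragments.

    Sketch only: the shorter cut text is treated as the first answer text,
    the longer one as the second; fragments are fixed-length character
    windows; the overlap count is normalized by the length of the second
    answer text (all assumptions, not mandated by the disclosure)."""
    first, second = sorted((answer_a.strip(), answer_b.strip()), key=len)
    if not second:
        return 0.0
    # Count fragments of the first (shorter) text that also occur in the second.
    overlap = 0
    for i in range(max(len(first) - frag_len + 1, 1)):
        fragment = first[i:i + frag_len]
        if fragment and fragment in second:
            overlap += 1
    # Normalize by the text length of the second (longer) answer text.
    return overlap / len(second)


if __name__ == "__main__":
    print(answer_similarity("the capital of France is Paris",
                            "Paris is the capital city of France"))
```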
S203, obtaining the similarity of the questions between any question-answer pair and the target question-answer result;
The specific process is as described above, and will not be described here again.
In some embodiments, question similarity is used to indicate the similarity between questions. Different question-answer pairs may have different question similarities with the target question-answer result, and the manner in which the question similarity is obtained may also vary accordingly.
According to some embodiments, obtaining the question similarity between any question-answer pair and the target question-answer result includes: acquiring a first text feature vector corresponding to the first question text in any question-answer pair; acquiring a second text feature vector corresponding to the second question text in the target question-answer result; determining the distance in space between the first text feature vector and the second text feature vector by means of cosine distance; and determining the question similarity between any question-answer pair and the target question-answer result according to the distance. Determining the question similarity from text feature vectors can improve the accuracy of the question similarity, which in turn improves the accuracy of the first question-answer pair subset and therefore the accuracy of the recall rate evaluation.
According to some embodiments, the question similarity may be determined, for example, by means of semantic computation. For example, a first text feature vector corresponding to the first question text in any question-answer pair is obtained, and a second text feature vector corresponding to the second question text in the target question-answer result is obtained. A text feature vector is data that the model can identify. The first text feature vector and the second text feature vector may be input into a deep learning network model to represent them in a higher dimension. The deep learning network model may be, for example, a pre-trained language model.
It is readily understood that, when determining the question similarity between any question-answer pair and the target question-answer result according to the distance, the closer the distance, the higher the question similarity.
Optionally, fig. 3 is a flow chart of a similarity determination method according to an embodiment of the disclosure. Question text 1 may be, for example, a first question text and question text 2 may be, for example, a second question text. When the deep learning network model is adopted to calculate the problem similarity, for example, text characterization can be performed on the problem text 1 and the problem text 2 at an input layer, higher-dimension representation can be performed on the characterized text through a complex neural network at a representation layer, and similarity calculation can be performed at a matching layer.
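A minimal sketch of the cosine-similarity step follows; the `encode` helper is a hypothetical stand-in for the pre-trained language model of the representation layer (a hashing trick is used only so the example runs), and the vector dimension is arbitrary.

```python
import hashlib
import numpy as np


def encode(text: str, dim: int = 128) -> np.ndarray:
    """Placeholder text feature vector; a real system would use a
    pre-trained language model here."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    return vec


def question_similarity(question_a: str, question_b: str) -> float:
    """Cosine similarity of the two question feature vectors; a smaller
    cosine distance (larger similarity) means the questions are closer
    in the embedding space."""
    va, vb = encode(question_a), encode(question_b)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0
```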
S204, determining the text similarity between any question-answer pair and a target question-answer result according to the answer similarity and the question similarity;
according to some embodiments, when the answer similarity and the question similarity are obtained, the text similarity between any question-answer pair and the target question-answer result can be determined according to the answer similarity and the question similarity.
The answer similarity and the question similarity may correspond to different weights, or the mean of the answer similarity and the question similarity may be used as the text similarity. The embodiments of the present disclosure are not limited in this regard.
S205, adding any question-answer pair to the first question-answer pair subset under the condition that the text similarity between any question-answer pair and the target question-answer result is larger than a similarity threshold;
in some embodiments, a similarity threshold is used to determine whether to recall the any question-answer pair, and the similarity threshold is not specific to a fixed threshold. The similarity threshold may be modified, for example, when a modification instruction for the similarity threshold is received.
Alternatively, in the case where the text similarity between any question-answer pair and the target question-answer result is greater than the similarity threshold, any question-answer pair may be recalled, for example, any question-answer pair may be added to the first question-answer pair subset.
In some embodiments, in the case where the text similarity between any question-answer pair and the target question-answer result is not greater than the similarity threshold, the question-answer pair is not recalled; for example, it may be added to the second question-answer pair subset.
Alternatively, FIG. 4 is a flow diagram of a recall method according to one embodiment of the present disclosure. For example, the input data may be subjected to question-answer mining, resulting in mining results that include questions/answers/first scores. And carrying out matching calculation on the target question-answer pairs and the mining results to obtain answer similarity and question similarity, further obtaining text similarity, and determining whether any question-answer pair is recalled according to the text similarity.
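Putting the two similarities together, the recall decision sketched in fig. 4 might look roughly as follows; the equal weights, the 0.5 threshold, the dictionary layout of the mined pairs, and the reuse of the hypothetical answer_similarity / question_similarity helpers from the sketches above are all illustrative assumptions.

```python
def text_similarity(answer_sim: float, question_sim: float,
                    w_answer: float = 0.5, w_question: float = 0.5) -> float:
    # Weighted combination; with equal weights this is simply the mean.
    return w_answer * answer_sim + w_question * question_sim


def split_by_recall(qa_pairs, target_qa, sim_threshold: float = 0.5):
    """Partition mined question-answer pairs into the first (recalled) and
    second (not recalled) subsets using the helpers sketched above."""
    first_subset, second_subset = [], []
    for qa in qa_pairs:
        sim = text_similarity(
            answer_similarity(qa["answer"], target_qa["answer"]),
            question_similarity(qa["question"], target_qa["question"]),
        )
        # Recall condition: text similarity greater than the threshold.
        (first_subset if sim > sim_threshold else second_subset).append(qa)
    return first_subset, second_subset
```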
S206, determining recall rate evaluation results corresponding to the question-answer mining system according to the first question-answer subset;
the specific process is as described above, and will not be described here again.
According to some embodiments, the calculation formula of the recall may be, for example, as shown in formula (one):
recall rate = (number of mined question-answer pairs that hit target question-answer pairs) / (number of target question-answer pairs)    (one)
It is to be readily appreciated that, in the embodiments of the present disclosure, the number of mined question-answer pairs hitting target question-answer pairs may be, for example, the number of question-answer pairs included in the first question-answer pair subset, and the number of target question-answer pairs may be, for example, the number of question-answer pairs corresponding to the target question-answer result.
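Read this way, the recall computation of formula (one) reduces to a simple ratio; the sketch below is illustrative only and the variable names are hypothetical.

```python
def recall_rate(num_hits: int, num_target_pairs: int) -> float:
    """num_hits: mined question-answer pairs that hit a target pair
    (e.g. the size of the first question-answer pair subset);
    num_target_pairs: the number of target question-answer pairs."""
    return num_hits / num_target_pairs if num_target_pairs else 0.0


# Example: 8 of 10 target pairs were recovered by the mining system.
print(recall_rate(8, 10))  # 0.8
```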
S207, acquiring a second score corresponding to any question-answer pair in the second question-answer pair subset by adopting a question-answer system based on information retrieval;
wherein the second question-answer pair subset is the set of question-answer pairs in the question-answer pair set other than the first question-answer pair subset.
The specific process is as described above, and will not be described here again.
According to some embodiments, the question-answer system based on information retrieval includes an ES database, a fine-ranking model and a reading understanding model, and obtaining, by using the question-answer system based on information retrieval, the second score corresponding to any question-answer pair in the second question-answer pair subset includes: searching the ES database for the question text in any question-answer pair in the second question-answer pair subset to obtain at least one answer text paragraph corresponding to the question text; screening the at least one answer text paragraph with the fine-ranking model to obtain a preset number of answer text paragraphs corresponding to the question text; determining a ranking score corresponding to any answer text paragraph based on a first similarity between the question text and that answer text paragraph among the preset number of answer text paragraphs; inputting the question text and any answer text paragraph into the reading understanding model, obtaining a second answer text corresponding to the question text in that answer text paragraph, and obtaining a reading understanding score corresponding to the second answer text; determining a similarity score corresponding to any answer text paragraph according to a second similarity between the first answer text and the second answer text; and determining the second score corresponding to any question-answer pair according to the ranking score, the reading understanding score and the similarity score. Therefore, the accuracy evaluation does not need to be carried out manually, which reduces inaccurate accuracy evaluation and improves the accuracy of the accuracy evaluation.
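As an illustration of the retrieval and screening steps only, the sketch below queries an Elasticsearch index for candidate paragraphs and keeps the top paragraphs under a placeholder fine-ranking scorer; the index name, field name, `rerank_score` function, and the elasticsearch-py 8.x call style are assumptions, not details taken from the disclosure.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a running ES instance


def rerank_score(question: str, paragraph: str) -> float:
    """Placeholder for the fine-ranking model; a real system would score
    question/paragraph relevance with a trained ranking model."""
    q_tokens = set(question.lower().split())
    p_tokens = set(paragraph.lower().split())
    return len(q_tokens & p_tokens) / (len(q_tokens) or 1)


def retrieve_paragraphs(question: str, top_k: int = 3) -> list[dict]:
    """Full-text search for candidate answer paragraphs, then keep the
    preset number (top_k) under the fine-ranking score."""
    resp = es.search(index="qa_paragraphs",
                     query={"match": {"content": question}}, size=20)
    paragraphs = [hit["_source"]["content"] for hit in resp["hits"]["hits"]]
    scored = sorted(((rerank_score(question, p), p) for p in paragraphs),
                    key=lambda x: x[0], reverse=True)
    return [{"paragraph": p, "ranking_score": s} for s, p in scored[:top_k]]
```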
According to some embodiments, ES (Elasticsearch) is a non-relational, distributed full-text retrieval framework. Because ES uses an inverted-index storage scheme, it is suitable for complex retrieval and full-text retrieval scenarios. Therefore, the accuracy of paragraph retrieval can be improved through the ES database.
According to some embodiments, determining a corresponding second score for any question-answer pair according to the ranking score, the reading understanding score, and the similarity score includes: acquiring a first weight corresponding to the sorting score; acquiring a second weight corresponding to the reading understanding score; obtaining a third weight corresponding to the similarity score; and determining a second score corresponding to any question-answer pair according to the ranking score, the first weight, the reading understanding score, the second weight, the similarity score and the third weight. Therefore, different weights are set for different scores, so that the matching property of the second score and input data can be improved, the accuracy of determining the second score can be improved, and the accuracy of determining an evaluation result can be improved.
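A minimal sketch of that weighted combination follows; the concrete weight values are assumptions, since the disclosure does not fix them.

```python
def second_score(ranking_score: float, reading_score: float,
                 similarity_score: float,
                 w1: float = 0.3, w2: float = 0.4, w3: float = 0.3) -> float:
    # The three weights are illustrative; in practice they would be tuned
    # for the input data or application scene.
    return w1 * ranking_score + w2 * reading_score + w3 * similarity_score
```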
According to some embodiments, the reading understanding model is a pre-trained extractive model. Given a paragraph and a question, the reading understanding model can predict the start position and end position of the answer in the paragraph, output the two indices with the maximum probability values, and then select the text within the range of these indices as the final answer text.
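As a toy illustration of this start/end prediction (not the pre-trained model itself), one can pick the highest-probability valid span given per-token start and end probabilities; the probability arrays below are made up for the example.

```python
import numpy as np


def extract_answer(tokens, start_probs, end_probs, max_answer_len=30):
    """Pick the start/end indices with the highest joint probability
    (start <= end, span length bounded) and return the answer text plus a
    reading-understanding score; the probabilities stand in for the output
    of a pre-trained extractive model."""
    best = (0.0, 0, 0)
    for s, p_s in enumerate(start_probs):
        for e in range(s, min(s + max_answer_len, len(end_probs))):
            score = p_s * end_probs[e]
            if score > best[0]:
                best = (score, s, e)
    score, s, e = best
    return " ".join(tokens[s:e + 1]), float(score)


tokens = "the evaluation method improves accuracy".split()
start = np.array([0.1, 0.2, 0.5, 0.1, 0.1])
end = np.array([0.05, 0.1, 0.2, 0.15, 0.5])
print(extract_answer(tokens, start, end))  # ('method improves accuracy', 0.25)
```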
According to some embodiments, the text format of the full text may be converted into a target format, resulting in the full text in the target format; segmenting the full text in the target format to obtain at least one text paragraph in the target format; at least one text paragraph in the target format is added to the ES database. Fig. 5 is a flow diagram of ES database creation according to one embodiment of the present disclosure. For example, the input document may be parsed and the document segments corresponding to the input document may be stored in the ES database.
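A sketch of that indexing flow, assuming an elasticsearch-py 8.x client, a hypothetical index/field layout, and blank-line paragraph segmentation; the disclosure does not specify these details.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a running ES instance


def to_target_format(raw_text: str) -> str:
    """Placeholder format conversion (e.g. normalizing line endings);
    a real parser would handle PDF/HTML documents here."""
    return raw_text.replace("\r\n", "\n")


def segment_paragraphs(text: str) -> list[str]:
    # Split on blank lines; the real segmentation rule is not disclosed.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def index_document(raw_text: str, index: str = "qa_paragraphs") -> None:
    for i, paragraph in enumerate(segment_paragraphs(to_target_format(raw_text))):
        es.index(index=index, id=i, document={"content": paragraph})
```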
S208, determining a target score corresponding to any question-answer pair according to the first score corresponding to any question-answer pair and the second score corresponding to any question-answer pair, and determining an accuracy evaluation result corresponding to the question-answer mining system according to the target score corresponding to any question-answer pair and the first question-answer pair subset.
The specific process is as described above, and will not be described here again.
According to some embodiments, when determining the accuracy evaluation result corresponding to the question-answer mining system according to the target score corresponding to any question-answer pair and the first question-answer pair subset, the question-answer pairs in the first question-answer pair subset may, for example, be directly determined to be correct. Whether any question-answer pair in the second question-answer pair subset is correct needs to be determined according to its corresponding target score. For example, when the target score corresponding to any question-answer pair is greater than a score threshold, that question-answer pair is determined to be correct, and when the target score is less than the score threshold, that question-answer pair is determined to be incorrect.
According to some embodiments, the target score corresponding to any question-answer pair is determined according to the first score corresponding to any question-answer pair and the second score corresponding to any question-answer pair, for example, a mean value of the first score corresponding to any question-answer pair and the second score corresponding to any question-answer pair may be used as the target score corresponding to any question-answer pair.
According to some embodiments, the accuracy assessment result may be determined, for example, according to equation (two):
accuracy rate = (number of correctly mined question-answer pairs) / (number of mined question-answer pairs)    (two)
It is to be readily appreciated that, in the embodiments of the present disclosure, the number of correctly mined question-answer pairs may be, for example, the sum of the number of question-answer pairs included in the first question-answer pair subset and the number of correct question-answer pairs in the second question-answer pair subset, and the number of mined question-answer pairs may be, for example, the number of question-answer pairs corresponding to the question-answer pair set.
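Combining the pieces, the accuracy evaluation of formula (two) could be sketched as below; the mean-based target score and the 0.5 score threshold are illustrative assumptions.

```python
def accuracy_rate(first_subset, second_subset_scored, score_threshold=0.5):
    """Sketch of the accuracy evaluation: pairs in the first (recalled)
    subset are treated as correct directly; each remaining mined pair is
    judged by a target score, taken here as the mean of its first score
    (from the mining system) and second score (from the IRQA system)."""
    correct = len(first_subset)
    for s1, s2 in second_subset_scored:  # (first score, second score) per pair
        target_score = (s1 + s2) / 2
        if target_score > score_threshold:
            correct += 1
    total_mined = len(first_subset) + len(second_subset_scored)
    return correct / total_mined if total_mined else 0.0


# Example: 6 recalled pairs plus 4 non-recalled pairs, two of which score
# above the (assumed) 0.5 threshold, giving 8 / 10 = 0.8 accuracy.
print(accuracy_rate(range(6), [(0.9, 0.8), (0.7, 0.6), (0.2, 0.3), (0.4, 0.1)]))
```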
According to some embodiments, fig. 6 is a flow diagram of a recall method according to one embodiment of the present disclosure. For example, on the basis of fig. 3, the IRQA question-answer service may be requested one by one for each question in the second question-answer pair subset to obtain a ranking score, a reading understanding score and a similarity score. The first question-answer pair subset is directly determined to be correct, while for the second question-answer pair subset it is necessary to determine whether the mining result is correct based on the ranking score, the reading understanding score and the similarity score.
In one or more embodiments of the present disclosure, mining results output by a question-answer mining system for input data are obtained; the answer similarity between any question-answer pair in the question-answer pair set and the target question-answer result corresponding to the input data is obtained; the question similarity between any question-answer pair and the target question-answer result is obtained; the text similarity between any question-answer pair and the target question-answer result is determined according to the answer similarity and the question similarity; and in the case where the text similarity between any question-answer pair and the target question-answer result is greater than the similarity threshold, that question-answer pair is added to the first question-answer pair subset, and the recall rate evaluation result corresponding to the question-answer mining system is determined according to the first question-answer pair subset. In this way, the recall rate evaluation result is determined using both the answer similarity and the question similarity, which reduces the inaccuracy that arises when recall is decided based on literal similarity alone, and can therefore improve the accuracy of the recall rate evaluation result.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Referring to fig. 7 (a), a schematic diagram of a system evaluation device for implementing the system evaluation method according to the embodiment of the present disclosure is shown. The system evaluation device may be implemented as all or part of the device by software, hardware or a combination of both. The system evaluation device 700 includes a result acquisition unit 701, a result determination unit 702, and a score acquisition unit 703, wherein:
a result obtaining unit 701, configured to obtain an mining result output by the question-answer mining system for mining input data, where the mining result includes a question-answer pair set and a first score corresponding to any question-answer pair in the question-answer pair set;
a result determining unit 702, configured to determine a first question-answer pair subset that meets a recall condition according to a text similarity between the question-answer pair set and a target question-answer result corresponding to the input data, and determine a recall rate evaluation result corresponding to the question-answer mining system according to the first question-answer pair subset;
a score acquisition unit 703, configured to acquire a second score corresponding to any question-answer pair in a second question-answer pair subset by using a question-answer system based on information retrieval, wherein the second question-answer pair subset is the set of question-answer pairs in the question-answer pair set other than the first question-answer pair subset;
The result determining unit 702 is further configured to determine a target score corresponding to any question-answer pair according to the first score corresponding to any question-answer pair and the second score corresponding to any question-answer pair, and determine an accuracy evaluation result corresponding to the question-answer mining system according to the target score corresponding to any question-answer pair and the first question-answer pair subset.
According to some embodiments, the result determining unit 702 is configured to determine, according to the text similarity between the question-answer pair set and the target question-answer result corresponding to the input data, a first subset of question-answer pairs that meets the recall condition, specifically configured to:
obtaining answer similarity between any question-answer pair in the question-answer pair set and a target question-answer result corresponding to the input data;
obtaining the similarity of questions between any question-answer pair and a target question-answer result;
according to the answer similarity and the question similarity, determining the text similarity between any question-answer pair and the target question-answer result;
in the event that the text similarity between any question-answer pair and the target question-answer result is greater than a similarity threshold, adding any question-answer pair to the first subset of question-answer pairs.
According to some embodiments, the result determining unit 702 is configured to, when obtaining answer similarity between any question-answer pair in the question-answer pair set and a target question-answer result corresponding to the input data, specifically:
Cutting answer texts in any question-answer pair and answer texts in a target question-answer result corresponding to the input data respectively to obtain a first answer text and a second answer text, wherein the text length of the first answer text is smaller than that of the second answer text;
determining the number of overlapping text fragments of the first answer text on the second answer text;
and determining the answer similarity between any question-answer pair in the question-answer pair set and the target question-answer result according to the coincident number and the text length of the second answer text.
According to some embodiments, the result determining unit 702 is configured to, when obtaining the similarity of the questions between any question-answer pair and the target question-answer result, specifically:
acquiring a first text feature vector corresponding to a first question text in any question-answer pair;
acquiring a second text feature vector corresponding to any question-answer pair in the target question-answer result;
determining the distance between the first text feature vector and the second text feature vector in space by adopting a cosine distance determination mode;
and determining the similarity of the questions between any question-answer pair and the target question-answer result according to the distance.
According to some embodiments, the question-answering system based on information retrieval includes an ES database, a fine-ranking model and a reading understanding model, and the score obtaining unit 703 is specifically configured to, when obtaining the second score corresponding to any question-answering pair in the second question-answering pair set by using the question-answering system based on information retrieval:
searching the ES database for the question text in any question-answer pair in the second question-answer pair subset to obtain at least one answer text paragraph corresponding to the question text;
screening at least one answer text paragraph by adopting a fine-ranking model to obtain a preset number of answer text paragraphs corresponding to the question text;
determining a ranking score corresponding to any answer text paragraph based on a first similarity between the question text and any answer text paragraph in a preset number of answer text paragraphs;
inputting the question text and any answer text paragraph into a reading understanding model, acquiring a second answer text corresponding to the question text in any answer text paragraph, and acquiring a reading understanding score corresponding to the second answer text;
determining a similarity score corresponding to any answer text paragraph according to the second similarity of the first answer text and the second answer text;
and determining a second score corresponding to any question-answer pair according to the ranking score, the reading understanding score and the similarity score.
According to some embodiments, the score obtaining unit 703 is configured to, when determining the second score corresponding to any question-answer pair according to the ranking score, the reading understanding score and the similarity score, specifically:
Acquiring a first weight corresponding to the sorting score;
acquiring a second weight corresponding to the reading understanding score;
obtaining a third weight corresponding to the similarity score;
and determining a second score corresponding to any question-answer pair according to the ranking score, the first weight, the reading understanding score, the second weight, the similarity score and the third weight.
Fig. 7 (b) is a schematic structural diagram of a second system evaluation device for implementing the system evaluation method according to the embodiments of the present disclosure, and as shown in fig. 7 (b), the device 700 further includes a paragraph adding unit 704 for:
converting the text format of the full text into a target format to obtain the full text of the target format;
segmenting the full text in the target format to obtain at least one text paragraph in the target format;
at least one text paragraph in the target format is added to the ES database.
It should be noted that, when the system evaluation device provided in the foregoing embodiment executes the system evaluation method, the division into the above functional modules is merely an example; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the system evaluation device and the system evaluation method provided in the foregoing embodiments belong to the same concept; the detailed implementation process is described in the method embodiments and is not repeated here.
The foregoing embodiment numbers of the present disclosure are merely for description and do not represent advantages or disadvantages of the embodiments.
In summary, in the device provided in the embodiments of the present disclosure, the result acquisition unit is configured to acquire mining results output by the question-answer mining system for input data, wherein the mining results comprise a question-answer pair set and a first score corresponding to any question-answer pair in the question-answer pair set; the result determining unit is configured to determine a first question-answer pair subset meeting the recall condition according to the text similarity between the question-answer pair set and the target question-answer result corresponding to the input data, and determine a recall rate evaluation result corresponding to the question-answer mining system according to the first question-answer pair subset; the score acquisition unit is configured to acquire a second score corresponding to any question-answer pair in a second question-answer pair subset by using a question-answer system based on information retrieval, wherein the second question-answer pair subset is the set of question-answer pairs in the question-answer pair set other than the first question-answer pair subset; and the result determining unit is further configured to determine a target score corresponding to any question-answer pair according to the first score and the second score corresponding to that question-answer pair, and determine an accuracy evaluation result corresponding to the question-answer mining system according to the target score of any question-answer pair and the first question-answer pair subset. Therefore, the recall rate and the accuracy rate of the question-answer mining system can be evaluated in different ways without manual evaluation, which reduces the inaccuracy of system evaluation caused by manual evaluation and lowers the evaluation cost while improving the accuracy and efficiency of system evaluation.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related user personal information all conform to the regulations of related laws and regulations, and the public sequence is not violated.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Wherein the components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in the electronic device are connected to the I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic device to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as a system evaluation method. For example, in some embodiments, the system evaluation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device via the ROM 802 and/or the communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the system evaluation method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the system evaluation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or electronic device.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data electronic device), or that includes a middleware component (e.g., an application electronic device), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and an electronic device. The client and the electronic device are generally remote from each other and typically interact through a communication network. The relationship between the client and the electronic device arises by virtue of computer programs that run on the respective computers and have a client-electronic-device relationship with each other. The electronic device may be a cloud electronic device, also called a cloud computing electronic device or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility in traditional physical hosts and VPS (Virtual Private Server) services. The electronic device may also be an electronic device of a distributed system or an electronic device combined with a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (17)

1. A system evaluation method, comprising:
acquiring a mining result output by a question-answer mining system for input data, wherein the mining result comprises a question-answer pair set and a first score corresponding to any question-answer pair in the question-answer pair set;
determining a first question-answer pair subset meeting recall conditions according to the text similarity between the question-answer pair set and target question-answer results corresponding to the input data, and determining recall rate evaluation results corresponding to the question-answer mining system according to the first question-answer pair subset;
acquiring a second score corresponding to any question-answer pair in a second question-answer pair subset by adopting a question-answer system based on information retrieval, wherein the second question-answer pair subset is the set of question-answer pairs in the question-answer pair set other than the first question-answer pair subset;
and determining a target score corresponding to any question-answer pair according to the first score corresponding to any question-answer pair and the second score corresponding to any question-answer pair, and determining an accuracy evaluation result corresponding to the question-answer mining system according to the target score corresponding to any question-answer pair and the first question-answer pair subset.
2. The method of claim 1, wherein the determining a first subset of question-answer pairs that satisfy a recall condition according to a text similarity between the set of question-answer pairs and target question-answer results corresponding to the input data comprises:
obtaining answer similarity between any question-answer pair in the question-answer pair set and a target question-answer result corresponding to the input data;
obtaining the similarity of the questions between any question-answer pair and the target question-answer result;
determining the text similarity between any question-answer pair and the target question-answer result according to the answer similarity and the question similarity;
and adding any question-answer pair to the first question-answer pair subset under the condition that the text similarity between the any question-answer pair and the target question-answer result is larger than a similarity threshold value.
3. The method of claim 2, wherein the obtaining answer similarity between any question-answer pair in the set of question-answer pairs and a target question-answer result corresponding to the input data comprises:
cutting answer texts in any question-answer pair and answer texts in a target question-answer result corresponding to the input data respectively to obtain a first answer text and a second answer text, wherein the text length of the first answer text is smaller than that of the second answer text;
determining the number of overlapping text fragments of the first answer text on the second answer text;
and determining the answer similarity between any question-answer pair in the question-answer pair set and the target question-answer result according to the number of overlapping text fragments and the text length of the second answer text.
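For illustration only, one way to read claim 3 is to count fixed-length character fragments of the shorter (first) answer text that also occur in the longer (second) answer text, normalized by the length of the second answer text. The fragment length n is an assumption; the claim does not specify how fragments are formed. A minimal Python sketch:

```python
def answer_similarity(answer_a: str, answer_b: str, n: int = 4) -> float:
    """Overlap-based answer similarity (one reading of claim 3)."""
    # The shorter text acts as the first answer text, the longer one as the second.
    first, second = sorted((answer_a, answer_b), key=len)
    if not second:
        return 0.0
    # Character n-gram fragments of the first answer text (fragment size is illustrative).
    fragments = {first[i:i + n] for i in range(max(len(first) - n + 1, 1))}
    # Number of fragments that also occur in the second answer text.
    overlap = sum(1 for frag in fragments if frag and frag in second)
    # Normalize by the text length of the second answer text.
    return overlap / len(second)
```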
4. The method of claim 2, wherein the obtaining the similarity of questions between the any question-answer pair and the target question-answer result comprises:
acquiring a first text feature vector corresponding to a first question text in any question-answer pair;
acquiring a second text feature vector corresponding to a second question text in the target question-answer result;
determining the distance between the first text feature vector and the second text feature vector in space by adopting a cosine distance determination mode;
and determining the similarity of the questions between any question and answer pair and the target question and answer result according to the distance.
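For illustration only, a minimal sketch of the cosine-based question similarity in claim 4, assuming the first and second text feature vectors have already been produced by some text encoder (the encoder is not part of the claim and is not shown):

```python
import math
from typing import Sequence


def question_similarity(first_vec: Sequence[float], second_vec: Sequence[float]) -> float:
    """Question similarity derived from the cosine distance between two feature vectors."""
    dot = sum(x * y for x, y in zip(first_vec, second_vec))
    norm_a = math.sqrt(sum(x * x for x in first_vec))
    norm_b = math.sqrt(sum(y * y for y in second_vec))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    cosine_similarity = dot / (norm_a * norm_b)
    # Cosine distance is 1 - cosine similarity; a smaller distance means a higher question similarity.
    return cosine_similarity
```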
5. The method of claim 1, wherein the question-answer system based on information retrieval comprises an ES database, a fine-ranking model and a reading understanding model, and the acquiring, by adopting the question-answer system based on information retrieval, a second score corresponding to any question-answer pair in the second question-answer pair subset comprises:
searching, in the ES database, for the question text in any question-answer pair in the second question-answer pair subset to obtain at least one answer text paragraph corresponding to the question text;
screening the at least one answer text paragraph by adopting the fine-ranking model to obtain a preset number of answer text paragraphs corresponding to the question text;
determining a ranking score corresponding to any answer text paragraph based on a first similarity between the question text and any answer text paragraph in the preset number of answer text paragraphs;
inputting the question text and any answer text paragraph into the reading understanding model, acquiring a second answer text corresponding to the question text in any answer text paragraph, and acquiring a reading understanding score corresponding to the second answer text;
determining a similarity score corresponding to any answer text paragraph according to the second similarity of the first answer text and the second answer text;
and determining a second score corresponding to any question-answer pair according to the ranking score, the reading understanding score and the similarity score.
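For illustration only, the retrieval-based scoring of claim 5 can be sketched as below. The ES search call is written against the elasticsearch-py 8.x client, and the fine-ranking and reading understanding models are represented by hypothetical callables rank_model and mrc_model; the index name, field name and the simple averaging at the end are assumptions, with the weighted combination of claim 6 shown afterwards.

```python
from typing import Callable, List, Tuple


def second_score_for_pair(
    question: str,
    first_answer: str,
    es_client,                                           # assumed elasticsearch-py 8.x client
    rank_model: Callable[[str, str], float],             # hypothetical fine-ranking model
    mrc_model: Callable[[str, str], Tuple[str, float]],  # hypothetical reading understanding model
    answer_similarity: Callable[[str, str], float],      # e.g. the overlap-based similarity above
    index: str = "qa_paragraphs",                        # illustrative index name
    top_k: int = 3,                                      # preset number of paragraphs
) -> float:
    # Search the ES database with the question text to get candidate answer text paragraphs.
    resp = es_client.search(index=index, query={"match": {"text": question}})
    paragraphs: List[str] = [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

    # Screen the paragraphs with the fine-ranking model, keeping a preset number of them.
    ranked = sorted(paragraphs, key=lambda p: rank_model(question, p), reverse=True)[:top_k]

    best = 0.0
    for paragraph in ranked:
        ranking_score = rank_model(question, paragraph)
        # Reading understanding: extract a second answer text and its score from the paragraph.
        second_answer, reading_score = mrc_model(question, paragraph)
        # Similarity score between the pair's own (first) answer text and the extracted one.
        similarity_score = answer_similarity(first_answer, second_answer)
        # Placeholder combination; claim 6 weights these three scores (see the formula below).
        best = max(best, (ranking_score + reading_score + similarity_score) / 3.0)
    return best
```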
6. The method of claim 5, wherein the determining the corresponding second score for any question-answer pair according to the ranking score, the reading understanding score, and the similarity score comprises:
acquiring a first weight corresponding to the ranking score;
acquiring a second weight corresponding to the reading understanding score;
acquiring a third weight corresponding to the similarity score;
and determining the second score corresponding to any question-answer pair according to the ranking score, the first weight, the reading understanding score, the second weight, the similarity score and the third weight.
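In formula form, one natural reading of claim 6 is a weighted sum of the three scores, with w1, w2 and w3 corresponding to the first, second and third weights; the claim itself does not fix how the weighted quantities are combined, so the sum below is an assumption:

```latex
\mathrm{score}_{2} = w_{1}\, s_{\mathrm{rank}} + w_{2}\, s_{\mathrm{read}} + w_{3}\, s_{\mathrm{sim}}
```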
7. The method of claim 5, wherein the method further comprises:
converting a text format of a full text into a target format to obtain the full text in the target format;
segmenting the full text in the target format to obtain at least one text paragraph in the target format;
and adding the at least one text paragraph in the target format to the ES database.
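For illustration only, a minimal sketch of the indexing step in claim 7, assuming plain text as the target format, blank-line segmentation, and the elasticsearch-py 8.x client; the index name, field name and conversion rule are assumptions:

```python
from typing import List

from elasticsearch import Elasticsearch  # assumed elasticsearch-py 8.x client


def index_full_text(raw_full_text: str, es_client: Elasticsearch, index: str = "qa_paragraphs") -> int:
    """Convert a full text to the target format, segment it, and add the paragraphs to the ES database."""
    # Target-format conversion: here simply normalize line endings to plain text.
    converted = raw_full_text.replace("\r\n", "\n")
    # Segment the full text in the target format into paragraphs on blank lines (illustrative rule).
    paragraphs: List[str] = [p.strip() for p in converted.split("\n\n") if p.strip()]
    # Add each text paragraph in the target format to the ES database.
    for i, paragraph in enumerate(paragraphs):
        es_client.index(index=index, id=i, document={"text": paragraph})
    return len(paragraphs)
```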
8. A system evaluation device, comprising:
a result acquisition unit, configured to acquire a mining result output by a question-answer mining system for input data, wherein the mining result comprises a question-answer pair set and a first score corresponding to any question-answer pair in the question-answer pair set;
a result determining unit, configured to determine a first question-answer pair subset meeting a recall condition according to the text similarity between the question-answer pair set and a target question-answer result corresponding to the input data, and to determine a recall rate evaluation result corresponding to the question-answer mining system according to the first question-answer pair subset;
a score obtaining unit, configured to obtain a second score corresponding to any question-answer pair in a second question-answer pair subset by using a question-answer system based on information retrieval, where the second question-answer pair subset is a set other than the first question-answer pair subset in the question-answer pair set;
The result determining unit is further configured to determine a target score corresponding to the any question-answer pair according to the first score corresponding to the any question-answer pair and the second score corresponding to the any question-answer pair, and determine an accuracy evaluation result corresponding to the question-answer mining system according to the target score corresponding to the any question-answer pair and the first question-answer pair subset.
9. The apparatus of claim 8, wherein, when determining the first question-answer pair subset that satisfies the recall condition according to the text similarity between the question-answer pair set and the target question-answer result corresponding to the input data, the result determining unit is specifically configured for:
obtaining answer similarity between any question-answer pair in the question-answer pair set and a target question-answer result corresponding to the input data;
obtaining the similarity of the questions between any question-answer pair and the target question-answer result;
determining the text similarity between any question-answer pair and the target question-answer result according to the answer similarity and the question similarity;
and adding any question-answer pair to the first question-answer pair subset under the condition that the text similarity between the any question-answer pair and the target question-answer result is larger than a similarity threshold value.
10. The apparatus of claim 9, wherein, when obtaining the answer similarity between any question-answer pair in the question-answer pair set and the target question-answer result corresponding to the input data, the result determining unit is specifically configured for:
cutting answer texts in any question-answer pair and answer texts in a target question-answer result corresponding to the input data respectively to obtain a first answer text and a second answer text, wherein the text length of the first answer text is smaller than that of the second answer text;
determining the number of overlapping text fragments of the first answer text on the second answer text;
and determining the answer similarity between any question-answer pair in the question-answer pair set and the target question-answer result according to the number of overlapping text fragments and the text length of the second answer text.
11. The apparatus of claim 9, wherein, when obtaining the similarity of the questions between any question-answer pair and the target question-answer result, the result determining unit is specifically configured for:
acquiring a first text feature vector corresponding to a first question text in any question-answer pair;
acquiring a second text feature vector corresponding to a second question text in the target question-answer result;
determining the distance between the first text feature vector and the second text feature vector in space by adopting a cosine distance determination mode;
and determining the similarity of the questions between any question and answer pair and the target question and answer result according to the distance.
12. The apparatus of claim 8, wherein the question-answer system based on information retrieval comprises an ES database, a fine-ranking model and a reading understanding model, and when obtaining the second score corresponding to any question-answer pair in the second question-answer pair subset by adopting the question-answer system based on information retrieval, the score obtaining unit is specifically configured for:
searching, in the ES database, for the question text in any question-answer pair in the second question-answer pair subset to obtain at least one answer text paragraph corresponding to the question text;
screening the at least one answer text paragraph by adopting the fine-ranking model to obtain a preset number of answer text paragraphs corresponding to the question text;
determining a ranking score corresponding to any answer text paragraph based on a first similarity between the question text and any answer text paragraph in the preset number of answer text paragraphs;
inputting the question text and any answer text paragraph into the reading understanding model, acquiring a second answer text corresponding to the question text in any answer text paragraph, and acquiring a reading understanding score corresponding to the second answer text;
determining a similarity score corresponding to any answer text paragraph according to the second similarity of the first answer text and the second answer text;
and determining a second score corresponding to any question-answer pair according to the ranking score, the reading understanding score and the similarity score.
13. The apparatus of claim 12, wherein, when determining the second score corresponding to any question-answer pair according to the ranking score, the reading understanding score and the similarity score, the score obtaining unit is specifically configured for:
acquiring a first weight corresponding to the ranking score;
acquiring a second weight corresponding to the reading understanding score;
acquiring a third weight corresponding to the similarity score;
and determining the second score corresponding to any question-answer pair according to the ranking score, the first weight, the reading understanding score, the second weight, the similarity score and the third weight.
14. The apparatus of claim 12, wherein the apparatus further comprises a paragraph adding unit configured to:
converting a text format of a full text into a target format to obtain the full text in the target format;
segmenting the full text in the target format to obtain at least one text paragraph in the target format;
and adding the at least one text paragraph in the target format to the ES database.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202310205760.5A 2023-03-03 2023-03-03 System evaluation method and device, electronic equipment and storage medium Pending CN116127044A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310205760.5A CN116127044A (en) 2023-03-03 2023-03-03 System evaluation method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116127044A true CN116127044A (en) 2023-05-16

Family

ID=86310135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310205760.5A Pending CN116127044A (en) 2023-03-03 2023-03-03 System evaluation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116127044A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination