CN116701604A

CN116701604A - Question and answer corpus construction method and device, question and answer method, equipment and medium

Info

Publication number: CN116701604A
Application number: CN202310840629.6A
Authority: CN
Inventors: 谢忠玉
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2023-07-07
Filing date: 2023-07-07
Publication date: 2023-09-05

Abstract

The embodiment of the application provides a method and a device for constructing a question-answer corpus, a question-answer method, equipment and a medium, and belongs to the technical field of financial science and technology. The method comprises the following steps: acquiring historical dialogue data; the historical dialogue data comprises historical question and answer data and dialogue identification data; performing data segmentation processing on the historical question-answer data according to the dialogue identification data to obtain initial question-answer data; screening the data quantity of the initial question-answer data according to the first preset data quantity to obtain candidate question-answer data; performing content identification on the original question data according to a preset target question model to obtain a target question label of the original question data; the target question label is used for indicating that the content category of the original question data is a question category or a non-question category; screening the candidate question-answer data according to the target question label to obtain target question-answer data; and constructing a question-answer corpus according to the target question-answer data. The method and the device can improve the construction efficiency of the question-answer corpus.

Description

Question and answer corpus construction method and device, question and answer method, equipment and medium

Technical Field

The application relates to the technical field of financial science and technology, in particular to a method and a device for constructing a question-answer corpus, a question-answer method, equipment and a medium.

Background

Currently, in the field of financial technology, users' questions, such as insurance consultation questions, can be answered intelligently based on a question-answer corpus. In the related art, a question-answer corpus is constructed or expanded by manual labeling. However, manual labeling is time-consuming, which reduces the construction efficiency of the question-answer corpus. Therefore, how to improve the construction efficiency of the question-answer corpus is a technical problem to be solved.

Disclosure of Invention

The embodiment of the application mainly aims to provide a method, a device, equipment and a medium for constructing a question-answer corpus, and aims to improve the construction efficiency of the question-answer corpus.

In order to achieve the above objective, a first aspect of an embodiment of the present application provides a method for constructing a question-answer corpus, where the method includes:

acquiring historical dialogue data; wherein the historical dialogue data comprises historical question-answer data and dialogue identification data; the dialogue identification data is used for identifying question and answer information of the historical question and answer data;

Performing data segmentation processing on the historical question-answer data according to the dialogue identification data to obtain initial question-answer data;

screening the data quantity of the initial question-answer data according to a first preset data quantity to obtain candidate question-answer data; wherein the candidate question-answer data comprises original question data;

performing content recognition on the original question data according to a preset target question model to obtain a target question label of the original question data; the target question label is used for indicating that the content category of the original question data is a question category or a non-question category;

screening the candidate question-answer data according to the target question label to obtain target question-answer data;

and constructing a question-answer corpus according to the target question-answer data.

In some embodiments, the target question label includes a question positive label, where the question positive label is used to indicate that a content category of the original question data is a question category;

the step of screening the candidate question-answer data according to the target question label to obtain target question-answer data comprises the following steps:

screening target question data from the original question data according to the question positive label;

Acquiring the data volume of the target question data in the candidate question-answer data to obtain a target data volume;

and if the target data volume is smaller than a second preset data volume, taking the candidate question-answer data as the target question-answer data.
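The screening by question positive label can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the dictionary layout of a candidate, the function name `screen_by_question_label`, and the stand-in labeller `toy_label` (which simply treats text ending in a question mark as a question) are all assumptions made for the example. The comparison follows the text literally: a candidate is kept when its target data volume is smaller than the second preset data volume.

```python
from typing import Callable, Dict, List


def screen_by_question_label(
    candidates: List[Dict[str, List[str]]],
    is_question: Callable[[str], bool],
    second_preset_volume: int,
) -> List[Dict[str, List[str]]]:
    """Keep candidates whose number of question-category items is below the threshold."""
    targets = []
    for candidate in candidates:
        # Target question data: original question data carrying the positive label.
        target_questions = [q for q in candidate["questions"] if is_question(q)]
        target_volume = len(target_questions)
        if target_volume < second_preset_volume:
            targets.append(candidate)
    return targets


def toy_label(text: str) -> bool:
    """Hypothetical stand-in for the target question model."""
    return text.strip().endswith("?")


candidates = [
    {"questions": ["What is the premium?", "ok thanks"], "answers": ["XX yuan per year."]},
    {"questions": ["Is A covered?", "Is B covered?", "Is C covered?"], "answers": ["Yes."]},
]
kept = screen_by_question_label(candidates, toy_label, second_preset_volume=2)
```

With a threshold of 2, the first candidate (one genuine question) is kept and the second (three genuine questions) is dropped, so each retained unit stays focused on a small number of questions.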

In some embodiments, the data amount screening of the initial question-answer data according to the first preset data amount to obtain candidate question-answer data includes:

carrying out semantic recognition on the initial question-answer data to obtain an initial semantic vector;

carrying out semantic matching on the initial semantic vector and a preset semantic vector to obtain a first matching result;

taking the initial semantic vector for which the first matching result represents a semantic match as a target semantic vector, and screening key question-answer data from the initial question-answer data according to the target semantic vector;

and screening the data quantity of the key question-answer data according to the first preset data quantity to obtain the candidate question-answer data.
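The semantic screening steps above can be illustrated with a short sketch. The application does not specify the semantic recognition model or the matching criterion, so the example makes two loudly stated assumptions: the "semantic vector" is a toy bag-of-words count (`embed`), and a semantic match means cosine similarity at or above a hypothetical `threshold`. Each initial question-answer item is represented here as a single text string.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Dict[str, float]:
    """Toy semantic vector: lower-cased word counts (stand-in for a real encoder)."""
    return dict(Counter(text.lower().split()))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_key_data(
    initial_items: List[str],
    preset_vectors: List[Dict[str, float]],
    threshold: float,
) -> List[str]:
    """Keep items whose vector matches at least one preset semantic vector."""
    key_items = []
    for item in initial_items:
        vector = embed(item)
        if any(cosine(vector, preset) >= threshold for preset in preset_vectors):
            key_items.append(item)
    return key_items


preset = [embed("insurance premium coverage")]
items = ["what is the insurance premium", "nice weather today"]
key_items = select_key_data(items, preset, threshold=0.3)
```

Only the first item shares vocabulary with the preset vector, so it alone survives as key question-answer data; the data volume screening would then be applied to `key_items`.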

In some embodiments, the key question-answer data includes key question data;

the step of screening the data quantity of the key question-answer data according to the first preset data quantity to obtain the candidate question-answer data comprises the following steps:

acquiring the data volume of the key question data to obtain a question data volume;

and if the question data volume is smaller than the first preset data volume, taking the key question-answer data as the candidate question-answer data.

In some embodiments, before the content recognition is performed on the original question data according to the preset target question model to obtain the target question label of the original question data, the method further includes training the target question model, and specifically includes:

acquiring sample question data and a sample question label of the sample question data; the sample question label is used for representing whether the content category of the sample question data is a question category or a non-question category;

performing content identification on the sample question data according to a preset original question model to obtain an original question label; the original question label is used for representing whether the content category of the sample question data is a question category or a non-question category;

and carrying out parameter adjustment on the original question model according to the sample question label and the original question label to obtain the target question model.
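The training procedure can be sketched as a simple supervised loop. The application does not disclose the model architecture or the loss, so this example substitutes a small logistic-regression classifier trained by stochastic gradient steps; the feature extractor `featurize`, the learning rate and the epoch count are all assumptions chosen for illustration. The structure mirrors the steps above: predict an original question label, compare it with the sample question label, and adjust the parameters.

```python
import math
from typing import Dict, List, Tuple


def featurize(text: str) -> Dict[str, float]:
    """Toy features: word presence, a question-mark flag, and a bias term."""
    features = {word: 1.0 for word in text.lower().replace("?", " ").split()}
    if "?" in text:
        features["<qmark>"] = 1.0
    features["<bias>"] = 1.0
    return features


def predict(weights: Dict[str, float], text: str) -> float:
    """Probability that the text belongs to the question category."""
    score = sum(weights.get(name, 0.0) * value for name, value in featurize(text).items())
    return 1.0 / (1.0 + math.exp(-score))


def train(samples: List[Tuple[str, int]], epochs: int = 200, lr: float = 0.5) -> Dict[str, float]:
    """Adjust parameters from the gap between sample labels and model outputs."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for text, sample_label in samples:
            original_label = predict(weights, text)  # output of the current model
            error = sample_label - original_label
            for name, value in featurize(text).items():
                weights[name] = weights.get(name, 0.0) + lr * error * value
    return weights


samples = [
    ("what is the premium?", 1),
    ("does this cover hospitalization?", 1),
    ("ok thank you", 0),
    ("hello", 0),
]
target_model = train(samples)
```

After training, `predict(target_model, text) > 0.5` can serve as the question positive label for the training examples; a production system would use a held-out set and a stronger text encoder.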

In some embodiments, the acquiring of sample question data comprises:

acquiring training question data;

carrying out semantic recognition on the training question data to obtain training semantic vectors;

carrying out semantic matching on the training semantic vector and a preset semantic vector to obtain a second matching result;

and taking the training semantic vector of which the second matching result represents semantic matching as a key semantic vector, and screening the sample question data from the training question data according to the key semantic vector.

In order to achieve the above object, a second aspect of the embodiments of the present application provides a question-answering method, which includes:

obtaining data to be answered;

carrying out semantic matching on the data to be answered and target question-answer data in a preset question-answer corpus to obtain a third matching result; the question-answer corpus is constructed by the method according to the first aspect;

taking the target question-answer data for which the third matching result represents a semantic match as key question-answer data; wherein the key question-answer data comprises key answer data;

and carrying out reply processing according to the key answer data to obtain answer data for the data to be answered.
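The question-answering flow above can be sketched as a nearest-match lookup. As assumptions for this example only, the corpus is a list of (question, answer) pairs, semantic matching is approximated by cosine similarity over word counts with a hypothetical `threshold`, and "reply processing" is reduced to returning the key answer data unchanged.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy semantic vector: lower-cased word counts (stand-in for a real encoder)."""
    return dict(Counter(text.lower().split()))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer(query: str, corpus: List[Tuple[str, str]], threshold: float) -> Optional[str]:
    """Return the answer of the best-matching corpus entry, or None if nothing matches."""
    query_vector = embed(query)
    best_score, best_answer = 0.0, None
    for question, reply in corpus:
        score = cosine(query_vector, embed(question))
        if score >= threshold and score > best_score:
            best_score, best_answer = score, reply
    return best_answer


corpus = [
    ("what risks does this insurance cover", "It covers accident and hospitalization risks."),
    ("what is the premium", "The premium is XX yuan per year."),
]
reply = answer("what is the premium for this insurance", corpus, threshold=0.5)
```

Returning `None` for an unmatched query leaves room for a fallback, such as handing the conversation to a human agent; that fallback is not described in the application and is mentioned only as a design option.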

In order to achieve the above object, a third aspect of the embodiments of the present application provides a device for constructing a question-answer corpus, where the device includes:

The data acquisition module is used for acquiring historical dialogue data; wherein the historical dialogue data comprises historical question-answer data and dialogue identification data; the dialogue identification data is used for identifying question and answer information of the historical question and answer data;

the data segmentation module is used for carrying out data segmentation processing on the historical question-answer data according to the dialogue identification data to obtain initial question-answer data;

the first data screening module is used for screening the data quantity of the initial question-answer data according to a first preset data quantity to obtain candidate question-answer data; wherein the candidate question-answer data comprises original question data;

the content identification module is used for carrying out content identification on the original question data according to a preset target question model to obtain a target question label of the original question data; the target question label is used for indicating that the content category of the original question data is a question category or a non-question category;

the second data screening module is used for screening the candidate question-answer data according to the target question label to obtain target question-answer data;

and the question-answer corpus construction module is used for constructing a question-answer corpus according to the target question-answer data.

To achieve the above object, a fourth aspect of the embodiments of the present application proposes an electronic device, including a memory storing a computer program and a processor implementing the method according to the first aspect or the second aspect when the processor executes the computer program.

To achieve the above object, a fifth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect or the second aspect.

The application provides a method, a device, equipment and a medium for constructing a question-answer corpus, in which content identification is performed on the original question data through a target question model to obtain the corresponding target question labels. The candidate question-answer data is screened according to the target question labels to obtain target question-answer data, and a question-answer corpus is constructed according to the target question-answer data. This avoids constructing the question-answer corpus by manual labeling as in the related art; building the corpus from the data output by the target question model improves construction efficiency. In addition, before content identification is performed by the target question model, the embodiment of the application performs data segmentation processing on the historical question-answer data according to the dialogue identification data and performs data volume screening on the initial question-answer data according to the first preset data volume. This reduces the volume and safeguards the quality of the data input into the target question model, thereby improving both the construction efficiency and the construction accuracy of the question-answer corpus. When the application is applied to insurance scenarios in financial technology, the construction efficiency and the construction accuracy of the insurance question-answer corpus can be improved.

Drawings

FIG. 1 is a flow chart of a method for constructing a question-answer corpus provided by an embodiment of the application;

fig. 2 is a flowchart of step S103 in fig. 1;

fig. 3 is a flowchart of step S204 in fig. 2;

fig. 4 is a flowchart further included before step S104 in fig. 1;

fig. 5 is a flowchart of step S401 in fig. 4;

fig. 6 is a flowchart of step S105 in fig. 1;

FIG. 7 is a flow chart of a question-answering method provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a device for constructing a question-answer corpus according to an embodiment of the present application;

fig. 9 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.

First, several terms used in the present application are explained:

Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also the theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.

Natural language processing (NLP): NLP is a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often referred to as computational linguistics; it is concerned with processing, understanding, and applying human languages (e.g., Chinese, English, etc.). Natural language processing includes syntactic parsing, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in technical fields such as machine translation, handwritten and printed character recognition, speech recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining, and it involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language computation, and the like.

Based on the above, the embodiment of the application provides a method and a device for constructing a question-answer corpus, a question-answer method, a device and a medium, aiming at improving the construction efficiency of the question-answer corpus.

The method and device for constructing the question-answer corpus, the question-answer method, the equipment and the medium provided by the embodiment of the application are specifically described through the following embodiments; the method for constructing the question-answer corpus in the embodiment of the application is described first.

The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.

Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.

The embodiment of the application provides a construction method of a question-answer corpus, and relates to the technical field of finance and technology. The construction method of the question-answer corpus provided by the embodiment of the application can be applied to a terminal, a server and software running in the terminal or the server. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, etc.; the server side can be configured as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligent platforms and the like; the software may be an application or the like that implements a construction method of the question-answer corpus, but is not limited to the above form.

The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

It should be noted that, in each embodiment of the present application, when related processing is required according to user information, user dialogue data, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through popup or jump to a confirmation page and the like, and after the independent permission or independent consent of the user is definitely acquired, the necessary relevant data of the user for enabling the embodiment of the application to normally operate is acquired.

Fig. 1 is an optional flowchart of a method for constructing a question-answer corpus according to an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S106.

Step S101, acquiring historical dialogue data; the historical dialogue data comprises historical question and answer data and dialogue identification data; the dialogue identification data is used for identifying question and answer information of the historical question and answer data;

Step S102, performing data segmentation processing on the historical question-answer data according to the dialogue identification data to obtain initial question-answer data;

step S103, screening the data quantity of the initial question-answer data according to the first preset data quantity to obtain candidate question-answer data; the candidate question-answer data comprises original question data;

step S104, carrying out content identification on the original question data according to a preset target question model to obtain a target question label of the original question data; the target question label is used for indicating that the content category of the original question data is a question category or a non-question category;

step S105, screening the candidate question-answer data according to the target question label to obtain target question-answer data;

and S106, constructing a question-answer corpus according to the target question-answer data.

In steps S101 to S106 of the embodiment of the present application, content identification is performed on the original question data through the target question model to obtain the corresponding target question labels. The candidate question-answer data is screened according to the target question labels to obtain target question-answer data, and a question-answer corpus is constructed according to the target question-answer data. This avoids constructing the question-answer corpus by manual labeling as in the related art; building the corpus from the data output by the target question model improves construction efficiency. In addition, before content identification is performed by the target question model, the embodiment of the application performs data segmentation processing on the historical question-answer data according to the dialogue identification data and performs data volume screening on the initial question-answer data according to the first preset data volume. This reduces the volume and safeguards the quality of the data input into the target question model, thereby improving both the construction efficiency and the construction accuracy of the question-answer corpus. When the application is applied to insurance scenarios in financial technology, the construction efficiency and the construction accuracy of the insurance question-answer corpus can be improved.
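Steps S101 to S106 can be sketched end to end as a small pipeline. The record layout (`session_id`, `role`, `text`), the function name `build_corpus`, and the stand-in labeller are assumptions made for illustration; the data volume of a segment is taken here to be its number of records, which the application does not state explicitly. The thresholds follow the text: segments are kept when their volume does not exceed the first preset data volume, and candidates are kept when the number of question-labelled items is smaller than the second preset data volume.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Record = Dict[str, str]  # assumed keys: "session_id", "role", "text"


def build_corpus(
    records: List[Record],
    first_preset_volume: int,
    second_preset_volume: int,
    is_question: Callable[[str], bool],
) -> List[List[Record]]:
    """Segment, screen by volume, label questions, screen by label, collect."""
    # S102: segment the historical data by session ID.
    sessions: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        sessions[record["session_id"]].append(record)

    corpus = []
    for session in sessions.values():
        # S103: drop segments whose data volume exceeds the first preset volume.
        if len(session) > first_preset_volume:
            continue
        # S104: label each piece of original question data.
        questions = [r["text"] for r in session if r["role"] == "question"]
        target_questions = [q for q in questions if is_question(q)]
        # S105: keep the candidate if its target data volume is below the second preset volume.
        if len(target_questions) < second_preset_volume:
            corpus.append(session)  # S106: add to the question-answer corpus.
    return corpus


records = [
    {"session_id": "s1", "role": "question", "text": "What is the premium?"},
    {"session_id": "s1", "role": "answer", "text": "The premium is XX yuan per year."},
    {"session_id": "s2", "role": "question", "text": "Is A covered?"},
    {"session_id": "s2", "role": "answer", "text": "Yes."},
    {"session_id": "s2", "role": "question", "text": "Is B covered?"},
    {"session_id": "s2", "role": "answer", "text": "Yes."},
    {"session_id": "s2", "role": "question", "text": "Is C covered?"},
]
corpus = build_corpus(records, 4, 2, lambda text: text.strip().endswith("?"))
```

Session `s1` passes both screens, while session `s2` is removed at S103 because it holds five records.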

In step S101 of some embodiments, historical dialogue data of the question object and the answer object is obtained through an application programming interface (API) of a related application or the like. It is understood that the historical dialogue data includes historical question-answer data and dialogue identification data. The historical question-answer data comprises question data of a question object and answer data generated when the answer object answers the question data. The historical question-answer data is a set of a plurality of question data and a plurality of answer data, i.e., the historical question-answer data may include question data of a plurality of question objects. For example, when the present application is applied to a terminal, the terminal may be installed with an application having a reply function, such as an e-commerce application, a voice assistant, a blog forum, and the like. Question data corresponding to different question objects and the corresponding answer data are acquired through the API of the application, thereby constructing the historical question-answer data. When the application is applied to an insurance scenario, the historical question-answer data may include question data such as "What risks does this insurance cover? What is the premium?" and answer data such as "This insurance covers risks such as accident, hospitalization and benefits, and the premium is XX yuan per year." When an application includes multiple business functions, different historical question-answer data can be constructed for the different business functions in order to improve the efficiency and accuracy of subsequent answer processing based on the question-answer corpus. For example, the terminal is installed with an insurance management application including an insurance consultation function, an insurance purchase function, and the like. In this case, the historical question-answer data corresponding to the insurance consultation function and the historical question-answer data corresponding to the insurance purchase function are acquired separately according to the above method.

Further, the data category of the historical question-answer data may be any one of text, image, and audio. For example, in an insurance scenario, question data is acquired from the text data that the question object enters in the customer service interface of the insurance management application; for a voice assistant, question data is obtained by capturing the audio data of the question object through the audio acquisition device of the terminal. It will be appreciated that, to facilitate data analysis, when the historical question-answer data includes question data of different data categories and/or answer data of different data categories, the data categories need to be normalized. Specifically, data of the image category is converted into data of the text category by optical character recognition (Optical Character Recognition, OCR), and data of the audio category is converted into data of the text category by a speech recognition method.
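The normalization of data categories can be sketched as a small dispatcher. The application does not name any OCR or speech recognition engine, so the recognizers are passed in as callables; the item layout (`category`, `payload`) and the stub recognizers below are hypothetical and exist only to make the example runnable.

```python
from typing import Any, Callable, Dict


def normalize_to_text(
    item: Dict[str, Any],
    ocr: Callable[[bytes], str],
    speech_to_text: Callable[[bytes], str],
) -> str:
    """Convert one question or answer item to the text category."""
    category = item["category"]
    payload = item["payload"]
    if category == "text":
        return str(payload)
    if category == "image":
        return ocr(payload)  # image category -> text via OCR
    if category == "audio":
        return speech_to_text(payload)  # audio category -> text via speech recognition
    raise ValueError(f"unsupported data category: {category!r}")


# Stub recognizers standing in for real OCR and speech recognition engines.
def fake_ocr(data: bytes) -> str:
    return "text read from image"


def fake_speech_to_text(data: bytes) -> str:
    return "text heard in audio"


items = [
    {"category": "text", "payload": "What is the premium?"},
    {"category": "image", "payload": b"\x89PNG"},
    {"category": "audio", "payload": b"RIFF"},
]
texts = [normalize_to_text(item, fake_ocr, fake_speech_to_text) for item in items]
```

Injecting the recognizers keeps the dispatcher independent of any particular engine, so either one can be swapped without touching the normalization logic.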

It is understood that, since the historical question-answer data is a set of a plurality of question data and a plurality of answer data, it has a large data volume. Moreover, different question data may be generated by different question objects, or by the same question object at different times. Therefore, so that the large volume of historical question-answer data can be segmented to facilitate subsequent data analysis, the historical dialogue data further includes a plurality of dialogue identification data. The dialogue identification data is used for identifying question and answer information of the question data and of the answer data corresponding to the question data, wherein the question and answer information comprises time information, login information, object information and the like. For example, when the question and answer information includes time information, the dialogue identification data is the generation time of the question data and of the answer data corresponding to the question data. For example, if the dialogue identification data of the question data A1 is 2023, 2, 9, 0 minutes, and the answer data corresponding to the question data A1 includes data A2 and data A3, then the dialogue identification data of the data A2 is 2023, 2, 9, 1 minutes, and the dialogue identification data of the data A3 is 2023, 2, 9, 2 minutes. When the question and answer information includes login information, the dialogue identification data is a session ID. It can be understood that when the terminal receives a question request instruction, for example one generated when the question object touches the send control of the customer service interface, the terminal allocates a session ID to the question object. In the current session, the question data generated by the question object and the answer data generated by the answer object in response all have the same session ID. The session lasts until the terminal receives a question ending instruction, which may be generated when the question object touches the interface-close control of the customer service interface. When the question and answer information includes object information, since different question objects have different login IDs in the application, the login IDs may also be used as dialogue identification data.

In step S102 of some embodiments, since the historical question-answer data has a large data volume, data segmentation processing is performed on it to obtain initial question-answer data, so that subsequent steps can use the smaller initial question-answer data as the unit of data analysis. This improves the efficiency and accuracy of data analysis, and in turn the construction efficiency and construction accuracy of the question-answer corpus. Specifically, the historical question-answer data is segmented according to the dialogue identification data to obtain a plurality of initial question-answer data. For example, when the dialogue identification data is time data, the question data and answer data within a preset duration are segmented into one initial question-answer data; if the preset duration is two minutes, the question data A1, the data A2 and the data A3 are segmented into one initial question-answer data. When the dialogue identification data is a session ID, question data and answer data having the same session ID are segmented into one initial question-answer data. When the dialogue identification data is a login ID, question data and answer data having the same login ID are segmented into one initial question-answer data. It may be understood that, according to actual needs, the historical question-answer data may also be segmented according to a combination of different dialogue identification data, for example by combining time data with login IDs; this is not specifically limited in the embodiment of the present application.
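Two of the segmentation strategies described above can be sketched as follows. The record layout (`session_id`, `time`, `text`) and the function names are assumptions for illustration, and the timestamps in the example use an arbitrary hour that is not taken from the application. For time-based segmentation, one possible reading is used: a record joins the current segment when it falls within the preset duration of that segment's first record.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Any, Dict, List

Record = Dict[str, Any]  # assumed keys: "session_id", "time", "text"


def split_by_session(records: List[Record]) -> List[List[Record]]:
    """Group records that share the same session ID into one segment."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[record["session_id"]].append(record)
    return list(groups.values())


def split_by_time(records: List[Record], preset_duration: timedelta) -> List[List[Record]]:
    """Group records lying within preset_duration of the first record of a segment."""
    segments: List[List[Record]] = []
    for record in sorted(records, key=lambda r: r["time"]):
        if segments and record["time"] - segments[-1][0]["time"] <= preset_duration:
            segments[-1].append(record)
        else:
            segments.append([record])
    return segments


# Illustrative records; the 09:00 hour is made up for the example.
records = [
    {"session_id": "s1", "time": datetime(2023, 2, 9, 9, 0), "text": "A1"},
    {"session_id": "s1", "time": datetime(2023, 2, 9, 9, 1), "text": "A2"},
    {"session_id": "s1", "time": datetime(2023, 2, 9, 9, 2), "text": "A3"},
    {"session_id": "s2", "time": datetime(2023, 2, 9, 9, 10), "text": "B1"},
]
by_session = split_by_session(records)
by_time = split_by_time(records, timedelta(minutes=2))
```

With a two-minute preset duration, A1, A2 and A3 fall into one segment, matching the example in the text, and B1 starts a new one. Segmenting by login ID would follow the same grouping pattern as `split_by_session` with a different key.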

It will be appreciated that the span of a session ID is bounded by the question ending instruction, and the question object will typically not touch the interface closing control until the question data has been completely answered. Therefore, compared with using time data or a login ID as dialogue identification data, using the session ID as dialogue identification data can improve the integrity of the initial question-answer data, that is, the answer data in the initial question-answer data can answer the question data more comprehensively. For convenience of explanation, the following embodiments take the session ID as the dialogue identification data by way of example.

In step S103 of some embodiments, the initial question-answer data includes the question data and answer data of one session. Because the question object may generate a large amount of question data in one session, the data amount of the initial question-answer data may still be large, so data amount screening needs to be performed on the initial question-answer data according to the first preset data amount. Specifically, initial question-answer data whose data amount is smaller than or equal to the first preset data amount is taken as candidate question-answer data, and initial question-answer data whose data amount is larger than the first preset data amount is filtered out.

Referring to fig. 2, in some embodiments, step S103 includes, but is not limited to, steps S201 to S204.

Step S201, performing semantic recognition on the initial question-answer data to obtain initial semantic vectors;

Step S202, performing semantic matching between the initial semantic vectors and preset semantic vectors to obtain a first matching result;

Step S203, taking each initial semantic vector for which the first matching result indicates a semantic match as a target semantic vector, and screening key question-answer data from the initial question-answer data according to the target semantic vector;

Step S204, performing data amount screening on the key question-answer data according to the first preset data amount to obtain candidate question-answer data.

In step S201 of some embodiments, semantic recognition is performed on the initial question-answer data by means of NLP or the like, so as to obtain an initial semantic vector corresponding to the question data and an initial semantic vector corresponding to the answer data in the initial question-answer data.

In step S202 of some embodiments, preset semantic vectors are set in advance according to the data content expected of the candidate question-answer data. For example, when the candidate question-answer data is expected to exclude invalid dialogue content such as small-talk greetings, preset semantic vectors corresponding to greeting data such as "hello" and "happy to serve you" are set. Semantic matching is performed between the preset semantic vectors and the initial semantic vectors to obtain a first matching result. The first matching result is used for representing whether an initial semantic vector is a semantic vector corresponding to greeting data, that is, whether the question data or answer data corresponding to the initial semantic vector is greeting data.

In step S203 of some embodiments, when the first matching result indicates a semantic match, the semantics of the initial semantic vector correspond to content such as a small-talk greeting. In this case, the initial semantic vector is taken as a target semantic vector, and the question data or answer data corresponding to the target semantic vector is taken as target semantic data. The target semantic data is filtered out of the initial question-answer data, and the data remaining in the initial question-answer data after filtering is taken as key question-answer data, thereby achieving preliminary data cleaning of the question data and/or answer data in the initial question-answer data.

In step S204 of some embodiments, the data amount of the key question-answer data may still be large even after the preliminary data cleaning. Therefore, the key question-answer data is screened again according to the first preset data amount to obtain candidate question-answer data. Specifically, key question-answer data whose data amount is smaller than or equal to the first preset data amount is taken as candidate question-answer data, and key question-answer data whose data amount is larger than the first preset data amount is filtered out.
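Steps S201 to S204 can be illustrated with the following minimal sketch. It rests on several assumptions that are not part of the embodiment: semantic recognition is replaced by a toy bag-of-words vector (`embed`), semantic matching by cosine similarity against a hypothetical threshold of 0.8, and the data amount is counted as the number of utterances in a unit.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for semantic recognition: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Preset semantic vectors for greeting data (examples taken from the description).
PRESET_GREETINGS = [embed("hello"), embed("happy to serve you")]


def is_greeting(text, threshold=0.8):
    """First matching result: True when the utterance matches a preset greeting vector."""
    vec = embed(text)
    return any(cosine(vec, g) >= threshold for g in PRESET_GREETINGS)


def screen_unit(unit, first_preset_amount=6):
    """Drop greeting utterances (S203), then apply the data amount limit (S204).

    unit is a list of (role, text) pairs; returns the key question-answer data,
    or None when the unit exceeds the first preset data amount.
    """
    key_qa = [(role, text) for role, text in unit if not is_greeting(text)]
    return key_qa if len(key_qa) <= first_preset_amount else None


unit = [
    ("question", "Hello"),
    ("answer", "Happy to serve you"),
    ("question", "What does plan B cover?"),
    ("answer", "Plan B covers hospital stays."),
]
key_qa = screen_unit(unit)
```

In practice the toy `embed` function would be replaced by whatever semantic recognition model the deployment uses; the screening logic itself does not depend on how the vectors are produced.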

Referring to fig. 3, in some embodiments, the key question-answer data includes key question data, and step S204 includes, but is not limited to, steps S301 to S302.

Step S301, acquiring the data amount of the key question data in the key question-answer data to obtain a question data amount;

Step S302, if the question data amount is smaller than the first preset data amount, taking the key question-answer data as candidate question-answer data.

It will be appreciated that the amount of answer data in the key question-answer data is generally tied to the amount of question data: the more question data the question object generates, the more answer data the answer object needs to generate, in an equal or greater amount. Therefore, constraining the amount of question data in the key question-answer data achieves the aim of constraining the overall data amount of the candidate question-answer data. The difference from the above embodiment is that the above embodiment applies the first preset data amount to the overall data amount of the key question-answer data, whereas this embodiment applies it to the question data in the key question-answer data. Therefore, this embodiment not only keeps the overall data amount of the screened candidate question-answer data small, but also keeps the amount of question data in the screened candidate question-answer data controllable.

In step S301 of some embodiments, the question data in the key question-answer data is taken as key question data, and the data amount of the key question data in the key question-answer data is determined to obtain the question data amount.

In step S302 of some embodiments, the question data amount is compared with the first preset data amount. For example, if the first preset data amount is 3, key question-answer data whose question data amount is smaller than 3 contains only a small number of key question data, and a small number of key question data facilitates semantic analysis of both the question content and the answer content. Therefore, key question-answer data whose question data amount is smaller than the first preset data amount is taken as candidate question-answer data.
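A minimal sketch of steps S301 to S302 follows, assuming that each key question-answer data is represented as a list of `(role, text)` pairs and that the question data amount is the number of pairs whose role is "question". Both the representation and the function names are illustrative assumptions.

```python
def question_amount(key_qa):
    """S301: count the key question data in one key question-answer unit."""
    return sum(1 for role, _ in key_qa if role == "question")


def select_candidates(key_qa_units, first_preset_amount=3):
    """S302: keep units whose question data amount is strictly below the limit."""
    return [u for u in key_qa_units if question_amount(u) < first_preset_amount]


short_unit = [
    ("question", "What does plan B cover?"),
    ("answer", "Plan B covers hospital stays."),
]
long_unit = [
    ("question", "What does plan B cover?"),
    ("question", "How much does it cost?"),
    ("question", "Can I add my spouse?"),
    ("answer", "Let me check each of these for you."),
]
candidates = select_candidates([short_unit, long_unit], first_preset_amount=3)
```

Note that the comparison here is strict ("smaller than"), as in step S302, whereas the screening on the overall data amount in step S204 uses "smaller than or equal to".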

Referring to fig. 4, in some embodiments, before step S104, the method provided in the embodiments of the present application further includes training the target problem model, including, but not limited to, steps S401 to S403.

Step S401, acquiring sample question data and a sample question label of the sample question data; the sample question label is used for representing whether the content category of the sample question data is a question category or a non-question category;

Step S402, performing content identification on the sample question data according to a preset original problem model to obtain an original question label; the original question label is used for representing whether the content category of the sample question data is a question category or a non-question category;

Step S403, performing parameter adjustment on the original problem model according to the sample question label and the original question label to obtain the target problem model.

In step S401 of some embodiments, sample question data and sample question labels of the sample question data, obtained by annotation or the like, are acquired. The sample question label is used for representing whether the content category of the sample question data is a question category or a non-question category. It can be understood that the sample question data can be acquired in the same manner as the historical question-answer data, which is not repeated in this embodiment of the application. Similarly, like the historical question-answer data, the data type of the sample question data may be any one of text, image and audio, which is not particularly limited in this embodiment of the application. It should be noted that the sample question data is only used for training the target problem model, and the training objective of the target problem model is to accurately identify the content category of the sample question data; that is, the training of the target problem model is independent of answer data. Thus, whereas question data in the historical question-answer data must have a mapping relationship with some answer data, sample question data may or may not have such a mapping relationship, i.e., the sample question data may be a question with no corresponding answer. Question data and answer data are determined to have a mapping relationship when the semantic content of the answer data answers the semantic content of the question data.

Referring to fig. 5, in some embodiments, step S401 includes, but is not limited to, steps S501 through S504.

Step S501, acquiring training question data;

Step S502, performing semantic recognition on the training question data to obtain training semantic vectors;

Step S503, performing semantic matching between the training semantic vectors and preset semantic vectors to obtain a second matching result;

Step S504, taking each training semantic vector for which the second matching result indicates a semantic match as a key semantic vector, and screening sample question data from the training question data according to the key semantic vector.

In step S501 of some embodiments, training question data is obtained through a relevant API or the like. It can be understood that the training question data is data generated by a question object, and the training question data can be acquired in the same manner as the historical question-answer data, which is not described in detail in this embodiment of the present application.

In step S502 of some embodiments, semantic recognition is performed on the training question data by means of NLP or the like, so as to obtain training semantic vectors corresponding to the training question data.

In step S503 of some embodiments, preset semantic vectors corresponding to small-talk greeting data are set in advance, for example, preset semantic vectors corresponding to greeting data such as "hello" and "thank you for answering". Semantic matching is performed between the preset semantic vectors and the training semantic vectors to obtain a second matching result. The second matching result is used for representing whether a training semantic vector is a semantic vector corresponding to greeting data, that is, whether the training question data corresponding to the training semantic vector is greeting data.

In step S504 of some embodiments, when the second matching result indicates a semantic match, the semantics of the training semantic vector correspond to content such as a small-talk greeting. In this case, the training semantic vector is taken as a key semantic vector, and the training question data corresponding to the key semantic vector is taken as key semantic data. The key semantic data is filtered out, and the training question data remaining after filtering is taken as sample question data, thereby achieving data cleaning of the training question data.

In step S402 of some embodiments, an original problem model based on a classifier structure or a Transformer structure is built in advance. The model structure of the original problem model may therefore be any one of the following: Bidirectional Encoder Representations from Transformers (BERT), a support vector machine, logistic regression, a decision tree, and the like. The sample question data is taken as input data of the original problem model, content identification is performed on the sample question data through the original problem model, and whether the sample question data is question data or non-question data is determined according to the identification result, so as to obtain the original question label corresponding to the sample question data. If the original question label indicates that the content category of the sample question data is a question category, the sample question data is question data. If the original question label indicates that the content category of the sample question data is a non-question category, the sample question data is non-question data. Question data is data from which the content being asked can be accurately determined, and non-question data is data from which no asked content can be obtained. For example, "I want to consult the coverage of XX insurance" is question data, while "My name is Xiao, 20 years old" is non-question data.

In step S403 of some embodiments, a loss calculation is performed on the sample question label and the original question label according to a preset loss function, so as to determine a content recognition error of the original question model. And carrying out parameter adjustment on the original problem model according to the calculated loss value to obtain a target problem model with more accurate content identification.
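Steps S402 to S403 can be illustrated with a small, self-contained sketch. It assumes logistic regression (one of the model structures listed above) over bag-of-words features, binary cross-entropy as the preset loss function, and plain stochastic gradient descent for parameter adjustment; the sample sentences, learning rate and epoch count are illustrative assumptions, not values from the embodiment.

```python
import math
import re


def features(text):
    """Toy features: the set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def predict_proba(weights, bias, text):
    """Probability that the text belongs to the question category."""
    z = bias + sum(weights.get(tok, 0.0) for tok in features(text))
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(weights, bias, samples):
    """Mean binary cross-entropy between predicted and sample question labels."""
    eps = 1e-12
    total = 0.0
    for text, label in samples:
        p = predict_proba(weights, bias, text)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(samples)


def train(samples, epochs=300, lr=0.1):
    """Adjust parameters by SGD; label 1 = question category, 0 = non-question."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in samples:
            grad = predict_proba(weights, bias, text) - label  # d(loss)/d(z)
            for tok in features(text):
                weights[tok] = weights.get(tok, 0.0) - lr * grad
            bias -= lr * grad
    return weights, bias


# Hypothetical labeled sample question data.
SAMPLES = [
    ("I want to consult the coverage of XX insurance", 1),
    ("How do I file a claim", 1),
    ("What is the premium for this plan", 1),
    ("My name is Li and I am 20 years old", 0),
    ("I live in Shenzhen", 0),
    ("I am a teacher", 0),
]
weights, bias = train(SAMPLES)
```

A BERT-based original problem model would follow the same outline (predict a label, compute a loss against the sample question label, adjust parameters), with the feature extraction and optimizer supplied by the chosen framework.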

In step S104 of some embodiments, the question data in the candidate question-answer data obtained by screening is taken as the original question data. The target problem model which can accurately identify the data content to determine the content category is obtained through pre-training. And taking the original question data as input data of the target question model, carrying out content identification on the original question data through the target question model, and determining the content category of the original question data according to the identification result so as to obtain a target question label corresponding to the original question data.

In step S105 of some embodiments, the target question label is used to indicate whether the content category of the original question data in the candidate question-answer data is a question category or a non-question category, so the candidate question-answer data including the original question data can be further screened according to the target question label to obtain the desired target question-answer data. The desired target question-answer data may be candidate question-answer data in which the content categories of all the included original question data are the question category; alternatively, it may be candidate question-answer data in which the ratio of original question data of the question category to original question data of the non-question category equals a preset ratio, which is not particularly limited in this embodiment of the present application.

Referring to FIG. 6, in some embodiments, the target question label includes a question positive label that is used to indicate that the content category of the original question data is a question category. Step S105 includes, but is not limited to, steps S601 to S603.

Step S601, screening target question data from the original question data according to the question positive label;

Step S602, acquiring the data amount of the target question data in the candidate question-answer data to obtain a target data amount;

Step S603, if the target data amount is smaller than a second preset data amount, taking the candidate question-answer data as target question-answer data.

In step S601 of some embodiments, the target question label includes a question positive label and a question negative label. The question positive label is used for indicating that the content category of the original question data is a question category, and the question negative label is used for indicating that the content category of the original question data is a non-question category. The plurality of original question data are screened according to the question positive label, so that the original question data carrying the question positive label is taken as target question data.

In step S602 of some embodiments, the data amount of the target question data in the candidate question-answer data is acquired, that is, the amount of question-category question data generated by the question object in the candidate question-answer data is determined, so as to obtain the target data amount.

In step S603 of some embodiments, a second preset data amount with a small value is set, for example, 2. If the target data amount is smaller than the second preset data amount, the corresponding candidate question-answer data includes only one target question data. In this case, all the answer data in the candidate question-answer data is data answering that target question data. Therefore, candidate question-answer data whose target data amount is smaller than the second preset data amount is taken as target question-answer data. This avoids having to segment the data content again in order to determine the mapping relationship between answer data and question data in the candidate question-answer data, that is, to determine which answer data answers one question data and which answer data answers another question data in the candidate question-answer data.
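Steps S601 to S603 can be sketched as below. The `classify` callable stands in for the target problem model and returns True for the question positive label; the trailing-question-mark rule used in the example is only a placeholder assumption, as is the `(role, text)` representation.

```python
def select_target_units(candidates, classify, second_preset_amount=2):
    """Keep candidate units containing fewer target questions than the limit.

    candidates: list of units, each a list of (role, text) pairs.
    classify:   callable returning True for the question positive label.
    """
    targets = []
    for unit in candidates:
        # S601: original question data carrying the question positive label.
        target_questions = [
            text for role, text in unit if role == "question" and classify(text)
        ]
        # S602/S603: compare the target data amount with the second preset amount.
        if len(target_questions) < second_preset_amount:
            targets.append(unit)
    return targets


def toy_classifier(text):
    """Placeholder for the target problem model (assumption, not the real model)."""
    return text.strip().endswith("?")


single = [
    ("question", "My name is Li."),
    ("question", "What does plan B cover?"),
    ("answer", "Plan B covers hospital stays."),
]
double = [
    ("question", "What does plan B cover?"),
    ("question", "How much does it cost?"),
    ("answer", "It covers hospital stays and costs 100 per month."),
]
target_units = select_target_units([single, double], toy_classifier)
```

Followed literally, the "smaller than" comparison also admits units with no target question data at all; an implementation that needs exactly one question per unit may want to add a lower bound as well.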

In step S106 of some embodiments, the target question-answer data obtained by screening is integrated to construct a question-answer corpus. Specifically, clustering may be performed on a plurality of target question-answer data so that target question-answer data whose original question data has similar semantics is clustered into one question-answer corpus. For example, clustering may yield one question-answer corpus whose original question data concerns consultation about B insurance and another whose original question data concerns consultation about C insurance. This improves the processing efficiency of subsequently answering the data to be answered according to the question-answer corpus.
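The embodiment does not name a clustering algorithm. As one possible illustration, the sketch below uses a greedy single-pass clustering over token-level Jaccard similarity; the similarity measure, the threshold of 0.3 and the `(question, answer)` representation are all assumptions made for the example.

```python
import re


def tokens(text):
    """Lowercase word tokens used as a crude proxy for question semantics."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(qa_pairs, threshold=0.3):
    """Assign each (question, answer) pair to the first corpus whose seed question is similar."""
    corpora = []  # each corpus: {"seed": token set, "items": [(question, answer), ...]}
    for question, answer in qa_pairs:
        tq = tokens(question)
        for corpus in corpora:
            if jaccard(tq, corpus["seed"]) >= threshold:
                corpus["items"].append((question, answer))
                break
        else:
            # No existing corpus is similar enough, so start a new one.
            corpora.append({"seed": tq, "items": [(question, answer)]})
    return corpora


qa_pairs = [
    ("What does B insurance cover", "B insurance covers hospital stays."),
    ("What does B insurance cost", "B insurance costs 100 per month."),
    ("How do I buy C insurance", "C insurance can be bought in the app."),
]
corpora = cluster(qa_pairs)
```

With real semantic vectors, a standard clustering method over those vectors could replace the greedy pass without changing how the resulting corpora are used.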

According to the method for constructing a question-answer corpus provided by the embodiment of the application, content identification is performed on the original question data through the target problem model to obtain the corresponding target question label. The candidate question-answer data is screened according to the target question label to obtain target question-answer data, and a question-answer corpus is constructed according to the target question-answer data. This avoids constructing the question-answer corpus by manual labeling as in the related art; because the corpus is constructed from data output by the target problem model, the construction efficiency of the question-answer corpus can be improved. In addition, before content identification is performed according to the target problem model, data segmentation processing is performed according to the dialogue identification data and data amount screening is performed on the initial question-answer data according to the first preset data amount, which reduces the amount of data input to the target problem model and ensures the quality of that data, thereby improving the construction efficiency of the question-answer corpus and ensuring its construction accuracy. When the application is applied to insurance scenarios in financial technology, the construction efficiency and construction accuracy of the insurance-scenario question-answer corpus can be improved.

Referring to fig. 7, an embodiment of the present application also provides a question-answering method including, but not limited to, steps S701 to S704.

Step S701, acquiring data to be answered;

Step S702, performing semantic matching between the data to be answered and target question-answer data in a preset question-answer corpus to obtain a third matching result; the question-answer corpus is constructed according to the above method for constructing a question-answer corpus;

Step S703, taking the target question-answer data for which the third matching result indicates a semantic match as key question-answer data; wherein the key question-answer data includes key reply data;

Step S704, performing reply processing according to the key reply data to obtain answer data for the data to be answered.

In step S701 of some embodiments, the data to be answered is acquired through a relevant API. It can be understood that the data to be answered is question data generated by the question object that is awaiting an answer, and it can be acquired in the same manner as the historical question-answer data, which is not repeated in this embodiment of the application. Similarly, like the historical question-answer data, the data type of the data to be answered may be any one of text, image and audio, which is not particularly limited in this embodiment of the application.

In step S702 of some embodiments, semantic matching is performed between the data to be answered and the target question-answer data obtained by the method for constructing a question-answer corpus described in any of the above embodiments, so as to obtain a corresponding third matching result. It can be understood that when the constructed question-answer corpus includes target question-answer data with various semantic contents, for example, target question-answer data about insurance purchase inquiries, target question-answer data about train ticket inquiries, target question-answer data about airline ticket inquiries, and the like, the data to be answered needs to be semantically matched against each target question-answer data in the question-answer corpus to determine the semantic content of the data to be answered. In addition, when a plurality of question-answer corpora are constructed by clustering a plurality of target question-answer data, a library label characterizing the semantic content of all the target question-answer data in a question-answer corpus can be constructed for each question-answer corpus. For example, a first question-answer corpus whose library label is "insurance" and a second question-answer corpus whose library label is "ticket" are constructed. In this case, the data to be answered first only needs to be semantically matched against the library labels, which improves the efficiency of semantic matching.

In steps S703 to S704 of some embodiments, the target question-answer data for which the third matching result indicates a semantic match is taken as key question-answer data, and the answer data in the key question-answer data is taken as key reply data. Corresponding reply processing is performed according to the key reply data and the actual application scenario. For example, when the data to be answered is data entered by the question object in a terminal application interface, the key reply data is taken as the answer data, and the answer data is displayed in the same application interface. When the data to be answered is data acquired by a voice assistant, the key reply data is converted into audio data, and the converted audio data is taken as the answer data.
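The question-answering flow of steps S701 to S704, including the optional library-label shortcut, can be sketched as follows. Token-level Jaccard similarity again stands in for semantic matching, and the threshold, the corpus contents and the dictionary layout (library label mapped to a list of question and answer pairs) are illustrative assumptions.

```python
import re


def tokens(text):
    """Lowercase word tokens used as a crude proxy for semantics."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def answer(query, corpora, threshold=0.2):
    """Return key reply data for the query, or None when nothing matches.

    corpora maps a library label to a list of (question, answer) pairs.
    """
    tq = tokens(query)
    # Match against library labels first to narrow the search to one corpus.
    label = max(corpora, key=lambda lab: jaccard(tq, tokens(lab)))
    # S702/S703: pick the best-matching target question-answer data.
    best_q, best_a = max(corpora[label], key=lambda qa: jaccard(tq, tokens(qa[0])))
    if jaccard(tq, tokens(best_q)) < threshold:
        return None
    # S704: here the key reply data is returned directly as the answer data.
    return best_a


corpora = {
    "insurance": [
        ("What does B insurance cover", "B insurance covers hospital stays."),
        ("How much does B insurance cost", "B insurance costs 100 per month."),
    ],
    "ticket": [
        ("When does the train ticket sale open", "Ticket sales open at 8 am."),
    ],
}
reply = answer("What does the B insurance cover?", corpora)
```

For a voice assistant scenario, the returned text would additionally be passed to a text-to-speech component before being delivered as the answer data.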

The embodiment of the application carries out reply processing through the question-answer corpus described in the embodiment, and can improve the accuracy of reply processing because the question-answer corpus constructed according to the embodiment has higher accuracy.

Referring to fig. 8, an embodiment of the present application further provides a device for constructing a question-answer corpus, which may implement the method for constructing a question-answer corpus, where the device includes:

a data acquisition module 801, configured to acquire historical dialogue data; the historical dialogue data comprises historical question-answer data and dialogue identification data; the dialogue identification data is used for identifying question-answer information of the historical question-answer data;

a data segmentation module 802, configured to perform data segmentation processing on the historical question-answer data according to the dialogue identification data to obtain initial question-answer data;

a first data screening module 803, configured to perform data amount screening on the initial question-answer data according to a first preset data amount to obtain candidate question-answer data; the candidate question-answer data comprises original question data;

a content recognition module 804, configured to perform content identification on the original question data according to a preset target problem model to obtain a target question label of the original question data; the target question label is used for indicating that the content category of the original question data is a question category or a non-question category;

a second data screening module 805, configured to screen the candidate question-answer data according to the target question label to obtain target question-answer data;

a question-answer corpus construction module 806, configured to construct a question-answer corpus according to the target question-answer data.
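One way to picture how modules 801 to 806 cooperate is as six pluggable stages chained in order. The sketch below is only an illustration under assumptions: the class name `CorpusBuilder`, the callable signatures and the toy stage implementations are hypothetical and do not describe the device's actual software structure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CorpusBuilder:
    acquire: Callable[[], list]                         # module 801
    segment: Callable[[list], List[list]]               # module 802
    screen_amount: Callable[[List[list]], List[list]]   # module 803
    label_question: Callable[[str], bool]               # module 804
    screen_label: Callable[[List[list], Callable[[str], bool]], List[list]]  # module 805
    build: Callable[[List[list]], dict]                 # module 806

    def run(self):
        history = self.acquire()
        initial = self.segment(history)
        candidates = self.screen_amount(initial)
        targets = self.screen_label(candidates, self.label_question)
        return self.build(targets)


def segment_by_session(history):
    """Group (session_id, role, text) records by session ID."""
    groups = {}
    for session_id, role, text in history:
        groups.setdefault(session_id, []).append((role, text))
    return list(groups.values())


def count_positive(unit, classify):
    """Number of question utterances that the classifier labels as questions."""
    return sum(1 for role, text in unit if role == "question" and classify(text))


builder = CorpusBuilder(
    acquire=lambda: [
        ("s1", "question", "What does plan B cover?"),
        ("s1", "answer", "Hospital stays."),
        ("s2", "question", "Hi"),
        ("s2", "question", "Hello?"),
        ("s2", "question", "Anyone?"),
    ],
    segment=segment_by_session,
    screen_amount=lambda units: [u for u in units if len(u) <= 2],
    label_question=lambda text: text.endswith("?"),
    screen_label=lambda units, clf: [u for u in units if count_positive(u, clf) < 2],
    build=lambda units: {"corpus": units},
)
corpus = builder.run()
```

Keeping each stage behind a callable makes it straightforward to swap, for instance, the placeholder classifier for a trained target problem model without touching the other stages.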

The specific implementation of the construction device of the question-answer corpus is basically the same as the specific embodiment of the construction method of the question-answer corpus, and is not described herein.

The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the construction method of the question-answer corpus when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.

Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device of another embodiment, the electronic device including:

the processor 901 may be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided by the embodiments of the present application;

the memory 902 may be implemented in the form of read-only memory (ROM), a static storage device, a dynamic storage device, or random access memory (RAM). The memory 902 may store an operating system and other application programs; when the technical solution provided in the embodiments of the present application is implemented by software or firmware, the relevant program code is stored in the memory 902 and is invoked by the processor 901 to execute the method for constructing the question-answer corpus of the embodiments of the present application;

an input/output interface 903 for inputting and outputting information;

the communication interface 904 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.) or in a wireless manner (e.g. mobile network, Wi-Fi, Bluetooth, etc.);

A bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);

wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.

The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program which realizes the construction method of the question-answer corpus when being executed by a processor.

The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.

It will be appreciated by persons skilled in the art that the embodiments of the application are not limited by the illustrations, and that more or fewer steps than those shown may be included, or certain steps may be combined, or different steps may be included.

The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.

The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may represent: only A is present, only B is present, or both A and B are present, wherein A and B may each be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or a similar expression means any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, and c may each be single or plural.
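As an illustration only, the seven cases covered by "at least one of a, b, or c" are the non-empty combinations of the three items. The following Python sketch enumerates them; it is not part of the described method, and the function name `at_least_one_of` is a hypothetical helper introduced here for illustration.

```python
from itertools import combinations


def at_least_one_of(items):
    """Return every non-empty combination of items, smallest combinations first.

    Hypothetical helper for illustration; not part of the application.
    """
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


# Enumerate the cases for the three items named in the paragraph above.
selections = at_least_one_of(["a", "b", "c"])
labels = [" and ".join(combo) for combo in selections]
print(labels)
```

For n items the enumeration yields 2^n - 1 cases, which gives the seven cases listed above for a, b, and c.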

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of units described above is merely a division by logical function, and other divisions are possible in actual implementation. For instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be electrical, mechanical, or in another form.

The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes a plurality of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive (U-disk), a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing a program.

The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and this description does not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims

1. A method for constructing a question-answer corpus, the method comprising:

2. The method of claim 1, wherein the target question label comprises a question positive label for indicating that a content category of the original question data is a question category;

3. The method of claim 1, wherein the performing data amount screening on the initial question-answer data according to a first preset data amount to obtain candidate question-answer data includes:

4. The method of claim 3, wherein the key question-answer data comprises key question data;

5. The method according to any one of claims 1 to 4, wherein before said content recognition of said original question data according to a preset target question model, obtaining a target question label of said original question data, said method further comprises training said target question model, in particular comprising:

6. The method of claim 5, wherein the obtaining sample question data comprises:

acquiring training questioning data;

7. A question-answering method, the method comprising:

obtaining data to be answered;

performing semantic matching between the data to be answered and target question-answer data in a preset question-answer corpus to obtain a third matching result; wherein the question-answer corpus is constructed by the method according to any one of claims 1 to 6;

8. A device for constructing a question-answer corpus, the device comprising:

9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the method of any one of claims 1 to 6 or the method of claim 7 when executing the computer program.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method of any one of claims 1 to 6 or the method of claim 7.