CN112380836A

CN112380836A - Intelligent Chinese message question generating method

Info

Publication number: CN112380836A
Application number: CN202011261252.1A
Authority: CN
Inventors: 王华珍; 孙雨洁; 何霆; 李弼程; 缑锦
Original assignee: Huaqiao University
Current assignee: Huaqiao University
Priority date: 2020-11-12
Filing date: 2020-11-12
Publication date: 2021-02-19

Abstract

The invention discloses an intelligent method for generating Qiaoqing (overseas Chinese affairs) questions, comprising the following steps. S1: obtain Qiaoqing-related question-answer pairs using crawler technology, and generate a triple corpus usable for model training through manual processing and triple extraction. S2: adopt a seq2seq-based template learning algorithm and train a template question generation model M that generates template questions based on relation and subject; then replace the subject text in the template question to obtain the final generated question q_r. S3: through the interface of the intelligent Qiaoqing question generation system, receive the parameters required by the server, run the model, and return a structured result. The template learning algorithm uses an LSTM deep learning model to learn general question templates and can learn the question generation mechanism at the semantic level, so that the generated questions are better formed; the method has theoretical significance and practical value.

Description

Intelligent Chinese message question generating method

Technical Field

The invention relates to the technical field of natural language processing, and in particular to an intelligent method for generating Qiaoqing (overseas Chinese affairs) questions.

Background

Among the various question generation methods, template matching is a classical and mature one: once question templates for the relevant subjects have been established, questions can be generated quickly and in batches for the named entities of a given field. However, the questions generated this way tend to share a single sentence pattern and fall into a few broad categories, so it is difficult to vary the angle of questioning or to produce more complex questions; the language lacks diversity, and misleading information may sometimes be generated. In the Qiaoqing question-answering field in particular, generating questions with a specified subject and relation from a rich and domain-specific Qiaoqing corpus is even more complicated.

Disclosure of Invention

The main purpose of the invention is to overcome the rigid sentence patterns and poor diversity of the questions generated by the traditional template matching method, and to provide an intelligent Qiaoqing question generation method.

The invention adopts the following technical scheme:

An intelligent Qiaoqing question generation method comprises the following steps:

S1: obtain Qiaoqing-related question-answer pairs using crawler technology, and generate a triple corpus usable for model training through manual processing and triple extraction;

S2: adopt a seq2seq-based template learning algorithm, train a template question generation model M that generates template questions based on relation and subject, and then replace the subject text in the template question to obtain the final generated question q_r;

S3: through the interface of the intelligent Qiaoqing question generation system, receive the parameters required by the server, run the model, and return a structured result.

Specifically, step S1 further comprises the following steps:

S11: using 'Chinese' or 'qiangbi' or 'waduo' as keywords, crawl web pages with crawler technology to obtain Qiaoqing question-answer pairs;

S12: manually screen out the usable question-answer corpus B_QA = {Q, A}, where Q represents the question set and A represents the answer set;

S13: using dependency parsing, extract triples (T, R, O) from the question-answer corpus B_QA, where T represents the subject entity, R represents the relation, and O represents the object; the extracted triples together form the triple corpus.
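As a rough illustration of step S13, the sketch below pulls a (T, R, O) triple out of a sentence that has already been dependency-parsed. The source does not name a parser or a label set, so the `Token` structure, the function name `extract_triple`, and the labels `HED`, `SBV` and `VOB` (in the style of common Chinese dependency parsers) are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    word: str    # surface form of the token
    head: int    # 1-based index of the governing token, 0 for the root
    deprel: str  # dependency label (assumed label set: HED, SBV, VOB, ...)


def extract_triple(tokens: List[Token]) -> Optional[Tuple[str, str, str]]:
    """Return (subject, predicate, object) if the sentence has all three."""
    # Locate the root predicate of the sentence.
    root = next((i for i, t in enumerate(tokens, start=1) if t.deprel == "HED"), None)
    if root is None:
        return None
    # The subject and object are the root's SBV and VOB dependents.
    subject = next((t.word for t in tokens if t.head == root and t.deprel == "SBV"), None)
    obj = next((t.word for t in tokens if t.head == root and t.deprel == "VOB"), None)
    if subject is None or obj is None:
        return None
    return subject, tokens[root - 1].word, obj


# Hypothetical pre-parsed sentence, shown with English glosses.
parsed = [
    Token("Huaqiao University", 2, "SBV"),
    Token("is located in", 0, "HED"),
    Token("Quanzhou", 2, "VOB"),
]
triple = extract_triple(parsed)
```

Sentences that lack a subject or an object yield `None` here, which matches the later requirement that only questions with a complete subject, predicate and object are kept.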

Specifically, step S2 further comprises the following steps:

S21: based on the obtained relation set R = {r₁, r₂, …, r_n}, where n denotes the total number of relations, obtain for a given relation r_i the set Q′ of questions in Q related to it; select from Q′ the question set Q″ whose questions have a complete subject, predicate and object, and then select the optimal question q_i from Q″;

S22: replace the subject text of the optimal question q_i with the subject template tag SUB, thereby forming the template question q_i′;

S23: construct from the relation set R the set of input-output data pairs Q_train = {q_train = [(SUB, SEP, r_i), q_i′]} (i = 1, 2, …, n), where SEP is a separator, the input data is (SUB, SEP, r_i), and the output data is q_i′;

S24: use the obtained Q_train as the training set of the seq2seq model and train it to obtain the model M, which maps the input data (SUB, SEP, r) formed from an arbitrary relation r to the corresponding template question q′;

S25: replace the SUB tag in the template question q′ with the subject t to obtain the final generated question q_r.
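The data handling in steps S23 and S25 can be sketched in a few lines. The relation names, the example templates, and the function names `build_training_pairs` and `fill_template` are hypothetical; only the (SUB, SEP, r_i) → q_i′ pairing and the SUB replacement come from the text above.

```python
from typing import Dict, List, Tuple

SUB = "SUB"  # subject placeholder tag used in template questions
SEP = "SEP"  # separator between the placeholder and the relation

TrainingPair = Tuple[Tuple[str, str, str], str]


def build_training_pairs(templates: Dict[str, str]) -> List[TrainingPair]:
    """Step S23: build Q_train, one ((SUB, SEP, r_i), q_i') pair per relation."""
    return [((SUB, SEP, relation), template) for relation, template in templates.items()]


def fill_template(template: str, subject: str) -> str:
    """Step S25: replace the SUB tag with the concrete subject t."""
    return template.replace(SUB, subject)


# Hypothetical template questions q_i', keyed by relation r_i.
templates = {
    "founder": "Who is the founder of SUB?",
    "location": "Where is SUB located?",
}
q_train = build_training_pairs(templates)
q_r = fill_template(templates["location"], "Huaqiao University")
```

A plain string replacement like this would also rewrite any other occurrence of the letters "SUB" in a template, so a real system would probably want a tag that cannot collide with ordinary text.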

Specifically, step S3 further comprises the following steps:

S31: receive through the interface the first two elements t and r of the fact triple, where t is the subject of the question the user wants to obtain and r is the relation the user wants to ask about with respect to t;

S32: model processing: the input data (SUB, SEP, r) formed from the parameter r is fed into the model M to obtain the output template question q′, and the SUB tag in q′ is replaced with the subject t to obtain the final generated question q_r;

S33: structure the output question q_r using a JSON interface.

Specifically, extracting triples from the question-answer corpus B_QA comprises:

the subject vocabulary constitutes T in the triple;

the predicate vocabulary constitutes R in the triple;

the object vocabulary constitutes O in the triple.

Specifically, feeding the input data (SUB, SEP, r) formed from the parameter r into the model M further comprises: before the data is input into the model M, generating a vector u_i by one-hot encoding, u_i ∈ R^|D|, where R is the set of real numbers and D is the dictionary generated from the subjects and relations in the obtained Qiaoqing question-answer pairs.
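A minimal sketch of the one-hot step, assuming a dictionary D built in first-seen order from a small hypothetical vocabulary; the function names and the example tokens are not from the source.

```python
from typing import Dict, List


def build_dictionary(tokens: List[str]) -> Dict[str, int]:
    """Map each distinct token to an index, in first-seen order."""
    dictionary: Dict[str, int] = {}
    for token in tokens:
        dictionary.setdefault(token, len(dictionary))
    return dictionary


def one_hot(token: str, dictionary: Dict[str, int]) -> List[float]:
    """Return u_i in R^|D|: all zeros except a 1 at the token's index."""
    vector = [0.0] * len(dictionary)
    vector[dictionary[token]] = 1.0
    return vector


# Hypothetical dictionary D: the two special tags plus two relations.
D = build_dictionary(["SUB", "SEP", "founder", "location"])
encoded = [one_hot(tok, D) for tok in ("SUB", "SEP", "location")]
```

Each input triple (SUB, SEP, r) thus becomes a sequence of three vectors of length |D|, one per element.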

As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:

(1) On the premise that a corpus as rich and domain-specific as possible has been collected and preprocessed, the method builds, from the fact triples extracted from the question-answer corpus, a set of training data pairs based on the subjects and relations in the triples and uses it as the model input; template questions based on relation and subject are generated, the subject text of each generated template question is then replaced, and the target question is finally obtained;

(2) The LSTM (long short-term memory network) used in the invention is deployed within a seq2seq framework based on an Encoder-Decoder structure. Compared with a traditional RNN, the LSTM adds a structure that judges whether input information is useful, can selectively filter and forget information, and performs well in processing and predicting important events separated by long intervals in a time sequence;

(3) Starting from the limitations of the traditional template matching method, the invention combines the template method with a seq2seq model, proposes a seq2seq-based template learning algorithm, and realizes automatic question generation. The method improves on the current practice of generating questions by manual editing and overcomes the rigid sentence patterns and poor diversity of the questions generated by template matching. In actual use of the Qiaoqing question generation system, the target question can be obtained quickly simply by calling the corresponding interface and passing the subject and relation of the target question as parameters, which is accurate and convenient.
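The gating structure mentioned in advantage (2) is not written out in the source. For reference, the commonly used LSTM cell formulation is given below; the symbols (input x_t, hidden state h_t, cell state c_t, weights W, U and biases b) are the conventional ones and are an assumption here, and the variant actually used by the inventors may differ in detail.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate f_t is what lets the cell discard information judged not useful, while the additive update of c_t helps information survive across long intervals.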

Drawings

FIG. 1 is a diagram of the project framework of the intelligent Qiaoqing question generation system according to an embodiment of the present invention;

FIG. 2 shows the question-answer pairs obtained from Baidu Zhidao web pages using crawler technology in an embodiment of the present invention;

FIG. 3 is a diagram of a question-answer pair result generated using manual editing in an embodiment of the present invention;

FIG. 4 is a diagram illustrating the results of the data of the triplet set after structured specification according to an embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating a process of extracting an optimal question and forming a template question based on a question-answer relationship set according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating the Encoder-Decoder sequence mapping process according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating a result of a one-hot encoding dictionary generated in an embodiment of the present invention;

FIG. 8 is a diagram illustrating an example of a sequence set of an input Encoder according to an embodiment of the present invention;

FIG. 9 is a diagram of an example sequence set for training a Decoder according to an embodiment of the present invention;

FIG. 10 is a diagram illustrating the model prediction generation in an embodiment of the present invention;

FIG. 11 shows the result of the HTTP interface test according to an embodiment of the present invention.

The invention is described in further detail below with reference to the figures and specific examples.

Detailed Description

The invention is further described below by means of specific embodiments.

The technical solutions in the embodiments of the present invention will be described in detail below with reference to the drawings in the embodiments of the present invention, and the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments.

The invention provides an intelligent Qiaoqing question generation system and method; the steps of the construction method are shown in FIG. 1:

First, crawl the web pages of the Baidu Zhidao platform using crawler technology to obtain Qiaoqing-related question-answer pairs, and generate a triple corpus for model training through manual processing and triple extraction;

Second, adopt a seq2seq-based template learning algorithm and train a template question generation model M that generates template questions based on relation and subject; then replace the subject text in the template question to obtain the final generated question q_r;

Finally, develop an HTTP interface for the intelligent Qiaoqing question generation system that receives the parameters required by the server and returns a structured result within a short time.

Specifically, in the step of obtaining the initial corpus, the web pages of the Baidu Zhidao platform are first crawled using crawler technology, with "Chinese" or "qiaoneng" or "wading" as keywords, to obtain question-answer pairs; the result is shown in FIG. 2.

Second, the obtained question-answer pairs are processed by manual editing to obtain the question-answer corpus B_QA = {Q, A}, where Q represents the question set and A represents the answer set. Manual editing copes effectively with the flexible grammar and complex structure of Chinese, is highly targeted and precise, and performs well in both automatic and manual evaluation. For some high-quality declarative corpora, high-quality question-answer pairs can be generated in this way; the result is shown in FIG. 3.

Finally, using dependency parsing, triples (T, R, O) are extracted from the question-answer corpus B_QA, where T represents the subject entity, R represents the relation, and O represents the object. The extracted triples together form the triple corpus; the result after structured normalization is shown in FIG. 4.

From the triples extracted in the above steps, the Qiaoqing question-answer relation set R = {r₁, r₂, …, r_n} is obtained, where n denotes the total number of relations. For a given relation r_i, the set Q′ of questions in Q related to it is obtained. The question set Q″, whose questions have a complete subject, predicate and object, is selected from Q′, and the question q_i in Q″ whose predicate is the relation r_i and which has the fewest words is taken as the optimal question. The subject text of q_i is replaced with the subject template tag SUB, thereby forming the template question q_i′. The flow of the above steps is shown in FIG. 5.
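The selection and templating flow of FIG. 5 might look roughly like the sketch below. The `ParsedQuestion` structure and the function name `select_template` are hypothetical, and the "optimal" criterion is assumed here to be the shortest complete question (measured in characters), which is one reading of the text rather than a confirmed detail.

```python
from typing import List, NamedTuple, Optional


class ParsedQuestion(NamedTuple):
    text: str                 # the question as written
    subject: Optional[str]    # subject text, or None if missing
    predicate: Optional[str]  # predicate (relation), or None if missing
    obj: Optional[str]        # object text, or None if missing


def select_template(questions: List[ParsedQuestion], relation: str) -> Optional[str]:
    """Pick the optimal question for `relation` and turn it into a template."""
    # Q'': questions with a complete subject, predicate and object.
    complete = [q for q in questions if q.subject and q.predicate and q.obj]
    # Keep only those whose predicate is the relation r_i.
    candidates = [q for q in complete if q.predicate == relation]
    if not candidates:
        return None
    # Assumed criterion: the shortest candidate is the optimal question q_i.
    best = min(candidates, key=lambda q: len(q.text))
    # Replace the subject text with the SUB tag to form q_i'.
    return best.text.replace(best.subject, "SUB", 1)


# Hypothetical questions; the last one is short but incomplete, so it is skipped.
questions = [
    ParsedQuestion("In which city of Fujian Province is Huaqiao University located in?",
                   "Huaqiao University", "located in", "which city"),
    ParsedQuestion("Which city is Huaqiao University located in?",
                   "Huaqiao University", "located in", "Which city"),
    ParsedQuestion("Where is Huaqiao University?", "Huaqiao University", None, None),
]
template = select_template(questions, "located in")
```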

From the Qiaoqing relation set R, the set of input-output data pairs Q_train = {q_train = [(SUB, SEP, r_i), q_i′]} (i = 1, 2, …, n) is constructed, where SEP is a separator, the input data is (SUB, SEP, r_i), and the output data is q_i′. The obtained Q_train is used as the training set of the seq2seq model, and training yields the model M, which can map the input data (SUB, SEP, r) formed from an arbitrary relation r to the corresponding template question q′. The Encoder-Decoder sequence mapping is shown in FIG. 6.

The model input consists of numeric sequences produced by one-hot encoding; the one-hot encoding dictionary is generated from the subjects and relations in the obtained Qiaoqing question-answer pairs, and the result is shown in FIG. 7.

The sequence set of the input Encoder after encoding is shown in fig. 8. The sequence set used to train the Decoder after encoding is shown in fig. 9.
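The figures themselves are not reproduced here, so the sketch below only illustrates one common way such Encoder and Decoder sequences are prepared for seq2seq training: the Decoder input is the target shifted by one position (teacher forcing). Whether the inventors used exactly this scheme is an assumption, as are the `<s>` and `</s>` markers and the example tokens.

```python
from typing import List, Tuple

START, END = "<s>", "</s>"  # hypothetical start and end markers


def decoder_sequences(template_tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Return (decoder input, decoder target), offset by one position."""
    decoder_input = [START] + template_tokens
    decoder_target = template_tokens + [END]
    return decoder_input, decoder_target


# Encoder side: the input triple (SUB, SEP, r) for a hypothetical relation.
encoder_tokens = ["SUB", "SEP", "location"]
# Decoder side: the tokens of the corresponding template question q'.
dec_in, dec_out = decoder_sequences(["Where", "is", "SUB", "located", "?"])
```

At each training step the Decoder sees `dec_in[k]` and is asked to predict `dec_out[k]`, so the two sequences have the same length.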

A simple test result of the model is shown in FIG. 10. It can be seen that the proposed method generates Qiaoqing questions for a specified subject and relation from the subject and relation in a triple, and that the resulting questions are more diverse.

When calling the HTTP interface of the intelligent Qiaoqing question generation system, the user inputs the subject of the desired question and the relation to be asked about that subject. The interface uses JSON to structure the transmitted data. In testing, the interface worked normally: after receiving the subject and relation parameters sent by POST, it returned the corresponding result in JSON format. The test results are shown in FIG. 11.
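The request handling behind such an interface can be sketched without a web framework. The JSON field names `subject`, `relation` and `question`, the function `handle_request`, and the lookup-table stand-in for model M are all assumptions for illustration; the source only states that the subject and relation arrive by POST and that the result is returned as JSON.

```python
import json
from typing import Callable


def handle_request(body: str, generate_template: Callable[[str], str]) -> str:
    """Turn a JSON request carrying t and r into a JSON reply carrying q_r."""
    params = json.loads(body)
    subject, relation = params["subject"], params["relation"]
    # Model M maps the input (SUB, SEP, r) to a template question q'.
    template = generate_template(relation)
    # Replace the SUB tag with the subject t to obtain q_r.
    question = template.replace("SUB", subject)
    reply = {"subject": subject, "relation": relation, "question": question}
    return json.dumps(reply, ensure_ascii=False)


def fake_model(relation: str) -> str:
    """Stand-in for model M: a lookup table, not a trained seq2seq model."""
    return {"location": "Where is SUB located?"}[relation]


request_body = '{"subject": "Huaqiao University", "relation": "location"}'
reply = handle_request(request_body, fake_model)
```

Passing the model in as a callable keeps the handler independent of how M is trained or served; `ensure_ascii=False` keeps Chinese text readable in the reply.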

The above is only a specific embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification of the present invention made using this design concept shall be regarded as infringing the protection scope of the present invention.

Claims

1. An intelligent Qiaoqing question generation method, characterized by comprising the following steps:

2. The method of claim 1, wherein the step S1 further comprises the following steps:

Is a set of

3. The method of claim 2, wherein the step S2 further comprises the following steps:

S21: based on the obtained relation set R = {r₁, r₂, …, r_n}, where n denotes the total number of relations, obtain for a given relation r_i the set Q′ of questions in Q related to it; select from Q′ the question set Q″ whose questions have a complete subject, predicate and object, and then select the optimal question q_i from Q″;

S23: construct from the relation set R the set of input-output data pairs Q_train = {q_train = [(SUB, SEP, r_i), q_i′]} (i = 1, 2, …, n), where SEP is a separator, the input data is (SUB, SEP, r_i), and the output data is q_i′;

4. The method of claim 3, wherein the step S3 further comprises the following steps:

S33: structure the output question q_r using a JSON interface.

5. The method of claim 2, wherein extracting triples from the question-answer corpus B_QA comprises:

6. The method of claim 2, wherein extracting triples from the question-answer corpus B_QA comprises:

the predicate vocabulary constitutes R in the triple.

7. The method of claim 2, wherein extracting triples from the question-answer corpus B_QA comprises:

8. The method of claim 4, wherein feeding the input data (SUB, SEP, r) formed from the parameter r into the model M further comprises: before the data is input into the model M, generating a vector u_i by one-hot encoding, u_i ∈ R^|D|, where R is the set of real numbers and D is the dictionary generated from the subjects and relations in the obtained Qiaoqing question-answer pairs.