CN108038234B

CN108038234B - Automatic question template generating method and device

Info

Publication number: CN108038234B
Application number: CN201711436114.0A
Authority: CN
Inventors: 邹辉
Original assignee: Zhongan Information Technology Service Co Ltd
Current assignee: Shanghai Zhongan Information Technology Service Co ltd
Priority date: 2017-12-26
Filing date: 2017-12-26
Publication date: 2021-06-15
Anticipated expiration: 2037-12-26
Also published as: CN108038234A

Abstract

The invention discloses a method and a device for automatically generating a question template, belonging to the technical field of intelligent question answering. The method comprises the following steps: preparing a question log corpus; performing word segmentation and part-of-speech tagging on the log corpus; performing named entity recognition and replacement; performing semantic replacement; and performing frequent item set mining to generate a question template. The method and the device not only improve the efficiency of question template generation and greatly save manual resources, but can also evaluate the generated question templates, autonomously and continuously expand the question template library, and improve the quality of the knowledge base of the intelligent question-answering system.

Description

Automatic question template generating method and device

Technical Field

The invention relates to the technical field of intelligent question answering, in particular to a method and a device for automatically generating a question template.

Background

Currently, more and more enterprises handle large volumes of after-sales service and pre-sales consultation for their users. Because the number of users grows exponentially, answering every consultation manually consumes a large amount of human resources. Moreover, the knowledge points involved are relatively concentrated, so manual answering usually involves a great deal of repeated labor. Intelligent question-answering systems have emerged in response: they automatically answer questions entered by users and greatly improve efficiency.

The intelligent question-answering system is technically based on question template matching, knowledge base retrieval and the like. A question template is a specific symbol label sequence formed after recognition and replacement are applied to a question, and a corresponding answer is attached to the question template; when an incoming question is identical or highly similar to a template, the template matching technique can match it and return the answer. The difficulty of question template matching lies in how to generate question templates efficiently and sustainably. Traditional question template generation requires templates to be set manually for specific sentence patterns, which is tedious and gives poor coverage; when the knowledge base is updated, new templates not covered by the template library must also be set and evaluated manually, so maintainability and self-learning capability are poor. Among the related patents disclosed so far, no technical solution addressing these problems has been found. For example, patent application 201611076382.1 (a method and an apparatus for automatic question-answering template matching) obtains the matching question for a question to be answered by determining, for each segmented word of that question, the corresponding subset of the template question set, which improves the template matching efficiency and accuracy of an automatic question-answering system, but it does not address automatic template generation or the quality evaluation of generated templates.

Disclosure of Invention

In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for automatically generating a question template. The technical scheme is as follows:

in a first aspect, a method for automatically generating a question template is provided, where the method includes:

preparing a question log corpus; performing word segmentation and part-of-speech tagging on the log corpus; performing named entity recognition and replacement; performing semantic replacement; and performing frequent item set mining to generate a question template.

With reference to the first aspect, in a first possible implementation manner, preparing a question log corpus includes:

obtaining a question log corpus, and preprocessing the question log corpus, including punctuation removal, illegal symbol removal and word case conversion.

With reference to the first aspect, in a second possible implementation manner, performing word segmentation and part-of-speech tagging on the log corpus includes:

performing word segmentation on the log corpus using a word segmentation method combined with an industry dictionary.

With reference to the first aspect, in a third possible implementation manner, the named entity identifying and replacing includes:

performing named entity recognition on generic entities appearing in the question log corpus, including time, numbers and/or place names, and replacing the generic entities with corresponding entity labels.

With reference to the first aspect, in a fourth possible implementation manner, performing semantic replacement includes:

looking up, in a semantic network, the words obtained by segmenting the questions in the question log corpus, abstracting and unifying words with the same or similar paraphrases into labels according to the paraphrases of the words, and performing the corresponding replacement, to generate a symbol label sequence consisting of named entities and the semantic concepts produced by semantic replacement.

With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner, performing frequent item set mining to generate a question template includes:

by setting a threshold range, obtaining frequent item sets from the candidate item sets of the question log corpus, and generating a question template.

With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner, performing frequent item set mining to generate a question template includes:

screening frequent item sets from the symbol label sequences by using a preset association rule algorithm according to a preset frequency threshold range and a preset item set length threshold range, and arranging the items of each item set in their default order to form a sequence, thereby generating a question template.

With reference to the fifth to sixth possible implementation manners of the first aspect, in seventh to eighth possible implementation manners, the method further includes:

sentence vector representation is carried out on the questions under the screened question template by utilizing a preset sentence vector model;

calculating the clustering compactness of the question template by using the following calculation formula:

screening question templates with clustering compactness larger than a compactness threshold according to a preset template clustering compactness threshold;

searching and comparing the screened question templates in a template library, and if the screened question templates do not exist in the template library, storing the screened question templates in the template library;

wherein, in the calculation formula, CP_j is the calculated cluster compactness of the j-th question template; X_i is the sentence vector of the i-th question under the j-th question template; W_j is the mean of all sentence vectors of the cluster corresponding to the j-th question template; Ω_j is the sum of the norms (modular lengths) of all sentence vectors of the cluster corresponding to the j-th question template; and i and j are integers greater than or equal to 1.

With reference to the seventh to eighth possible implementation manners of the first aspect, in ninth to tenth possible implementation manners, the preset sentence vector model is a deep learning encoder model Skip-Thoughts.

With reference to the seventh to eighth possible implementations of the first aspect, in eleventh to twelfth possible implementations, the method further includes:

adding answers corresponding to the screened question templates, forming complete question-answer pairs with the screened question templates, and storing the complete question-answer pairs in the template library.

In a second aspect, an apparatus for automatically generating a question template is provided, which includes:

the preparation module is used for preparing a question log corpus; the word segmentation and part-of-speech tagging module is used for performing word segmentation and part-of-speech tagging on the log corpus; the named entity recognition module is used for performing named entity recognition and replacement; the semantic replacement module is used for performing semantic replacement; and the frequent item set mining module is used for performing frequent item set mining to generate a question template.

With reference to the second aspect, in a first possible implementation manner, the preparation module includes an obtaining module and a preprocessing module, where the obtaining module is configured to obtain a question log corpus, and the preprocessing module is configured to perform preprocessing on the question log corpus, including punctuation removal, illegal symbol removal, and word case conversion.

With reference to the second aspect, in a second possible implementation manner, the named entity recognition module is configured to: perform named entity recognition on generic entities appearing in the question log corpus, including time, numbers and/or place names, and replace the generic entities with corresponding entity labels.

With reference to the second aspect, in a third possible implementation manner, the semantic replacement module is configured to: look up, in a semantic network, the words obtained by segmenting the questions in the question log corpus, abstract and unify words with the same or similar paraphrases into labels according to the paraphrases of the words, and perform the corresponding replacement, to generate a symbol label sequence consisting of named entities and the semantic concepts produced by semantic replacement.

With reference to the third possible implementation manner of the second aspect, in a fourth possible implementation manner, the frequent item set mining module is configured to: screen frequent item sets from the symbol label sequences by using a preset association rule algorithm according to a preset frequency threshold range and a preset item set length threshold range, and arrange the items of each item set in their default order to form a sequence, thereby generating a question template.

With reference to the third to fourth possible implementation manners of the second aspect, in fifth to sixth possible implementation manners, the apparatus further includes:

the sentence vector representation module is used for carrying out sentence vector representation on the questions under the screened question template by utilizing a preset sentence vector model;

the cluster compactness calculation module is used for calculating the cluster compactness of the question template by using the following calculation formula:

the screening module is used for screening question templates with clustering compactness larger than a compactness threshold according to a preset template clustering compactness threshold;

the confirming and storing module is used for searching and comparing the screened question templates in the template library, and storing the screened question templates in the template library if the screened question templates do not exist in the template library;

wherein, in the calculation formula, CP_j is the calculated cluster compactness of the j-th question template; X_i is the sentence vector of the i-th question under the j-th question template; W_j is the mean of all sentence vectors of the cluster corresponding to the j-th question template; Ω_j is the sum of the norms (modular lengths) of all sentence vectors of the cluster corresponding to the j-th question template; and i and j are integers greater than or equal to 1.

With reference to the fifth to sixth possible implementation manners of the second aspect, in seventh to eighth possible implementation manners, the preset sentence vector model is a deep learning encoder model Skip-Thoughts.

With reference to the fifth to sixth possible implementation manners of the second aspect, in ninth to tenth possible implementation manners, the apparatus further includes:

and the answer adding module is used for adding answers corresponding to the screened question templates, forming complete question and answer pairs with the screened question templates and storing the complete question and answer pairs in the template library.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

1. Through the semantic replacement step, different words that share one meaning are abstracted and unified according to their paraphrases, which increases semantic generalization capability;

2. Through frequent item set mining, frequent item sets are found from the candidate item sets and question templates are generated, which improves the efficiency of question template generation and saves a large amount of manual resources;

3. Item sets meeting the requirements are screened from the frequent item sets according to a preset item set frequency threshold range and a preset item set length threshold range to generate question templates, so that sentences with similar structures and common word sequences can be clustered and question templates of higher quality are obtained;

4. Through sentence vector representation with the preset sentence vector model, cluster compactness calculation, and screening of the question templates that meet the requirements, the quality of the generated templates can be evaluated in the semantic dimension, so that high-quality question templates with higher accuracy are obtained;

5. The screened question templates are searched for and compared in the template library, and are stored in the template library if they do not already exist there, so that the generated templates are effectively updated in the template library;

6. Answers corresponding to the screened question templates are added to form complete question-answer pairs with those templates, and the pairs are stored in the template library, which ensures the integrity of the question-answer pairs in the template library and realizes answer matching for automatically generated question templates.

In summary, the method and the device for automatically generating the question template provided by the embodiment of the invention can efficiently and automatically generate the question template, can evaluate the quality of the generated question template, autonomously and continuously expand or update the question template library, improve the quality of the intelligent question-answering system knowledge base, and can be widely popularized and applied in the technical fields of intelligent question-answering and the like which need customer service.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a flowchart of an automatic question template generation method provided in embodiment 1 of the present invention;

Fig. 2 is a flowchart of an automatic question template generation method provided in embodiment 2 of the present invention;

Fig. 3 is a schematic structural diagram of an automatic question template generation apparatus according to embodiment 3 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

The invention aims to solve the problems of generation and evaluation of question templates in the template matching technology of an intelligent question-answering system. In the question log of the question-answering system, the method and the device for automatically generating the question template provided by the embodiment of the invention can automatically generate the question template candidate set of the question-answering system by mining the frequent item set from the question log corpus, thereby automatically generating the question template, evaluating the quality of the template and establishing a self-learning mechanism of the intelligent question-answering system. Compared with the traditional method for establishing the template by utilizing the manual rule, the method and the device not only improve the efficiency of generating the question template and greatly save manual resources, but also can evaluate the generated question template, autonomously and continuously expand the question template library and improve the quality of the knowledge base of the intelligent question-answering system, so that the method and the device for automatically generating the question template provided by the embodiment of the invention can be widely popularized and applied in the technical fields of intelligent question-answering and the like which need customer service.

Example 1

Fig. 1 is a flowchart of an automatic question template generation method according to an embodiment of the present invention. As shown in fig. 1, the method for automatically generating a question template according to an embodiment of the present invention includes the following steps:

101. Prepare a question log corpus.

Specifically, a question log corpus is obtained and preprocessed; the preprocessing includes but is not limited to punctuation removal, illegal symbol removal, word case conversion, and the like.
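The preprocessing described in step 101 can be sketched roughly as follows. This is a minimal illustration only: the function name `preprocess_question` is hypothetical, and the assumption that "illegal symbols" means every character that is not a letter, digit, CJK character or whitespace is ours, since the text does not define the set.

```python
import re
import unicodedata


def preprocess_question(text: str) -> str:
    """Clean one raw question from the log (illustrative assumptions only)."""
    # Fold full-width forms to their ASCII equivalents, e.g. "？" -> "?".
    text = unicodedata.normalize("NFKC", text)
    # Case conversion: lower-case every letter.
    text = text.lower()
    # Punctuation / illegal symbol removal: keep word characters and whitespace.
    # ASSUMPTION: everything else counts as punctuation or an illegal symbol.
    text = re.sub(r"[^\w\s]", "", text)
    text = text.replace("_", "")
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


cleaned = preprocess_question("  Can I BUY insurance?!  ")
```

Normalizing before stripping means full-width Chinese punctuation is handled by the same rule as ASCII punctuation.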

102. Perform word segmentation and part-of-speech tagging on the log corpus.

Specifically, the log corpus is segmented using a word segmentation method combined with an industry dictionary. Different types of industry dictionaries can be created as needed or for different industries. In particular, when a specific vertical industry is involved, segmenting the corresponding log corpus with a word segmentation method combined with the industry dictionary achieves a good segmentation result.
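As a rough illustration of dictionary-assisted segmentation, the sketch below uses forward maximum matching against a small dictionary. It is a stand-in, not the segmenter used by the invention: the dictionary entries, the part-of-speech tags and the name `segment_and_tag` are all hypothetical, and a production system would more likely load the industry dictionary into an existing segmentation tool.

```python
# HYPOTHETICAL industry dictionary: word -> part-of-speech tag.
INDUSTRY_DICT = {
    "甲亢": "n",
    "糖尿病": "n",
    "投保": "v",
    "可以": "v",
    "请问": "v",
    "吗": "y",
}


def segment_and_tag(sentence, dictionary=INDUSTRY_DICT, max_len=4):
    """Forward maximum matching; unknown characters become single tokens tagged 'x'."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest window first, then shrink it.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in dictionary:
                tokens.append((piece, dictionary[piece]))
                i += size
                break
        else:
            # No dictionary entry starts here: emit one character as unknown.
            tokens.append((sentence[i], "x"))
            i += 1
    return tokens


tagged = segment_and_tag("甲亢可以投保吗")
```

Because industry terms such as disease names are in the dictionary, they are kept whole rather than split into single characters.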

103. Named entity recognition and replacement is performed.

Specifically, named entity recognition is performed on generic entities appearing in the question log corpus, including time, numbers and/or place names, and the generic entities are replaced with corresponding entity labels.
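A simple rule-based version of this replacement might look like the sketch below. The label names (`<TIME>`, `<PLACE>`, `<NUM>`), the regular expressions and the tiny place-name list are assumptions made for illustration; the text only names the entity types and leaves the recognizer open.

```python
import re

# ASSUMED gazetteer of place names (lower-cased, as after preprocessing).
PLACE_NAMES = ("shanghai", "beijing")

# Order matters: times are replaced before bare numbers so that the digits
# inside a date are not turned into separate number labels.
ENTITY_PATTERNS = [
    ("<TIME>", re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}:\d{2}\b")),
    ("<PLACE>", re.compile(r"\b(?:" + "|".join(PLACE_NAMES) + r")\b")),
    ("<NUM>", re.compile(r"\b\d+(?:\.\d+)?\b")),
]


def replace_entities(text):
    """Replace generic entities with their (hypothetical) entity labels."""
    for label, pattern in ENTITY_PATTERNS:
        text = pattern.sub(label, text)
    return text


labelled = replace_entities("i moved to shanghai on 2017-12-26 and paid 300 yuan")
```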

It should be noted that the above-mentioned log corpus preprocessing, word segmentation and part-of-speech tagging, and named entity recognition and replacement are sentence analysis processing techniques commonly used in the field of natural language processing, and any possible technical means or manners in the prior art may also be used to implement the above-mentioned processes, and in order to avoid redundancy, detailed description is not provided here.

104. Semantic replacement is performed.

Specifically, the words obtained by segmenting the questions in the question log corpus are looked up in a semantic network; according to their paraphrases, words with the same or similar paraphrases are abstracted and unified into labels and replaced accordingly, generating a symbol label sequence composed of named entities and the semantic concepts produced by semantic replacement. The semantic network is preferably the Chinese semantic network HowNet. The purpose of this step is to increase semantic generalization capability through semantic abstraction; words in a sentence that can neither be found in the semantic network nor recognized as named entities can simply be ignored.

105. Perform frequent item set mining to generate a question template.

Specifically, by setting a threshold range, frequent item sets are obtained from the candidate item sets of the question log corpus, and a question template is generated. Template mining is performed on the log corpus after it has been converted into symbol sequences. The key to template generation is to cluster frequent item sets out of the large number of label symbol sequences converted from different questions; these frequent item sets express, to a certain extent, the backbone of the sentence pattern, so question templates are generated automatically from the clustered frequent item sets. Preferably, according to the preset frequency threshold range and the preset item set length threshold range, a preset association rule algorithm is used to obtain the desired frequent item sets from the symbol label sequences so as to generate a question template. That is, frequent item set mining is performed using a predetermined association rule algorithm, and frequent item sets are found from the candidate item sets. Any feasible association rule algorithm in the prior art may be chosen as the predetermined association rule algorithm as needed, preferably the Apriori algorithm.

Example 2

Fig. 2 is a flowchart of an automatic question template generation method provided in embodiment 2 of the present invention. As shown in fig. 2, the method for automatically generating a question template according to an embodiment of the present invention includes the following steps:

201. Obtain a question log corpus and preprocess it; the preprocessing includes but is not limited to punctuation removal, illegal symbol removal and word case conversion.

It should be noted that any feasible obtaining method in the prior art may be adopted to obtain the question log corpus, and the embodiment of the present invention is not particularly limited in this respect; the process or manner of preprocessing the log corpus is likewise not limited to the above, and any feasible technical means or manner in the prior art may be adopted, which is not described in detail here.

202. Perform word segmentation and part-of-speech tagging on the log corpus using a word segmentation method combined with an industry dictionary.

According to the needs or different industries, different types of industry dictionaries can be created, and particularly when a specific vertical industry is involved, a word segmentation method combined with the industry dictionaries is adopted to segment words of corresponding log linguistic data, so that a good word segmentation effect can be achieved.

It should be noted that step 202 is not limited to the above operation; any feasible means or method in the prior art may be adopted, and the embodiment of the present invention is not limited in this respect.

203. Perform named entity recognition on generic entities appearing in the question log corpus, including time, numbers and/or place names, and replace the generic entities with corresponding entity labels.

It should be noted that the named entity identification and replacement can be implemented by any possible technical means or manner in the prior art, and will not be described in detail here to avoid redundancy.

204. Look up, in a semantic network, the words obtained by segmenting the questions in the question log corpus, abstract and unify words with the same or similar paraphrases into labels according to the paraphrases of the words, and perform the corresponding replacement, to generate a symbol label sequence consisting of named entities and the semantic concepts produced by semantic replacement.

Because Chinese contains a large number of cases in which several different words share one meaning, it is desirable to abstract concrete words into a word-sense representation so as to increase semantic generalization capability. The Chinese semantic network HowNet is preferably used for this semantic abstraction and replacement. Concretely, each segmented word is looked up in the semantic network, which records the paraphrases of words; words with the same or similar paraphrases are abstracted and unified into labels according to those paraphrases, and the corresponding replacement is performed. For example, the word "hepatitis" is defined as "disease" in HowNet, and the word "cold" is also defined as "disease"; all such words denoting "disease" are replaced with "disease" in the sentence, thereby achieving semantic abstraction. Words in the sentence that can neither be found in the semantic network nor recognized as named entities can simply be ignored. After this step, each corpus question has been converted from a sequence of words into a symbol sequence of named entity labels and semantic concept labels.
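The replacement in step 204 can be pictured with the short sketch below. The `CONCEPT_OF` mapping is a hand-written, hypothetical stand-in for a HowNet lookup (only the "hepatitis" and "cold" entries come from the text), and the entity label names are likewise assumed for illustration.

```python
# HYPOTHETICAL stand-in for a semantic network lookup: word -> concept label.
CONCEPT_OF = {
    "hepatitis": "disease",
    "cold": "disease",
    "diabetes": "disease",
    "insure": "apply_v",
    "buy": "apply_v",
    "can": "question_feasible",
}

# ASSUMED entity labels produced by the named entity step.
ENTITY_LABELS = {"<TIME>", "<NUM>", "<PLACE>"}


def to_symbol_sequence(tokens):
    """Map segmented tokens to a sequence of entity and concept labels."""
    symbols = []
    for token in tokens:
        if token in ENTITY_LABELS:
            symbols.append(token)              # keep entity labels as they are
        elif token in CONCEPT_OF:
            symbols.append(CONCEPT_OF[token])  # abstract the word to its concept
        # Tokens that are neither entities nor known concepts are ignored.
    return symbols


seq_a = to_symbol_sequence(["can", "i", "insure", "with", "hepatitis"])
seq_b = to_symbol_sequence(["can", "i", "buy", "with", "cold"])
```

Both example questions map to the same label sequence even though they use different words, which is the generalization effect the step aims for.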

205. Screen frequent item sets from the symbol label sequences by using a preset association rule algorithm according to a preset frequency threshold range and a preset item set length threshold range, and arrange the items of each item set in their default order to form a sequence, thereby generating a question template.

Specifically, template mining is performed on the log corpus after it has been converted into symbol sequences. The key to template generation is to cluster frequent item sets out of the large number of label symbol sequences converted from different questions; these frequent item sets express, to a certain extent, the backbone of the sentence pattern, so question templates are generated automatically from the clustered frequent item sets. That is, frequent item set mining is performed using a predetermined association rule algorithm, and frequent item sets are found from the candidate item sets. Any feasible association rule algorithm in the prior art may be chosen as the predetermined association rule algorithm as needed, preferably the Apriori algorithm.

In order to obtain relatively good generalization ability and semantic compactness, the quality of the templates generated in the previous step can be evaluated, so as to obtain question templates of better quality. Candidate template sequences that meet the requirements are selected according to the preset sequence frequency threshold range and the preset sequence length threshold range. Specifically, the two threshold indexes here are the length k1 of the sequence and the frequency k2 with which the sequence appears in different corpus questions. k1 and k2 can generally be set empirically. k1 is preferably set within [3,5]: templates shorter than 3 generalize well but have low semantic compactness, whereas templates longer than 5 are semantically compact but generalize relatively poorly. With a suitable threshold range, sentences with similar structures and common word sequences can be clustered, and question templates of higher quality are obtained.
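The two-threshold selection just described can be sketched as a simple filter over mined item sets. The function name, the dictionary layout (item set tuple mapped to its frequency) and the use of a single minimum frequency rather than a full frequency range are illustrative assumptions.

```python
def select_candidates(itemset_counts, length_range=(3, 5), min_frequency=3):
    """Keep item sets whose length lies in length_range and whose frequency
    across corpus questions is at least min_frequency (ASSUMED lower bound)."""
    shortest, longest = length_range
    return {
        items: count
        for items, count in itemset_counts.items()
        if shortest <= len(items) <= longest and count >= min_frequency
    }


# Hypothetical mined counts: item set -> number of questions containing it.
mined = {
    ("c", "d", "e"): 3,
    ("c", "d"): 3,
    ("a", "b", "c", "d", "e", "f"): 1,
}
candidates = select_candidates(mined)
```

Here the two-item set is dropped for being too short and the six-item set for being both too long and too rare.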

For example, the log corpus contains several questions: "May I ask, can someone with a genetic disease be insured?", "Can someone with hyperthyroidism be insured?", "Can someone with mild diabetes be insured?". After the processing of the preceding steps, the three sentences are converted into the following three symbol sequences, respectively:

[question verb_you disease question_feasible apply_v question_polar];

[disease question_feasible apply_v question_polar];

[disease question_feasible apply_v].

For convenience of explanation, the semantic concepts in the three sequences are replaced with the letters a, b, c, d, e and f, so that the sequences become [a b c d e f], [c d e f] and [c d e]. A frequent item set mining algorithm, namely the predetermined association rule algorithm, is then applied, with the frequency threshold and the item set length threshold both set to 3. First, the frequent item sets appearing in the sequences are computed: [c d e:3], [c d:3], [c e:3], [c:3], [d:3], [e:3] and so on; only the combinations that occur 3 times across all the sequences are listed. The letters before the colon denote a frequent item set found across the different question sequences, and the number after the colon is the frequency of that item set in the corpus. For example, the sequence [c d e] occurs, in that order, in all three example sequences. Applying the length threshold then leaves [c d e], and this symbol sequence is taken as the final generated template for the three example sentences.
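The worked example can be reproduced with the brute-force sketch below. Note that this is not the Apriori algorithm: it simply enumerates every order-preserving combination of each sequence and counts in how many sequences it occurs, which is adequate for three short sequences but would not scale. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(sequences, min_support):
    """Count order-preserving item combinations and keep the frequent ones."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for size in range(1, len(seq) + 1):
            # combinations() keeps items in their original (default) order.
            seen.update(combinations(seq, size))
        # Each item set is counted at most once per sequence.
        counts.update(seen)
    return {items: n for items, n in counts.items() if n >= min_support}


sequences = [list("abcdef"), list("cdef"), list("cde")]
frequent = frequent_itemsets(sequences, min_support=3)

# Apply the length threshold of 3 from the example.
templates = [items for items in frequent if len(items) >= 3]
```

With both thresholds at 3, the only surviving item set is (c, d, e), matching the template derived in the text.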

It should be noted that, the above-mentioned 205 step uses a predetermined association rule algorithm to perform frequent item set mining, find frequent item sets from candidate item sets, and generate a question template, which is only exemplary, and any other possible processes or manners may be used without departing from the specific inventive concept of this step of the embodiment of the present invention, and the embodiment of the present invention is not limited thereto.

The templates abstracted from the sentences in the above steps characterize similar questions in terms of structural features, but the different questions corresponding to one template are not necessarily fully similar in semantics. To further improve the quality of the question templates, a further evaluation in the semantic dimension is therefore needed; the implementation of this process is detailed in steps 206 to 209 below.

206. Represent the questions under the screened question templates as sentence vectors using the deep learning encoder model Skip-Thoughts.

This step supports the cluster compactness calculation performed on the clustered questions, which ensures that the different questions under one template have high semantic similarity, so that the generated template is evaluated in the semantic dimension. The questions are represented as sentence vectors. The sentence vector model adopts the Skip-Thoughts algorithm from Google, an unsupervised model that represents sentences as vectors of fixed dimension and can express semantics well when trained on a large-scale corpus. The model is trained offline using word vectors from the log corpus; when unknown words are encountered, an external Chinese Wikipedia corpus is preferably used in combination for vocabulary expansion.

207. Calculating the clustering compactness of the question template by using the following calculation formula:

The cluster compactness of question templates of multiple categories can be calculated through the calculation formula. In the calculation formula, CP_j is the calculated cluster compactness of the j-th question template; X_i is the sentence vector of the i-th question under the j-th question template; W_j is the average value of all sentence vectors of the cluster corresponding to the j-th question template; omega_j is the sum of the modular lengths (norms) of all sentence vectors of the cluster corresponding to the j-th question template; and i and j are integers greater than or equal to 1.
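The two cluster statistics defined above, W_j and omega_j, can be sketched as follows. This is only an illustration of the symbol definitions: the function name `cluster_statistics` and the sample vectors are hypothetical, and the sketch does not attempt to reproduce how the statistics are combined into CP_j.

```python
import math


def cluster_statistics(sentence_vectors):
    """Return (W_j, omega_j) for the sentence vectors X_i of one template."""
    n = len(sentence_vectors)
    dim = len(sentence_vectors[0])
    # W_j: component-wise average of all sentence vectors in the cluster
    mean_vector = [sum(v[d] for v in sentence_vectors) / n for d in range(dim)]
    # omega_j: sum of the Euclidean norms (modular lengths) of the vectors
    norm_sum = sum(math.sqrt(sum(x * x for x in v)) for v in sentence_vectors)
    return mean_vector, norm_sum


# Hypothetical 2-dimensional sentence vectors for one question template
vectors = [[3.0, 4.0], [0.0, 5.0], [6.0, 0.0]]
w_j, omega_j = cluster_statistics(vectors)
print(w_j, omega_j)
```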

208. Screening the question templates whose cluster compactness is larger than a compactness threshold, according to a preset template cluster compactness threshold. Illustratively, a cluster compactness threshold k3 is defined for the templates, and the templates whose cluster compactness is larger than this threshold are screened out, as the basis for evaluating the candidate templates generated in the preceding steps. Preferably, the initial value of the threshold k3 is set as follows: some templates in the original template library are randomly extracted, the cluster compactness of each extracted template is calculated, and the average value is taken.

It should be noted that step 208 is not limited to the above operation; any possible method in the prior art may be adopted, and the embodiment of the present invention is not limited thereto.
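The threshold initialization and screening of step 208 can be sketched as follows. This is a minimal illustration assuming that the cluster compactness of each template has already been calculated; the function names, the template identifiers and the compactness values are hypothetical.

```python
import random
from statistics import mean


def initial_threshold(library_compactness, sample_size, seed=None):
    """Initial k3: average compactness of randomly extracted library templates."""
    rng = random.Random(seed)
    sampled = rng.sample(sorted(library_compactness), sample_size)
    return mean(library_compactness[name] for name in sampled)


def screen_templates(candidate_compactness, k3):
    """Keep the candidate templates whose compactness is larger than k3."""
    return [name for name, cp in candidate_compactness.items() if cp > k3]


# Hypothetical compactness values of templates in the original library
library = {"T1": 0.5, "T2": 0.75, "T3": 0.875, "T4": 0.875}
# Sampling all four templates makes the average independent of the seed
k3 = initial_threshold(library, sample_size=4, seed=0)

# Hypothetical compactness values of newly generated candidate templates
candidates = {"C1": 0.82, "C2": 0.40}
kept = screen_templates(candidates, k3)
print(k3, kept)
```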

209. Searching for and comparing the screened question templates in a template library, and, if a screened question template does not exist in the template library, storing it in the template library.

210. Adding the answers corresponding to the screened question templates, forming complete question-answer pairs with the screened question templates, and storing the complete question-answer pairs in the template library.

It should be noted that, for the operations of steps 209 to 210, any possible process or manner in the prior art may be adopted without departing from the specific inventive concept of these steps of the embodiment of the present invention, and the embodiment of the present invention is not limited thereto.

Example 3

Fig. 3 is a schematic structural diagram of an automatic question template generation apparatus according to embodiment 3 of the present invention. As shown in Fig. 3, the automatic question template generation apparatus according to the embodiment of the present invention includes:

The preparation module 1, configured to prepare a question log corpus. Specifically, the preparation module 1 includes an obtaining module and a preprocessing module; the obtaining module is configured to obtain the question log corpus, and the preprocessing module is configured to preprocess the question log corpus, including punctuation removal, illegal symbol removal, and letter case conversion.

The word segmentation and part-of-speech tagging module 2, configured to perform word segmentation and part-of-speech tagging. Specifically, different types of industry dictionaries can be created according to different needs or industries; in particular, when a specific vertical industry is involved, the word segmentation and part-of-speech tagging module 2 segments the corresponding log corpus with a word segmentation method combined with the industry dictionary, so that a good word segmentation effect can be obtained.

The named entity recognition module 3, configured to perform named entity recognition and replacement. Specifically, the named entity recognition module 3 is configured to perform named entity recognition on the universal entities, including times, numbers and/or place names, that appear in the question log corpus, and to replace them with the corresponding entity labels.

The semantic replacement module 4, configured to perform semantic replacement. Specifically, the semantic replacement module 4 is configured to look up the segmented words of the questions in the question log corpus in a semantic network, abstract words with the same or similar senses into a unified label according to the senses of the words, and perform the corresponding replacement, so as to generate, after semantic replacement, a symbol label sequence consisting of named entities and semantic concepts.

The frequent item set mining module 5, configured to mine frequent item sets and generate question templates. Specifically, the frequent item set mining module 5 is configured to screen frequent item sets from the symbol label sequence by using a preset association rule algorithm according to a preset frequency threshold range and a preset item set length threshold range, and to form a sequence according to the default order of the items to generate a question template.
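The preprocessing performed by the preparation module 1 and the label replacement performed by the named entity recognition module 3 can be sketched as follows. This is only an illustration under stated assumptions: the regular expressions stand in for a real named entity recognizer, the labels `TIME` and `NUM` are hypothetical, and an English sentence is used for readability.

```python
import re


def preprocess(question):
    """Lower-case the text and strip punctuation and other symbols."""
    text = question.lower()
    # Keep word characters, whitespace and ':' (needed for clock times)
    text = re.sub(r"[^\w\s:]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def replace_entities(text):
    """Replace clock times and plain numbers with entity labels."""
    text = re.sub(r"\b\d{1,2}:\d{2}\b", "TIME", text)
    text = re.sub(r"\b\d+\b", "NUM", text)
    return text


raw = "Can I pay 300 yuan before 18:30?!"
cleaned = preprocess(raw)
labelled = replace_entities(cleaned)
print(labelled)
```

Times are replaced before numbers so that the digits of a clock time are not first turned into two separate number labels.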

Preferably, the above automatic question template generating device further includes:

a sentence vector representation module 6, configured to perform sentence vector representation on the questions in the clustered question templates by using a preset sentence vector model; preferably, the sentence vector model is the deep learning encoder model Skip-thoughts.

A cluster compactness calculating module 7, configured to calculate the cluster compactness of the question template by using the following calculation formula:

The screening module 8 is used for screening question templates with clustering compactness larger than a compactness threshold according to a preset template clustering compactness threshold;

the determining and storing module 9 is used for searching and comparing the screened question templates in the template library, and storing the screened question templates in the template library if the screened question templates do not exist in the template library;

further, preferably, the automatic question template generating device further includes:

and the answer adding module 10 is used for adding answers corresponding to the screened question templates, forming complete question and answer pairs with the screened question templates, and storing the complete question and answer pairs in the template library.

It should be noted that: the automatic question template generating device provided in the above embodiment is only illustrated by dividing the above functional modules when performing an automatic question template generating service, and in practical applications, the above function allocation may be completed by different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. In addition, the automatic question template generation device and the automatic question template generation method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiments and are not described herein again.

In summary, the method and the apparatus for automatically generating a question template according to the embodiments of the present invention have the following beneficial effects, compared with the prior art:

3. the method comprises the steps that an item set meeting requirements is screened from a frequent item set according to a preset sequence frequency threshold range and a preset sequence length threshold range to generate a question template, sentences with similar structures and public word sequences can be clustered, and a question template with higher quality is obtained;

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A question template automatic generation method is characterized by comprising the following steps:

preparing a question log corpus;

performing word segmentation and part-of-speech tagging on the log corpus;

carrying out named entity identification and replacement;

performing semantic substitution to obtain a sequence of symbolic labels;

performing frequent item set mining to generate a question template, comprising: screening frequent item sets from the symbol tag sequence by using a preset association rule algorithm according to a preset frequency threshold range and a preset item set length threshold range, and forming a sequence according to a default sequence of items to generate a question template;

sentence vector representation is carried out on the questions of the screened question template by utilizing a preset sentence vector model;

2. The method of claim 1, wherein preparing a question log corpus comprises:

3. The method of claim 1, wherein performing word segmentation and part-of-speech tagging on the log corpus comprises:

4. The method of claim 1, wherein conducting named entity identification and replacement comprises:

5. The method of claim 1, wherein performing semantic substitution comprises:

6. The method according to claim 1, wherein the preset sentence vector model is a deep learning encoder model Skip-thoughts.

7. The method of claim 1, further comprising:

8. An automatic question template generation device, comprising:

the preparation module is used for preparing a question log corpus;

the word segmentation and part of speech tagging module is used for carrying out word segmentation and part of speech tagging;

the named entity recognition module is used for recognizing and replacing named entities;

the semantic replacing module is used for performing semantic replacement;

a frequent itemset mining module, configured to perform frequent itemset mining to generate a question template, where the frequent itemset mining module is configured to: screening frequent item sets from the symbol tag sequence by using a preset association rule algorithm according to a preset frequency threshold range and a preset item set length threshold range, and forming a sequence according to a default sequence of items to generate a question template;

the sentence vector representation module is used for carrying out sentence vector representation on the questions of the screened question template by utilizing a preset sentence vector model;

9. The apparatus according to claim 8, wherein the preparation module comprises an obtaining module and a preprocessing module, the obtaining module is configured to obtain the question log corpus, and the preprocessing module is configured to preprocess the question log corpus, including punctuation removal, illegal symbol removal, and letter case conversion.

10. The apparatus of claim 8, wherein the named entity recognition module is configured to: perform named entity recognition on the universal entities, including times, numbers and/or place names, that appear in the question log corpus, and replace the universal entities with the corresponding entity labels.

11. The apparatus of claim 8, wherein the semantic replacement module is configured to: look up the segmented words of the questions in the question log corpus in a semantic network, abstract words with the same or similar senses into a unified label according to the senses of the words, and perform the corresponding replacement, so as to generate, after semantic replacement, a symbol label sequence consisting of named entities and semantic concepts.

12. The apparatus of claim 8, wherein the preset sentence vector model is a deep learning encoder model Skip-thoughts.

13. The apparatus of claim 8, further comprising: