CN108038234B - Automatic question template generating method and device - Google Patents

Info

Publication number
CN108038234B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711436114.0A
Other languages
Chinese (zh)
Other versions
CN108038234A (en)
Inventor
邹辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhongan Information Technology Service Co ltd
Original Assignee
Zhongan Information Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Zhongan Information Technology Service Co Ltd
Priority to CN201711436114.0A
Publication of CN108038234A
Application granted
Publication of CN108038234B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Abstract

The invention discloses a method and a device for automatically generating a question template, and belongs to the technical field of intelligent question answering. The method comprises the following steps: preparing a question log corpus; performing word segmentation and part-of-speech tagging on the log corpus; performing named entity recognition and replacement; performing semantic replacement; and performing frequent item set mining to generate a question template. The method and the device improve the efficiency of question template generation and greatly reduce manual effort; they can also evaluate the generated question templates, autonomously and continuously expand the question template library, and improve the quality of the knowledge base of the intelligent question-answering system.

Description

Automatic question template generating method and device
Technical Field
The invention relates to the technical field of intelligent question answering, in particular to a method and a device for automatically generating a question template.
Background
Currently, more and more enterprises handle a large volume of after-sales service and pre-sales consultation requests from users. Because the number of users is growing exponentially, answering every consultation manually consumes a large amount of manual resources. Moreover, the knowledge points involved are relatively concentrated, so manual answering usually involves a great deal of repeated labor. Intelligent question-answering systems arose to address this: such a system automatically answers the questions users input, greatly improving efficiency.
The intelligent question-answering system is based on techniques such as question template matching and knowledge base retrieval. A question template is a specific symbol label sequence formed by recognizing and replacing the elements of a question, and a corresponding answer is attached to each question template. When an incoming question is the same as a template or highly similar to it, the template matching technique can match the question and answer it. The difficulty of question template matching lies in generating question templates efficiently and sustainably. Traditional question template generation requires templates to be set manually for specific sentence patterns, which is both tedious and poor in coverage; when the knowledge base is updated, new templates not covered by the template library must also be set and evaluated manually, so maintainability and self-learning are poor. No technical solution addressing these problems has been found in the related patents disclosed so far. For example, patent application 201611076382.1 (a method and an apparatus for automatic question-answering template matching) obtains the matching question for a question to be answered by determining, for each segmented word of the question to be answered, a subset of the template question set, which improves the template matching efficiency and accuracy of an automatic question-answering system; however, it does not address automatic template generation or the evaluation of the quality of generated templates.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for automatically generating a question template. The technical scheme is as follows:
in a first aspect, a method for automatically generating a question template is provided, where the method includes:
preparing a question log corpus; performing word segmentation and part-of-speech tagging on the log corpus; performing named entity recognition and replacement; performing semantic replacement; and performing frequent item set mining to generate a question template.
With reference to the first aspect, in a first possible implementation manner, preparing a question log corpus includes:
obtaining a question log corpus, and preprocessing the question log corpus, including punctuation removal, illegal symbol removal, and word case conversion.
With reference to the first aspect, in a second possible implementation manner, performing word segmentation and part-of-speech tagging on the log corpus includes:
performing word segmentation on the log corpus using a word segmentation method combined with an industry dictionary.
With reference to the first aspect, in a third possible implementation manner, performing named entity recognition and replacement includes:
performing named entity recognition on general entities, including times, numbers and/or place names, that appear in the question log corpus, and replacing them with corresponding entity labels.
With reference to the first aspect, in a fourth possible implementation manner, performing semantic replacement includes:
looking up, in a semantic network, the words obtained by segmenting the questions in the question log corpus; abstracting and unifying words with the same or similar paraphrases into labels according to those paraphrases; and performing the corresponding replacement, so as to generate a symbol label sequence consisting of named entities and semantically replaced concepts.
With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner, performing frequent item set mining to generate a question template includes:
obtaining frequent item sets from the candidate item sets of the question log corpus by setting a threshold range, and generating a question template.
With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner, performing frequent item set mining to generate a question template includes:
screening frequent item sets from the symbol label sequences by using a preset association rule algorithm according to a preset frequency threshold range and a preset item set length threshold range, and arranging the items in their default order to form a sequence, thereby generating a question template.
With reference to the fifth to sixth possible implementation manners of the first aspect, in seventh to eighth possible implementation manners, the method further includes:
sentence vector representation is carried out on the questions under the screened question template by utilizing a preset sentence vector model;
calculating the clustering compactness of the question template by using the following calculation formula:
[Formula image BDA0001525849240000031: the cluster compactness CP_j, expressed in terms of X_i, W_j and omega_j]
screening question templates with clustering compactness larger than a compactness threshold according to a preset template clustering compactness threshold;
searching and comparing the screened question templates in a template library, and if the screened question templates do not exist in the template library, storing the screened question templates in the template library;
wherein, in the calculation formula, CP_j is the calculated cluster compactness of the j-th question template; X_i is the sentence vector of the i-th question under the j-th question template; W_j is the mean of all sentence vectors of the cluster corresponding to the j-th question template; omega_j is the sum of the moduli (vector lengths) of all sentence vectors of the cluster corresponding to the j-th question template; and i and j are integers greater than or equal to 1.
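As a rough illustration of the quantities named above, the following Python sketch computes W_j and omega_j for one cluster of sentence vectors. The calculation formula itself is available here only as an image reference, so the final scoring expression (`1 - spread / omega`) is purely an assumption, chosen so that a larger value means a tighter cluster; it should not be read as the formula of the invention. The function name `cluster_compactness` and the toy two-dimensional vectors are likewise hypothetical; real sentence vectors would come from the preset sentence vector model.

```python
import math


def cluster_compactness(vectors):
    """Score how tightly one template's sentence vectors cluster (assumed formula)."""
    n = len(vectors)
    dim = len(vectors[0])
    # W_j: component-wise mean of all sentence vectors in the cluster.
    centroid = [sum(v[d] for v in vectors) / n for d in range(dim)]
    # omega_j: sum of the Euclidean norms of all sentence vectors.
    omega = sum(math.sqrt(sum(x * x for x in v)) for v in vectors)
    # Total Euclidean distance of the vectors from the centroid.
    spread = sum(math.dist(v, centroid) for v in vectors)
    # ASSUMPTION: combine the quantities so that larger means more compact.
    # This raises ZeroDivisionError if every vector is the zero vector.
    return 1.0 - spread / omega


# Identical vectors form the tightest possible cluster under this score.
tight = cluster_compactness([[1.0, 0.0], [1.0, 0.0]])
# Orthogonal vectors are more spread out and therefore score lower.
loose = cluster_compactness([[1.0, 0.0], [0.0, 1.0]])
```

With a score of this shape, templates would be kept when the value exceeds the preset compactness threshold, matching the "larger than a compactness threshold" screening described above.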
With reference to the seventh to eighth possible implementation manners of the first aspect, in ninth to tenth possible implementation manners, the preset sentence vector model is a deep learning encoder model Skip-Thoughts.
With reference to the seventh to eighth possible implementations of the first aspect, in eleventh to twelfth possible implementations, the method further includes:
and adding answers corresponding to the screened question templates, forming complete question and answer pairs with the screened question templates, and storing the complete question and answer pairs in a template library.
In a second aspect, an apparatus for automatically generating a question template is provided, which includes:
the preparation module is used for preparing a question log corpus; the word segmentation and part-of-speech tagging module is used for performing word segmentation and part-of-speech tagging; the named entity recognition module is used for recognizing and replacing named entities; the semantic replacement module is used for performing semantic replacement; and the frequent item set mining module is used for mining frequent item sets to generate question templates.
With reference to the second aspect, in a first possible implementation manner, the preparation module includes an obtaining module and a preprocessing module, where the obtaining module is configured to obtain a question log corpus, and the preprocessing module is configured to perform preprocessing on the question log corpus, including punctuation removal, illegal symbol removal, and word case conversion.
With reference to the second aspect, in a second possible implementation manner, the named entity recognition module is configured to: perform named entity recognition on general entities, including times, numbers and/or place names, that appear in the question log corpus, and replace them with corresponding entity labels.
With reference to the second aspect, in a third possible implementation manner, the semantic replacement module is configured to: look up, in a semantic network, the words obtained by segmenting the questions in the question log corpus; abstract and unify words with the same or similar paraphrases into labels according to those paraphrases; and perform the corresponding replacement, so as to generate a symbol label sequence consisting of named entities and semantically replaced concepts.
With reference to the third possible implementation manner of the second aspect, in a fourth possible implementation manner, the frequent item set mining module is configured to: screen frequent item sets from the symbol label sequences by using a preset association rule algorithm according to a preset frequency threshold range and a preset item set length threshold range, and arrange the items in their default order to form a sequence, thereby generating a question template.
With reference to the third to fourth possible implementation manners of the second aspect, in fifth to sixth possible implementation manners, the apparatus further includes:
the sentence vector representation module is used for carrying out sentence vector representation on the questions under the screened question template by utilizing a preset sentence vector model;
the cluster compactness calculation module is used for calculating the cluster compactness of the question template by using the following calculation formula:
[Formula image BDA0001525849240000041: the cluster compactness CP_j, expressed in terms of X_i, W_j and omega_j]
the screening module is used for screening question templates with clustering compactness larger than a compactness threshold according to a preset template clustering compactness threshold;
the confirming and storing module is used for searching and comparing the screened question templates in the template library, and storing the screened question templates in the template library if the screened question templates do not exist in the template library;
wherein, in the calculation formula, CP_j is the calculated cluster compactness of the j-th question template; X_i is the sentence vector of the i-th question under the j-th question template; W_j is the mean of all sentence vectors of the cluster corresponding to the j-th question template; omega_j is the sum of the moduli (vector lengths) of all sentence vectors of the cluster corresponding to the j-th question template; and i and j are integers greater than or equal to 1.
With reference to the fifth to sixth possible implementation manners of the second aspect, in seventh to eighth possible implementation manners, the preset sentence vector model is a deep learning encoder model Skip-Thoughts.
With reference to the fifth to sixth possible implementation manners of the second aspect, in ninth to tenth possible implementation manners, the apparatus further includes:
and the answer adding module is used for adding answers corresponding to the screened question templates, forming complete question and answer pairs with the screened question templates and storing the complete question and answer pairs in the template library.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
1. through the semantic replacement step, different words that share one meaning are abstracted and unified according to their paraphrases, which increases semantic generalization capability;
2. by mining the frequent item sets, the frequent item sets are found from the candidate item sets, and the question template is generated, so that the efficiency of generating the question template is improved, and a large amount of manual resources are saved;
3. item sets that meet the requirements are screened from the frequent item sets according to a preset item set frequency threshold range and a preset item set length threshold range to generate question templates, so that sentences with similar structures and common word sequences can be clustered and question templates of higher quality are obtained;
4. by using the sentence vector representation of the preset sentence vector model, the cluster compactness calculation and the screening of question templates meeting the requirements, the quality evaluation of the generated templates in semantic dimensions can be realized, so that high-quality question templates with higher accuracy are obtained;
5. searching and comparing the screened question templates in a template library, and if the screened question templates do not exist in the template library, storing the screened question templates in the template library so as to facilitate the effective updating of the generated templates in the template library;
6. and adding answers corresponding to the screened question templates, forming complete question and answer pairs with the screened question templates, storing the complete question and answer pairs in a template library, ensuring the integrity of the question and answer pairs of the templates in the template library, and realizing the answer matching of automatically generated question templates.
In summary, the method and the device for automatically generating the question template provided by the embodiment of the invention can efficiently and automatically generate the question template, can evaluate the quality of the generated question template, autonomously and continuously expand or update the question template library, improve the quality of the intelligent question-answering system knowledge base, and can be widely popularized and applied in the technical fields of intelligent question-answering and the like which need customer service.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an automatic question template generation method provided in embodiment 1 of the present invention;
Fig. 2 is a flowchart of an automatic question template generation method provided in embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of an automatic question template generation apparatus according to embodiment 3 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The invention aims to solve the problems of generating and evaluating question templates in the template matching technology of an intelligent question-answering system. The method and the device for automatically generating a question template provided by the embodiments of the invention mine frequent item sets from the question log corpus of the question-answering system to produce a candidate set of question templates, and thereby automatically generate question templates, evaluate template quality, and establish a self-learning mechanism for the intelligent question-answering system. Compared with the traditional approach of building templates from manual rules, the method and the device improve the efficiency of question template generation and greatly reduce manual effort; they can also evaluate the generated question templates, autonomously and continuously expand the question template library, and improve the quality of the knowledge base of the intelligent question-answering system. The method and the device can therefore be widely applied in technical fields that require customer service, such as intelligent question answering.
Example 1
Fig. 1 is a flowchart of an automatic question template generation method according to an embodiment of the present invention. As shown in fig. 1, the method for automatically generating a question template according to an embodiment of the present invention includes the following steps:
101. Preparing a question log corpus.
Specifically, a question log corpus is obtained and preprocessed; the preprocessing includes, but is not limited to, punctuation removal, illegal symbol removal, and word case conversion.
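The preprocessing in step 101 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: "illegal symbols" is interpreted here as Unicode control and format characters, case conversion is taken to mean lowercasing, and the function name `preprocess_question` is hypothetical. The embodiment does not prescribe any of these choices.

```python
import re
import unicodedata


def preprocess_question(text: str) -> str:
    """Normalize one raw question from the log corpus."""
    # Fold full-width characters to their standard forms and lowercase the text.
    text = unicodedata.normalize("NFKC", text).lower()
    # Turn every whitespace run (including tabs and newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    kept = []
    for ch in text:
        major = unicodedata.category(ch)[0]
        # Skip control/format characters (C*), punctuation (P*) and symbols (S*).
        if major in ("C", "P", "S"):
            continue
        kept.append(ch)
    # Removing characters can leave double spaces, so collapse once more.
    return re.sub(r"\s+", " ", "".join(kept)).strip()


cleaned = preprocess_question("Can I  BUY insurance?!")  # "can i buy insurance"
```

Whitespace is normalized before the category filter because tabs and newlines are themselves control characters and would otherwise be deleted rather than turned into separators.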
102. Performing word segmentation and part-of-speech tagging on the log corpus.
Specifically, the log corpus is segmented using a word segmentation method combined with an industry dictionary. Different industry dictionaries can be created as needed for different industries; in particular, when a specific vertical industry is involved, segmenting the corresponding log corpus with a method that incorporates the industry dictionary achieves a good segmentation result.
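To make the role of the industry dictionary concrete, the sketch below segments a sentence by forward maximum matching against a small lexicon that also carries part-of-speech tags. Forward maximum matching is a simple stand-in chosen for illustration; the embodiment does not name a particular segmentation algorithm. The lexicon entries, the tag letters, and the function name `segment_with_dictionary` are all hypothetical.

```python
# Hypothetical insurance-industry lexicon: word -> part-of-speech tag.
INDUSTRY_LEXICON = {
    "糖尿病": "n",  # diabetes
    "甲亢": "n",    # hyperthyroidism
    "可以": "v",    # can
    "投保": "v",    # buy insurance
    "吗": "y",      # question particle
}


def segment_with_dictionary(text, lexicon, max_word_len=4):
    """Forward maximum matching: return a list of (word, tag) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon entry matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in lexicon:
                tokens.append((word, lexicon[word]))
                i += size
                break
        else:
            # No entry matched: emit the single character with an "unknown" tag.
            tokens.append((text[i], "x"))
            i += 1
    return tokens


tokens = segment_with_dictionary("轻微糖尿病可以投保吗", INDUSTRY_LEXICON)
```

Because "糖尿病" is in the lexicon, it is kept as one token instead of being split into single characters, which is the effect the industry dictionary is meant to achieve.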
103. Performing named entity recognition and replacement.
Specifically, named entity recognition is performed on general entities, including times, numbers and/or place names, that appear in the question log corpus, and these entities are replaced with corresponding entity labels.
It should be noted that the above-mentioned log corpus preprocessing, word segmentation and part-of-speech tagging, and named entity recognition and replacement are sentence analysis processing techniques commonly used in the field of natural language processing, and any possible technical means or manners in the prior art may also be used to implement the above-mentioned processes, and in order to avoid redundancy, detailed description is not provided here.
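As one possible illustration of entity replacement, the following sketch swaps time, number and place-name tokens for entity labels using regular expressions and a tiny gazetteer. The label names (`TIME`, `NUM`, `LOC`), the patterns, the place list, and the function name `replace_entities` are assumptions for illustration; a production system would normally rely on a trained named entity recognizer.

```python
import re

# Hypothetical patterns for two common time formats and for plain numbers.
TIME_RE = re.compile(r"\d{1,2}:\d{2}|\d{4}-\d{2}-\d{2}")
NUM_RE = re.compile(r"\d+(\.\d+)?")
# Hypothetical gazetteer of place names.
PLACES = {"shanghai", "beijing"}


def replace_entities(tokens):
    """Replace general entities in a token list with entity labels."""
    out = []
    for tok in tokens:
        if TIME_RE.fullmatch(tok):
            out.append("TIME")
        elif NUM_RE.fullmatch(tok):
            out.append("NUM")
        elif tok in PLACES:
            out.append("LOC")
        else:
            out.append(tok)
    return out


tagged = replace_entities(
    ["policy", "bought", "2017-12-26", "in", "shanghai", "for", "300", "yuan"]
)
```

The time pattern is checked before the number pattern so that a date is not partially treated as a number.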
104. Performing semantic replacement.
Specifically, the words obtained by segmenting the questions in the question log corpus are looked up in a semantic network; words with the same or similar paraphrases are abstracted and unified into labels according to those paraphrases, and the corresponding replacement is performed, generating a symbol label sequence consisting of named entities and semantically replaced concepts. The semantic network is preferably the Chinese semantic network HowNet. The purpose of this step is to increase semantic generalization capability through semantic abstraction; words in a sentence that cannot be found in the semantic network and are not recognized as named entities can simply be ignored.
105. Performing frequent item set mining to generate a question template.
Specifically, by setting a threshold range, frequent item sets are obtained from the candidate item sets of the question log corpus, and a question template is generated. Template mining is performed on the log corpus after it has been converted into symbol sequences. The key to template generation is to cluster frequent item sets out of the large number of label symbol sequences produced from different questions; these frequent item sets express, to a certain extent, the backbone of the sentence pattern, so question templates can be generated automatically from the clustered frequent item sets. Preferably, according to a preset frequency threshold range and a preset item set length threshold range, a preset association rule algorithm is used to obtain the desired frequent item sets from the symbol label sequences so as to generate a question template. That is, frequent item set mining is performed with a predetermined association rule algorithm, and the frequent item sets are found among the candidate item sets. Any feasible association rule algorithm in the prior art may be chosen as needed, preferably the Apriori algorithm.
Example 2
Fig. 2 is a flowchart of an automatic question template generation method provided in embodiment 2 of the present invention. As shown in fig. 2, the method for automatically generating a question template according to an embodiment of the present invention includes the following steps:
201. Obtaining a question log corpus and preprocessing it, where the preprocessing includes but is not limited to punctuation removal, illegal symbol removal, and word case conversion.
It should be noted that the question log corpus may be obtained by any feasible method in the prior art, and the embodiment of the present invention places no particular limitation on this; likewise, the preprocessing of the log corpus is not limited to the operations above, and any feasible technical means in the prior art may be adopted, which is not described in detail here.
202. Performing word segmentation and part-of-speech tagging on the log corpus using a word segmentation method combined with an industry dictionary.
Different industry dictionaries can be created as needed for different industries; in particular, when a specific vertical industry is involved, segmenting the corresponding log corpus with a method that incorporates the industry dictionary achieves a good segmentation result.
It should be noted that step 202 is not limited to the above operation; any feasible method in the prior art may be adopted, and the embodiment of the present invention is not limited in this respect.
203. Performing named entity recognition on general entities, including times, numbers and/or place names, that appear in the question log corpus, and replacing them with corresponding entity labels.
It should be noted that the named entity identification and replacement can be implemented by any possible technical means or manner in the prior art, and will not be described in detail here to avoid redundancy.
204. Looking up, in a semantic network, the words obtained by segmenting the questions in the question log corpus; abstracting and unifying words with the same or similar paraphrases into labels according to those paraphrases; and performing the corresponding replacement, so as to generate a symbol label sequence consisting of named entities and semantically replaced concepts.
Because Chinese has many cases of one meaning being expressed by multiple words, it is desirable to abstract concrete words into a word-sense representation so as to increase semantic generalization capability. The Chinese semantic network HowNet is preferably used for this semantic abstraction and replacement. Specifically, each segmented word is looked up in the semantic network, which stores the paraphrases of words; words with the same or similar paraphrases are abstracted and unified into labels according to those paraphrases, and the corresponding replacement is performed. For example, the word "hepatitis" is defined as "disease" in HowNet, and the word "cold" is also defined as "disease"; all such words denoting a disease are replaced with "disease" in the sentence, achieving semantic abstraction. Words in a sentence that cannot be found in the semantic network and are not recognized as named entities can simply be ignored. After this step, each corpus question has been converted from a sequence of words into a symbol sequence of named entities and semantically replaced concepts.
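The replacement described in step 204 can be sketched with a plain dictionary standing in for the semantic network. The mapping `CONCEPTS` below is a hypothetical stand-in and not the HowNet API; only the "hepatitis" and "cold" entries come from the example above, and the remaining entries, the entity label set, and the function name `semantic_replace` are assumptions.

```python
# Hypothetical word -> concept mapping standing in for a HowNet lookup.
CONCEPTS = {
    "hepatitis": "disease",
    "cold": "disease",
    "diabetes": "disease",
    "insure": "apply_v",
}
# Entity labels assumed to have been produced by the previous step.
ENTITY_LABELS = {"TIME", "NUM", "LOC"}


def semantic_replace(tokens, concepts=CONCEPTS, entity_labels=ENTITY_LABELS):
    """Map each token to its concept label; drop tokens that match nothing."""
    sequence = []
    for tok in tokens:
        if tok in entity_labels:
            # Named entity labels pass through unchanged.
            sequence.append(tok)
        elif tok in concepts:
            # Words sharing a paraphrase collapse to one concept label.
            sequence.append(concepts[tok])
        # Anything else is ignored, as the step allows.
    return sequence


seq_a = semantic_replace(["can", "hepatitis", "insure"])  # ["disease", "apply_v"]
seq_b = semantic_replace(["cold", "NUM", "days"])         # ["disease", "NUM"]
```

Both "hepatitis" and "cold" end up as the same `disease` symbol, so questions about different illnesses can later support the same frequent item set.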
205. Screening frequent item sets from the symbol label sequences by using a preset association rule algorithm according to a preset frequency threshold range and a preset item set length threshold range, and arranging the items in their default order to form a sequence, thereby generating a question template.
Specifically, template mining is performed on the log corpus after it has been converted into symbol sequences. The key to template generation is to cluster frequent item sets out of the large number of label symbol sequences produced from different questions; these frequent item sets express, to a certain extent, the backbone of the sentence pattern, so question templates can be generated automatically from the clustered frequent item sets. That is, frequent item set mining is performed with a predetermined association rule algorithm, and the frequent item sets are found among the candidate item sets. Any feasible association rule algorithm in the prior art may be chosen as needed, preferably the Apriori algorithm.
In order to obtain relatively good generalization capability and semantic compactness, the quality of the templates generated in the previous step can be evaluated, so as to obtain question templates of better quality. Candidate template sequences that meet the requirements are selected according to the preset sequence frequency threshold range and the preset sequence length threshold range. Specifically, the two threshold indexes are the length k1 of the sequence and the frequency k2 with which the sequence appears in different corpus questions. Both k1 and k2 can generally be set empirically. k1 is preferably set within [3,5]: templates shorter than 3 generalize well but have low semantic compactness, whereas templates longer than 5 are semantically tight but generalize relatively poorly. With a suitable threshold range, sentences with similar structures and common word sequences can be clustered, yielding question templates of higher quality.
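The threshold screening described here amounts to a filter over candidate sequences. In the sketch below, the length bounds follow the preferred [3,5] range stated above, while the default minimum frequency of 3, the candidate data, and the function name `screen_candidates` are assumptions chosen for illustration.

```python
def screen_candidates(candidates, min_len=3, max_len=5, min_freq=3):
    """Keep candidate sequences whose length and frequency meet the thresholds.

    `candidates` maps a tuple of symbols to the number of corpus questions
    in which that sequence appears.
    """
    return {
        seq: freq
        for seq, freq in candidates.items()
        if min_len <= len(seq) <= max_len and freq >= min_freq
    }


# Hypothetical candidates with their observed frequencies.
candidates = {
    ("c", "d", "e"): 3,                 # long enough and frequent: kept
    ("c", "d"): 3,                      # too short: generalizes but is loose
    ("c", "d", "e", "f"): 2,            # not frequent enough
    ("a", "b", "c", "d", "e", "f"): 1,  # too long and too rare
}
selected = screen_candidates(candidates)
```

Raising `min_len` trades generalization for semantic tightness, which is the trade-off the preferred range is meant to balance.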
For example, suppose the log corpus contains the following questions: "May I ask, can someone with a genetic disorder buy insurance?", "Can someone with hyperthyroidism buy insurance?", "Can someone with mild diabetes be insured". After the processing of the previous steps, the three sentences are converted into the following three symbol sequences, respectively:
[question verb_you disease question_feasible apply_v question_polar];
[disease question_feasible apply_v question_polar];
[disease question_feasible apply_v].
For convenience of explanation, the semantic concepts in the three sequences are replaced with the letters a, b, c, d, e and f, so that the sequences become [a b c d e f], [c d e f] and [c d e]. A frequent item set mining algorithm, i.e. the predetermined association rule algorithm, is then applied, with the frequency threshold and the item set length threshold both set to 3. First, the frequent item sets appearing in the sequences are computed: [c d e:3], [c d:3], [c e:3], [c:3], [d:3], [e:3] and so on; only the combinations that appear 3 times across all sequences are listed. The letter combination before the colon is a frequent item set found across the question sequences, and the number after the colon is the frequency of that item set in the corpus. For example, [c d e] occurs, in that order, in all three example sequences. Because [c d e] is the item set that satisfies both thresholds (length 3 and frequency 3), it is selected, and the corresponding symbol sequence [disease question_feasible apply_v] serves as the template finally generated for the three example sentences.
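The worked example above can be reproduced with a small level-wise Apriori sketch. This is an illustration only, not the patent's implementation: the function names `apriori` and `to_template`, the treatment of each symbol sequence as a set of items, and the choice to order a template's items as they appear in the first sequence containing them are all assumptions made here.

```python
from itertools import combinations


def apriori(sequences, min_support):
    """Level-wise Apriori; returns {frozenset(items): number of sequences containing it}."""
    transactions = [frozenset(seq) for seq in sequences]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join: merge frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def to_template(itemset, sequences):
    """Order items as in the first sequence containing all of them (assumed 'default order')."""
    for seq in sequences:
        if itemset <= set(seq):
            return [tok for tok in seq if tok in itemset]
    return sorted(itemset)


sequences = [list("abcdef"), list("cdef"), list("cde")]
frequent = apriori(sequences, min_support=3)
# Keep item sets whose length lies in the length threshold range (here exactly 3).
templates = [to_template(s, sequences) for s in frequent if 3 <= len(s) <= 3]
print(templates)  # [['c', 'd', 'e']]
```

With a frequency threshold of 3, the sketch finds seven frequent item sets (c, d, e, and their pairwise and three-way combinations), and the length filter leaves only [c d e].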
It should be noted that the use in step 205 of a predetermined association rule algorithm to mine frequent item sets, find frequent item sets from the candidate item sets, and generate a question template is only exemplary. Any other feasible process or manner may be used without departing from the inventive concept of this step, and the embodiment of the present invention is not limited in this respect.
The templates abstracted from sentences in the above steps capture the structural features that similar questions share, but the different questions grouped under one template are not necessarily similar in meaning. To further improve the quality of the question templates, they therefore also need to be evaluated on the semantic dimension; the implementation of this process is detailed in steps 206 to 209 below.
206. Represent the questions under the screened question templates as sentence vectors using the deep learning encoder model Skip-thoughts.
This step computes the cluster compactness of the clustered questions to ensure that the different questions under one template have high semantic similarity, thereby evaluating the generated template on the semantic dimension. The questions are represented as sentence vectors. The sentence vector model adopts the Skip-thoughts algorithm from Google, an unsupervised model that encodes a sentence as a vector of fixed dimension and expresses semantics well when trained on a large-scale corpus. The model is trained offline using word vectors from the log corpus; when unknown words are encountered, an external Chinese Wikipedia corpus can preferably be used for vocabulary expansion.
207. Calculate the cluster compactness of the question template using the following formula:
[Formula image BDA0001525849240000121: definition of CP_j]
The cluster compactness of question templates of multiple different categories can be calculated with this formula. In the formula, CP_j is the calculated cluster compactness of the j-th question template, X_i is the sentence vector of the i-th question under the j-th question template, W_j is the mean of all sentence vectors in the cluster corresponding to the j-th question template, Ω_j is the sum of the moduli (norms) of all sentence vectors in the cluster corresponding to the j-th question template, and i and j are integers greater than or equal to 1.
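The formula itself is available only as an image, so the sketch below does not attempt to reproduce CP_j. It only computes, for a toy cluster, the two cluster-level quantities named in the definitions: the mean vector W_j and the sum of vector norms Ω_j. The helper names `mean_vector` and `norm_sum`, the use of the Euclidean norm for "modulus", and the two-dimensional vectors are assumptions for illustration; real sentence vectors have far more dimensions.

```python
import math


def mean_vector(vectors):
    """W_j: component-wise mean of the sentence vectors in one cluster."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def norm_sum(vectors):
    """Omega_j: sum of the Euclidean norms of the sentence vectors (norm choice is assumed)."""
    return sum(math.sqrt(sum(x * x for x in v)) for v in vectors)


# Toy cluster: three 2-dimensional "sentence vectors" under one template.
cluster = [[3.0, 4.0], [6.0, 8.0], [0.0, 0.0]]
w_j = mean_vector(cluster)    # [3.0, 4.0]
omega_j = norm_sum(cluster)   # 5.0 + 10.0 + 0.0 = 15.0
print(w_j, omega_j)
```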
208. Screen for question templates whose cluster compactness is greater than a preset template cluster compactness threshold. Illustratively, a cluster compactness threshold k3 is defined for templates, and the templates whose cluster compactness exceeds this threshold are retained; this serves as the basis for evaluating the candidate templates generated in the preceding steps. Preferably, the initial value of k3 is obtained by randomly extracting some templates from the original template library, calculating the cluster compactness of each, and taking the average.
It should be noted that step 208 is not limited to the above operation; any feasible method in the prior art may be adopted, and the embodiment of the present invention is not limited in this respect.
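The threshold initialisation and screening of step 208 can be sketched as follows. Compactness values are passed in as plain numbers, because the formula of step 207 is not reproduced in this text; the function names, the sample size, and all example scores are hypothetical.

```python
import random
from statistics import mean


def initial_threshold(library_scores, sample_size, rng=random):
    """k3: mean compactness of a random sample of templates from the existing library."""
    sample = rng.sample(list(library_scores.values()), sample_size)
    return mean(sample)


def screen(candidate_scores, k3):
    """Keep candidate templates whose cluster compactness is greater than k3."""
    return [tpl for tpl, cp in candidate_scores.items() if cp > k3]


# Hypothetical compactness values for templates already in the library.
library_scores = {"tpl_a": 0.60, "tpl_b": 0.70, "tpl_c": 0.80}
k3 = initial_threshold(library_scores, sample_size=3, rng=random.Random(0))

# Hypothetical compactness values for newly mined candidate templates.
candidate_scores = {"disease question_feasible apply_v": 0.90, "disease apply_v": 0.50}
kept = screen(candidate_scores, k3)
print(kept)
```

Passing a seeded `random.Random` instance keeps the sampled threshold reproducible, which makes it easier to compare screening runs.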
209. Look up the screened question templates in the template library and, if a screened question template does not exist in the template library, store it in the template library.
210. Add the answers corresponding to the screened question templates, form complete question-answer pairs with the screened question templates, and store the pairs in the template library.
It should be noted that steps 209 to 210 may be carried out by any feasible process or manner in the prior art without departing from the inventive concept of these steps, and the embodiment of the present invention is not limited in this respect.
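Steps 209 and 210 amount to a membership check followed by an insert. A minimal in-memory sketch follows, assuming the template library is a dictionary keyed by the template's symbol sequence; a production system would presumably use a database. The class name `TemplateLibrary` and the answer strings are invented for illustration.

```python
class TemplateLibrary:
    """Toy template library: maps a template (tuple of symbols) to its answer."""

    def __init__(self):
        self._pairs = {}

    def add_if_new(self, template, answer):
        """Store the question-answer pair only if the template is not already present."""
        key = tuple(template)
        if key in self._pairs:
            return False
        self._pairs[key] = answer
        return True

    def answer_for(self, template):
        """Return the stored answer, or None if the template is unknown."""
        return self._pairs.get(tuple(template))


library = TemplateLibrary()
template = ["disease", "question_feasible", "apply_v"]
first = library.add_if_new(template, "placeholder answer text")
second = library.add_if_new(template, "another answer")  # duplicate template: ignored
print(first, second, library.answer_for(template))
```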
Example 3
Fig. 3 is a schematic structural diagram of an automatic question template generation apparatus according to embodiment 3 of the present invention. As shown in fig. 3, an apparatus for automatically generating a question template according to an embodiment of the present invention includes:
a preparation module 1, used for preparing a question log corpus. Specifically, the preparation module 1 includes an obtaining module and a preprocessing module; the obtaining module is configured to obtain the question log corpus, and the preprocessing module is configured to preprocess the question log corpus, including punctuation removal, illegal symbol removal, and case conversion of words.
a word segmentation and part-of-speech tagging module 2, used for performing word segmentation and part-of-speech tagging. Specifically, different industry dictionaries can be created for different needs or industries. Particularly when a specific vertical industry is involved, the word segmentation and part-of-speech tagging module 2 segments the corresponding log corpus using a word segmentation method combined with the industry dictionary, which gives a good segmentation result.
a named entity recognition module 3, used for performing named entity recognition and replacement. Specifically, the named entity recognition module 3 is configured to: recognize the general entities, including times, numbers and/or place names, that appear in the question log corpus, and replace them with the corresponding entity labels.
a semantic replacement module 4, used for performing semantic replacement. The semantic replacement module 4 is configured to: look up, in a semantic network, the words obtained by segmenting the questions in the question log corpus; according to the paraphrases of the words, abstract words with the same or similar paraphrases into a unified label and replace them accordingly, generating, after semantic replacement, a symbol tag sequence composed of named entities and semantic concepts.
a frequent item set mining module 5, used for mining frequent item sets and generating question templates. Specifically, the frequent item set mining module 5 is configured to: screen frequent item sets from the symbol tag sequences using a preset association rule algorithm, according to a preset frequency threshold range and a preset item set length threshold range, and arrange the items in their default order to form a sequence, thereby generating a question template.
Preferably, the above automatic question template generation device further includes:
a sentence vector representation module 6, configured to represent the questions under the clustered question templates as sentence vectors using a preset sentence vector model; preferably, the sentence vector model is the deep learning encoder model Skip-thoughts;
a cluster compactness calculation module 7, configured to calculate the cluster compactness of the question template using the following formula:
[Formula image BDA0001525849240000141: definition of CP_j]
wherein, in the formula, CP_j is the calculated cluster compactness of the j-th question template, X_i is the sentence vector of the i-th question under the j-th question template, W_j is the mean of all sentence vectors in the cluster corresponding to the j-th question template, Ω_j is the sum of the moduli (norms) of all sentence vectors in the cluster corresponding to the j-th question template, and i and j are integers greater than or equal to 1;
a screening module 8, configured to screen for question templates whose cluster compactness is greater than a preset template cluster compactness threshold;
a determining and storing module 9, configured to look up the screened question templates in the template library and to store a screened question template in the template library if it does not already exist there.
Further preferably, the automatic question template generation device further includes:
an answer adding module 10, configured to add the answers corresponding to the screened question templates, form complete question-answer pairs with the screened question templates, and store the pairs in the template library.
It should be noted that, when the automatic question template generation device provided in the above embodiment performs an automatic question template generation service, the division into the functional modules described above is only an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the automatic question template generation device and the automatic question template generation method provided by the above embodiments belong to the same concept; the specific implementation process of the device is described in detail in the method embodiments and is not repeated here.
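As a rough illustration of how modules 1 to 4 hand data to one another, the sketch below chains four small functions. Everything in it is a stand-in: whitespace splitting replaces dictionary-based Chinese word segmentation (and part-of-speech tagging is omitted), a digit check replaces a named entity recogniser, and the `CONCEPTS` dictionary replaces a semantic network. The label `NUM` and the function names are assumptions; `disease` and `apply_v` follow the style of the example sequences in Embodiment 2.

```python
import re

# Hypothetical concept dictionary standing in for a semantic network lookup.
CONCEPTS = {"diabetes": "disease", "hyperthyroidism": "disease",
            "buy": "apply_v", "insure": "apply_v"}


def preprocess(text):
    """Preparation module: lower-case the text and replace punctuation/symbols with spaces."""
    return re.sub(r"[^\w\s]", " ", text.lower())


def segment(text):
    """Word segmentation stand-in: split on whitespace (no part-of-speech tags)."""
    return text.split()


def replace_entities(tokens):
    """Named entity stand-in: replace purely numeric tokens with the label NUM."""
    return ["NUM" if tok.isdigit() else tok for tok in tokens]


def replace_semantics(tokens):
    """Semantic replacement: map words with the same meaning to one concept label."""
    return [CONCEPTS.get(tok, tok) for tok in tokens]


def to_symbol_sequence(question):
    """Run the four stages in order and return the symbol tag sequence."""
    return replace_semantics(replace_entities(segment(preprocess(question))))


result = to_symbol_sequence("Can I BUY it with Diabetes for 3 years?")
print(result)
```

Keeping each stage as a separate function mirrors the module division described above, so any single stand-in can be swapped for a real component without touching the others.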
Compared with the prior art, the method and device for automatically generating question templates provided by the embodiments of the present invention have the following beneficial effects:
1. through the semantic replacement step, different words that share one meaning are abstracted and unified according to their paraphrases, which increases semantic generalization ability;
2. by mining frequent item sets, finding the frequent item sets among the candidate item sets, and generating question templates from them, the efficiency of question template generation is improved and a large amount of manual effort is saved;
3. item sets that meet the requirements are screened from the frequent item sets according to a preset sequence frequency threshold range and a preset sequence length threshold range to generate question templates, so that sentences with a similar structure and a common word sequence can be clustered and higher-quality question templates are obtained;
4. by representing questions as sentence vectors with a preset sentence vector model, calculating cluster compactness, and screening for question templates that meet the requirements, the generated templates can be evaluated on the semantic dimension, yielding high-quality question templates with higher accuracy;
5. the screened question templates are looked up in the template library and stored there if they do not already exist, so that the template library is effectively updated with the generated templates;
6. the answers corresponding to the screened question templates are added, forming complete question-answer pairs that are stored in the template library, which ensures the completeness of the question-answer pairs in the template library and provides answer matching for automatically generated question templates.
In summary, the method and device for automatically generating question templates provided by the embodiments of the present invention can generate question templates efficiently and automatically, evaluate the quality of the generated templates, expand or update the question template library autonomously and continuously, and improve the quality of the knowledge base of an intelligent question answering system. They can be widely applied in technical fields that require customer service, such as intelligent question answering.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (13)

1. An automatic question template generation method, characterized by comprising the following steps:
preparing a question log corpus;
performing word segmentation and part-of-speech tagging on the log corpus;
carrying out named entity identification and replacement;
performing semantic substitution to obtain a sequence of symbolic labels;
performing frequent item set mining to generate a question template, comprising: screening frequent item sets from the symbol tag sequence by using a preset association rule algorithm according to a preset frequency threshold range and a preset item set length threshold range, and forming a sequence according to a default sequence of items to generate a question template;
sentence vector representation is carried out on the questions of the screened question template by utilizing a preset sentence vector model;
calculating the clustering compactness of the question template by using the following calculation formula:
[Formula image FDA0003042193480000011: definition of CP_j]
screening question templates with clustering compactness larger than a compactness threshold according to a preset template clustering compactness threshold;
searching and comparing the screened question templates in a template library, and if the screened question templates do not exist in the template library, storing the screened question templates in the template library;
wherein, in the calculation formula, CP_j is the calculated cluster compactness of the j-th question template, X_i is the sentence vector of the i-th question under the j-th question template, W_j is the mean of all sentence vectors in the cluster corresponding to the j-th question template, Ω_j is the sum of the moduli (norms) of all sentence vectors in the cluster corresponding to the j-th question template, and i and j are integers greater than or equal to 1.
2. The method of claim 1, wherein preparing a question log corpus comprises:
obtaining a question log corpus and preprocessing the question log corpus, the preprocessing including punctuation removal, illegal symbol removal and case conversion of words.
3. The method of claim 1, wherein performing word segmentation and part-of-speech tagging on the log corpus comprises:
performing word segmentation on the log corpus using a word segmentation method combined with an industry dictionary.
4. The method of claim 1, wherein carrying out named entity identification and replacement comprises:
carrying out named entity identification on the general entities, including times, numbers and/or place names, appearing in the question log corpus, and replacing the general entities with corresponding entity labels.
5. The method of claim 1, wherein performing semantic substitution comprises:
looking up, in a semantic network, the words obtained by segmenting the questions in the question log corpus, abstracting and unifying words with the same or similar paraphrases into labels according to the paraphrases of the words, and performing the corresponding replacement, so as to generate, after semantic replacement, a symbol label sequence consisting of named entities and semantic concepts.
6. The method according to claim 1, wherein the preset sentence vector model is the deep learning encoder model Skip-thoughts.
7. The method of claim 1, further comprising:
adding answers corresponding to the screened question templates, forming complete question-answer pairs with the screened question templates, and storing the complete question-answer pairs in the template library.
8. An automatic question template generation device, comprising:
the preparation module is used for preparing a query sentence log corpus;
the word segmentation and part of speech tagging module is used for carrying out word segmentation and part of speech tagging;
the named entity recognition module is used for recognizing and replacing named entities;
the semantic replacing module is used for performing semantic replacement;
a frequent itemset mining module, configured to perform frequent itemset mining to generate a question template, where the frequent itemset mining module is configured to: screening frequent item sets from the symbol tag sequence by using a preset association rule algorithm according to a preset frequency threshold range and a preset item set length threshold range, and forming a sequence according to a default sequence of items to generate a question template;
the sentence vector representation module is used for carrying out sentence vector representation on the questions of the screened question template by utilizing a preset sentence vector model;
the cluster compactness calculation module is used for calculating the cluster compactness of the question template by using the following calculation formula:
[Formula image FDA0003042193480000031: definition of CP_j]
the screening module is used for screening question templates with clustering compactness larger than a compactness threshold according to a preset template clustering compactness threshold;
the confirming and storing module is used for searching and comparing the screened question templates in the template library, and storing the screened question templates in the template library if the screened question templates do not exist in the template library;
wherein, in the calculation formula, CP_j is the calculated cluster compactness of the j-th question template, X_i is the sentence vector of the i-th question under the j-th question template, W_j is the mean of all sentence vectors in the cluster corresponding to the j-th question template, Ω_j is the sum of the moduli (norms) of all sentence vectors in the cluster corresponding to the j-th question template, and i and j are integers greater than or equal to 1.
9. The apparatus according to claim 8, wherein the preparation module comprises an obtaining module and a preprocessing module, the obtaining module is configured to obtain the query sentence log corpus, and the preprocessing module is configured to perform preprocessing on the query sentence log corpus, including punctuation removal, illegal symbol removal, and word case conversion.
10. The apparatus of claim 8, wherein the named entity identification module is configured to: carry out named entity identification on the general entities, including times, numbers and/or place names, appearing in the question log corpus, and replace the general entities with corresponding entity labels.
11. The apparatus of claim 8, wherein the semantic replacement module is configured to: look up, in a semantic network, the words obtained by segmenting the questions in the question log corpus, abstract and unify words with the same or similar paraphrases into labels according to the paraphrases of the words, and perform the corresponding replacement, so as to generate, after semantic replacement, a symbol label sequence consisting of named entities and semantic concepts.
12. The apparatus of claim 8, wherein the preset sentence vector model is the deep learning encoder model Skip-thoughts.
13. The apparatus of claim 8, further comprising:
an answer adding module, used for adding answers corresponding to the screened question templates, forming complete question-answer pairs with the screened question templates, and storing the complete question-answer pairs in the template library.
CN201711436114.0A 2017-12-26 2017-12-26 Automatic question template generating method and device Active CN108038234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711436114.0A CN108038234B (en) 2017-12-26 2017-12-26 Automatic question template generating method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711436114.0A CN108038234B (en) 2017-12-26 2017-12-26 Automatic question template generating method and device

Publications (2)

Publication Number Publication Date
CN108038234A CN108038234A (en) 2018-05-15
CN108038234B true CN108038234B (en) 2021-06-15

Family

ID=62101304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711436114.0A Active CN108038234B (en) 2017-12-26 2017-12-26 Automatic question template generating method and device

Country Status (1)

Country Link
CN (1) CN108038234B (en)



Also Published As

Publication number Publication date
CN108038234A (en) 2018-05-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240306

Address after: Room 1179, W Zone, 11th Floor, Building 1, No. 158 Shuanglian Road, Qingpu District, Shanghai, 201702

Patentee after: Shanghai Zhongan Information Technology Service Co.,Ltd.

Country or region after: China

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Patentee before: ZHONGAN INFORMATION TECHNOLOGY SERVICE Co.,Ltd.

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20240415

Address after: Room 1179, W Zone, 11th Floor, Building 1, No. 158 Shuanglian Road, Qingpu District, Shanghai, 201702

Patentee after: Shanghai Zhongan Information Technology Service Co.,Ltd.

Country or region after: China

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Patentee before: ZHONGAN INFORMATION TECHNOLOGY SERVICE Co.,Ltd.

Country or region before: China