CN109408821B - Corpus generation method and device, computing equipment and storage medium - Google Patents

Info

Publication number
CN109408821B
Authority
CN
China
Prior art keywords
corpus
template
main
target field
query sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811232263.XA
Other languages
Chinese (zh)
Other versions
CN109408821A (en
Inventor
周辉阳
饶孟良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201811232263.XA
Publication of CN109408821A
Application granted
Publication of CN109408821B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Abstract

The invention relates to the technical field of natural language processing and discloses a corpus generation method, a corpus generation device, a computing device and a storage medium for improving corpus quality and thereby the recognition accuracy of a model. The method comprises the following steps: acquiring a query statement template of a target field, wherein the query statement template of each field comprises at least one general query sentence pattern with expansion samples that is set for the field, and each general query sentence pattern comprises a main word describing the field and a predicate describing an attribute of the main word; extracting main words belonging to the target field, and the predicates corresponding to those main words, from a first knowledge graph; and replacing the main word and predicate in at least one general query sentence pattern of the query statement template with the extracted main words and their corresponding predicates to generate the corpus of the target field.

Description

Corpus generation method and device, computing equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a corpus generating method, apparatus, computing device, and storage medium.
Background
In natural language processing, recognizing a user's intention accurately requires building a model for intention recognition, and the recognition capability of that model depends largely on the quality of the corpus used to train it. Corpus quality in a field is mainly reflected in two aspects. First, the richer (that is, the larger) the corpus, the better the model trained on it and the better it supports the massive volume of user queries. Second, the higher the corpus discrimination between different fields, that is, the clearer the boundaries, the better the classification performance (recognition accuracy) of the trained model and the more accurately it predicts the intention behind ambiguous user questions.
Therefore, how to improve corpus quality so as to improve the recognition accuracy of the model is a technical problem to be solved.
Disclosure of Invention
The embodiment of the invention provides a corpus generation method, a corpus generation device, a computing device and a storage medium, which are used for improving corpus quality and further improving identification accuracy of a model.
In one aspect, an embodiment of the present invention provides a corpus generating method, including:
acquiring query statement templates of a target field, wherein the query statement template of each field comprises at least one general query sentence pattern with an expansion sample, which is set for the field, and each general query sentence pattern comprises a main word describing the field and a predicate describing the attribute of the main word;
extracting main words belonging to the target field and predicates corresponding to the main words from the first knowledge graph;
and replacing at least one general query sentence pattern in the query sentence template by the extracted main words and the predicates corresponding to the main words to generate the corpus of the target field.
In the corpus generating method provided by the embodiment of the invention, the query statement template of the target field is acquired, and the main words of the target field and the predicates corresponding to those main words are extracted from a first knowledge graph containing massive prior knowledge of each field. Because the query statement template consists of general query sentence patterns with expansion samples, and because the main words and corresponding predicates extracted from the first knowledge graph are sufficient in quantity and high in field discrimination, replacing the main words and predicates in the general query sentence patterns with the extracted ones yields a corpus that is abundant in quantity and high in field discrimination. The quality of the field corpus is thereby improved, which in turn improves the recognition accuracy of a model trained on that corpus.
Meanwhile, because the main words belonging to the target field and their corresponding predicates are extracted from the first knowledge graph to replace those in each general query sentence pattern of the query statement template, the generated corpus has high field discrimination; when the generated corpus is put into a corpus library, less manual inspection is needed, which also reduces labor cost.
In another aspect, an embodiment of the present invention provides a model generation method for identifying a user intention, including:
obtaining the corpus generated by the corpus generation method provided by the embodiment of the invention, and using the obtained corpus for model training of user intention recognition to obtain a trained model.
In another aspect, an embodiment of the present invention provides a corpus generating device, including:
the query sentence pattern generating unit is used for generating a query sentence pattern with an expansion sample for each field, and the query sentence pattern comprises a main word for describing the field and a predicate for describing the attribute of the main word;
the extraction unit is used for extracting the main words belonging to the target field and predicates corresponding to the main words from the first knowledge graph;
and the replacing unit is used for replacing at least one general query sentence pattern in the query sentence template by adopting the extracted main words and the predicates corresponding to the main words to generate the corpus of the target field.
In another aspect, an embodiment of the present invention provides a model generation apparatus for identifying a user intention, including:
the obtaining unit is used for obtaining the corpus generated by the corpus generating method in the embodiment of the invention;
and the model training unit is used for performing model training of user intention recognition using the obtained corpus to obtain a trained model.
In another aspect, an embodiment of the present invention provides a computing device, including at least one processor and at least one memory, where the memory stores a computer program, and when the program is executed by the processor, the processor is caused to execute the steps of the corpus generating method in the embodiment of the present invention.
In another aspect, an embodiment of the present invention provides a storage medium, where the storage medium stores computer instructions, and when the computer instructions are executed on a computer, the computer is caused to execute the steps of the corpus generating method in the embodiment of the present invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention.
FIG. 1 is a schematic view of an application scenario provided in an embodiment of the present invention;
FIG. 2 is a flowchart of a corpus generating method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a first knowledge-graph structure provided by an embodiment of the present invention;
FIG. 4 is a flowchart of obtaining query sentence pattern templates according to an embodiment of the present invention;
FIG. 5 is a flowchart of another method for obtaining query pattern templates according to an embodiment of the present invention;
FIG. 6 is a flow chart of extracting a primary word from a first knowledge-graph according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating a corpus answer verification process in a first knowledge-graph according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a corpus generating device according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a computing device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the technical solutions of the present invention. All other embodiments obtained by a person skilled in the art without any inventive work based on the embodiments described in the present application are within the scope of the protection of the technical solution of the present invention.
Some concepts related to the embodiments of the present invention are described below.
Corpus: language material, which is the object of linguistic research and the basic unit constituting a corpus library. For example, a user's Query may be a search word or search sentence entered through a search engine; a Query expressed in natural language in this way can be called a corpus, and corpora are the basis for training a deep learning classification model.
Query: a request sent to a search engine or database to find a particular document, website, record, or series of records; it may also be colloquially understood as a user's query corpus.
Entity: refers to a basic unit representing a concept.
Template: a general schema with extended examples.
Knowledge graph: also called a scientific knowledge map, and known in library and information science as knowledge domain visualization or knowledge domain mapping. It is a series of graphs that display the development process and structural relationships of knowledge, using visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the interrelations among them.
BFQ: Binary Factoid Question, a binary fact-type question, such as a question about one attribute of an entity.
KB: Knowledge Base.
KBQA: Knowledge Base Question Answering. Given a natural language question, the question is semantically understood and parsed, and the knowledge base is then queried and reasoned over to obtain an answer.
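As a minimal sketch of the BFQ and KBQA concepts above, a binary fact-type question can be reduced to a lookup keyed by an entity and one of its attributes. The dictionary `KB`, the function name `answer_bfq` and the stored values are illustrative assumptions, and the semantic parsing that turns a natural language question into an (entity, attribute) pair is assumed to have already happened.

```python
# Hypothetical toy knowledge base: (entity, attribute) -> answer.
KB = {
    ("Zhang San", "height"): "174 cm",
    ("Zhang San", "nationality"): "China",
}


def answer_bfq(entity, attribute, kb=KB):
    """Return the stored answer for a binary fact-type question, or None if unknown."""
    return kb.get((entity, attribute))
```

A question whose entity or attribute is missing from the knowledge base simply returns `None` in this sketch; a real KBQA system would also need reasoning over related facts.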
In the prior art, the corpus of a field is mainly generated from user query logs. However, the number of query records in a query log is limited, and the records belonging to any one field are fewer still, so the corpus generated for a field from the query log is insufficient in quantity. Moreover, the query records in a query log span all fields and are not separated by field, so a corpus generated from user query logs also suffers from low domain discrimination.
Therefore, the embodiment of the invention provides a corpus generating method in which the query statement template of the target field is acquired, and the main words of the target field and the predicates corresponding to those main words are extracted from a first knowledge graph containing massive known knowledge of each field. Because the query statement template consists of general query sentence patterns with expansion samples, and because the main words and corresponding predicates extracted from the first knowledge graph are sufficient in quantity and high in field discrimination, replacing the main words and predicates in each general query sentence pattern of the query statement template with the extracted ones yields a corpus that is rich in quantity and high in domain discrimination. The corpus quality of the field is thereby improved, which in turn improves the recognition accuracy of a model trained on that corpus. Furthermore, because the domain discrimination of the generated corpus is high, less manual inspection is needed when the generated corpus is stored in the corpus library, which also reduces labor cost.
The corpus generating method in the embodiment of the present invention may be applied to an application scenario as shown in fig. 1, where the application scenario includes a corpus generating computing device, a model training computing device, and a user terminal 12. In the embodiment shown in fig. 1, the corpus generating computing device may be a corpus generating server 10, and the model training computing device may be a model training server 11. The corpus generating server 10 may be a server, or a server cluster or cloud computing center formed by a plurality of servers, the corpus generating server 10 communicates with the model training server 11 through a network, the model training server 11 communicates with the user terminal 12 through a network, and the model training server 11 is a server or a server cluster or cloud computing center formed by a plurality of servers. The user terminal 12 is an electronic device with network communication capability, which may be a smart phone, a tablet computer, a portable personal computer or other intelligent terminal, etc.
In this scenario, the corpus generating server 10 may generate the corpora of each field according to the corpus generating method provided in the embodiment of the present invention. The model training server 11 may obtain the generated corpora of each field from the corpus generating server 10 and use them for training, obtaining a model that identifies the user's intention in each field, so as to perform intention identification on user queries input through the user terminal 12.
It should be noted that the above-mentioned application scenarios are only presented to facilitate understanding of the spirit and principles of the present invention, and the present invention is not limited in this respect. Rather, embodiments of the present invention may be applied in any scenario where applicable.
The corpus generating method provided by the embodiment of the present invention is described below with reference to the application scenario shown in fig. 1.
As shown in fig. 2, a corpus generating method provided in an embodiment of the present invention includes:
Step 201: acquire a query statement template of the target field.
The query statement template of each field comprises at least one general query statement with an expansion sample, which is set for the field, and each general query statement comprises a main word describing the field and a predicate describing the attribute of the main word.
In the embodiment of the present invention, the target field may be any field, for example, the person knowledge question answering field, the sports knowledge question answering field, the financial knowledge question answering field, and the like in the knowledge question answering KBQA, and when the corpus of the target field needs to be generated, the corpus generating server may obtain the query sentence template of the target field.
In the embodiment of the present invention, the query statement template of the target field may be obtained in several ways. First, it may be obtained by splitting a regularization template of the target field, in which case the query statement template obtained by splitting includes at least one general query sentence pattern, and each general query sentence pattern includes a main word and a predicate corresponding to the main word. Second, it may be screened out of the corpus templates of the target field; likewise, the screened query statement template includes at least one general query sentence pattern, and each general query sentence pattern includes a main word and a predicate corresponding to the main word. Third, it may be obtained from the regularization template and the corpus templates of the target field at the same time.
The regularization template of a field is a highly extensible template belonging to that field. For example, when the target field is the person knowledge question and answer field, its regularization template is: ([search] | [hellpme] |) [person] (of | di | get |) [naturalattributes] (how much | how high | how many feet) ([yuqici] |). The general query sentence patterns in the query statement template obtained by splitting this regularization template may be: "find out how much the [naturalattributes] of [person] is", "I want to know how much the [naturalattributes] of [person] is", and so on (the detailed splitting process is described later).
A corpus template of a field is a template generated by labeling a corpus when that corpus is put into the corpus library of the field; in general, a corpus template is simpler than a regularization template of the same field. For example, when a corpus asking for the height of a named person is stored in the corpus library of the person knowledge question and answer field, the corpus may be template-labeled to generate the corpus template "how much the [naturalattributes] of [person] is", where [person] is a main word and [naturalattributes] is a predicate describing an attribute of [person]. A field generally includes a plurality of corpus templates, so templates that include both a main word and a predicate corresponding to the main word can be screened out of the corpus templates of the target field as the general query sentence patterns in the query statement template (the specific screening process is described in detail below).
Step 202: extract the main words belonging to the target field, and the predicates corresponding to those main words, from the first knowledge graph.
In the embodiment of the present invention, the first knowledge graph includes a large amount of existing knowledge in each field, the existing knowledge in each field mainly includes a main word in the field and a predicate corresponding to the main word, and may also include an object corresponding to the predicate, where the main word in the field may also be referred to as an entity of the field, the predicate is a word for describing an attribute of the main word, and the object is an answer corresponding to an attribute represented by the predicate.
For example, FIG. 3 is a schematic diagram of part of the first knowledge graph. The main word of the person knowledge question and answer field shown there is "Zhang San"; that is, "Zhang San" is a main word, or entity, of that field. "Height", "wife", "daughter", "nationality", "ethnicity" and "constellation" are predicates describing attributes of the main word "Zhang San", and "174 cm", "lililiri", "heading", "China", "Han nationality" and "O type" are the answers corresponding to the attributes represented by those predicates.
Therefore, in the embodiment of the invention, the main words belonging to the target field and the predicates corresponding to the main words can be extracted from the first knowledge graph, and because the existing knowledge is subjected to field classification in the first knowledge graph, the predicates corresponding to the main words and the main words of the target field extracted from the first knowledge graph are not only sufficient in quantity, but also large in field discrimination.
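The extraction in step 202 can be sketched as a filter over field-labeled triples. This is only an illustration under stated assumptions: the patent does not specify how the first knowledge graph is stored, so the tuple layout (field, main word, predicate, object), the name `extract_main_words` and the sample rows (including the made-up "sports" row) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical storage: (field, main word, predicate, object) tuples.
TRIPLES = [
    ("person", "Zhang San", "height", "174 cm"),
    ("person", "Zhang San", "nationality", "China"),
    ("sports", "Team A", "founded", "1990"),  # made-up row from another field
]


def extract_main_words(triples, target_field):
    """Map each main word of the target field to its predicates, in first-seen order."""
    result = defaultdict(list)
    for field, main_word, predicate, _obj in triples:
        # The field check comes first so that other fields never create a key.
        if field == target_field and predicate not in result[main_word]:
            result[main_word].append(predicate)
    return dict(result)
```

Because the rows are already labeled by field, the extracted main words and predicates never mix fields, which is the property the text relies on for high field discrimination.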
It should be noted that, in the embodiment of the present invention, step 202 may be executed after step 201, may be executed before step 201, and may also be executed simultaneously with step 201.
Step 203: replace the main word and predicate in at least one general query sentence pattern of the query statement template with the extracted main words and their corresponding predicates to generate the corpus of the target field.
In the embodiment of the present invention, after the query statement template of the target field is acquired according to step 201 and the main words of the target field and their corresponding predicates are extracted according to step 202, for each extracted main word, the main word and the predicate in each of the at least one general query sentence pattern may be replaced with that main word and its corresponding predicates, so as to generate a plurality of corpora of the target field.
For example, in the person knowledge question and answer field, suppose the query statement template includes the following general query sentence patterns: "find out how much the [naturalattributes] of [person] is" and "I want to know how much the [naturalattributes] of [person] is"; the main word extracted from the first knowledge graph is "Zhang San"; and the predicates extracted for "Zhang San" include "height" and "nationality". The main word [person] in each general query sentence pattern can then be replaced by "Zhang San", and the predicate [naturalattributes] by "height" and by "nationality", to generate the corpus of the person knowledge question and answer field. The generated corpus comprises: "find out how much the height of Zhang San is", "find out how much the nationality of Zhang San is", "I want to know how much the height of Zhang San is", and "I want to know how much the nationality of Zhang San is".
Therefore, with this method the query statement template of the target field is acquired, and the main words of the target field and their corresponding predicates are extracted from the first knowledge graph containing massive known knowledge of each field. Because the query statement template consists of general query sentence patterns with expansion samples, and because the extracted main words and predicates are sufficient in quantity and high in field discrimination, replacing the main words and predicates in the query statement template with the extracted ones yields a corpus that is rich in quantity and high in domain discrimination. The corpus quality of the field is thereby improved, which in turn improves the recognition accuracy of a model trained on that corpus.
Meanwhile, because the main words belonging to the target field and their corresponding predicates are extracted from the first knowledge graph to replace the main word and predicate of each general query sentence pattern in the query statement template, the generated corpus has high field discrimination; when the generated corpus is put into a corpus library, less manual inspection is needed, which also reduces labor cost.
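The replacement in step 203 can be sketched as plain string substitution over every combination of pattern, main word and predicate. The function name `generate_corpus`, the slot spellings `[person]` and `[naturalattributes]`, and the English wording of the patterns are assumptions made for illustration.

```python
def generate_corpus(patterns, main_word_predicates,
                    main_slot="[person]", predicate_slot="[naturalattributes]"):
    """Fill every pattern with every (main word, predicate) pair."""
    corpus = []
    for pattern in patterns:
        for main_word, predicates in main_word_predicates.items():
            for predicate in predicates:
                sentence = pattern.replace(main_slot, main_word)
                corpus.append(sentence.replace(predicate_slot, predicate))
    return corpus


PATTERNS = [
    "find out how much the [naturalattributes] of [person] is",
    "I want to know how much the [naturalattributes] of [person] is",
]
EXTRACTED = {"Zhang San": ["height", "nationality"]}
CORPUS = generate_corpus(PATTERNS, EXTRACTED)
```

With two patterns, one main word and two predicates, `CORPUS` holds four sentences; the corpus size grows as the product of the three counts, which is why a large knowledge graph yields a large corpus.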
In an optional manner, in the embodiment of the present invention, when the query statement template in the target field is split according to the regularization template in the target field, step 201 may be performed according to a flow shown in fig. 4.
The process shown in fig. 4 includes:
Step 401: acquire a regularization template of the target field;
Step 402: for any part of speech in the regularization template other than the main word and the predicate, select one word from the plurality of words represented by that part of speech and substitute it for the part of speech, obtaining a replaced regularization template;
Step 403: split the replaced regularization template and obtain at least one general query sentence pattern from the splitting result, wherein each of the at least one general query sentence pattern comprises a main word and a predicate corresponding to the main word.
Wherein each of the obtained at least one general query sentence pattern forms one general query sentence pattern in the query sentence template.
In the flow shown in fig. 4, when a corpus of a target field needs to be generated, a corpus generation server may obtain a regularization template of the target field, where, in general, the regularization template of a field includes multiple parts of speech including a main word and a predicate corresponding to the main word, and each part of speech may represent multiple specific words.
For example, when the target field is the person knowledge question and answer field, the acquired regularization template is: ([search] | [hellpme] |) [person] (of | di | get |) [naturalattributes] (how much | how high | how many feet) ([yuqici] |). Here "[ ]" denotes a part of speech, such as a main word, a predicate, an interjection or an adverb; "( )" encloses a group of candidate words or parts of speech that may appear at that position; and "|" separates the candidates within a group, one of which is selected.
In this regularization template, [person] is the main word and represents a person's name; for example, the words represented by [person] may be Zhang San, Li Si, Wang Mazi …. [naturalattributes] is the predicate describing attributes of the main word [person], that is, the general natural attributes of a person; for example, the words represented by [naturalattributes] may be height, sex, son, wife …. [search], [hellpme] and [yuqici] are parts of speech other than the main word and the predicate. The words represented by [search] may be "find out", "find down", "search for once", "see down" …; the words represented by [hellpme] may be "I want to know", "please tell me", "can you tell me" …; and the words represented by [yuqici] may be "ah", "calash", "woollen" …. The group (of | di | get |) ends with an empty candidate, so one of its three words may appear or none at all, whereas the group (how much | how high | how many feet) has no empty candidate, so exactly one of its words must appear.
In the embodiment of the present invention, because the main word in the regularization template and the predicate corresponding to the main word are the key for generating the corpus, for convenience of subsequent processing, the obtained regularization template can be simplified to become a binary template containing the predicate corresponding to the main word and the main word, that is, any part of speech in the obtained regularization template except the main word and the predicate is to be replaced by one word from a plurality of words represented by the any part of speech, so as to obtain the regularization template after replacement.
The target field is continuously taken as the character knowledge question-answering field, and the regularization template of the obtained character knowledge question-answering field is as follows: for example, ([ yuqici ] |) in other word classes except for the main word and the predicate in the regularization template, a word may be randomly selected from a plurality of words represented by any word class to replace the word class, or a word with the highest heat value may be selected from the plurality of words represented by any word class to replace the word class.
For example, the part-of-speech search in the regularization template may be replaced by a word "find one" (here, it is assumed that the hot value of "find one" in the plurality of words represented by the word is the highest), and a word may be randomly selected from the plurality of words represented by the search for replacement; similarly, for the part-of-speech hellpme in the regularization template, the word "i wants to know" with the highest heat value in the plurality of words represented by the word "hellpme" (it is assumed here that the heat value of "i wants to know" in the plurality of words represented by the word "hellpme" is the highest), and a word can be randomly selected from the plurality of words represented by the hellpme for replacement; similarly, for the part of speech yuqici in the regularization template, the word "o" with the highest heat value in the plurality of words represented by the word "yuqici" may be selected (it is assumed here that the heat value of "o" in the plurality of words represented by the word "o" is the highest), and one word may be randomly selected from the plurality of words represented by yuqici for replacement.
Assuming that, in each case, the word with the highest heat value among the words represented by search, hellpme and yuqici is selected to replace search, hellpme and yuqici respectively, the regularization template after replacement is: (find | I want to know | di | of | person) of [person] [naturalattributes] (how much | is how high | to how many | is how many | to how many | feet) (a |).
The replaced regularization template is then split according to the regularization rule into a series of simple binary templates, such as "find out how many [naturalattributes] of [person]", "I want to know how many a of [naturalattributes] of [person]" and "[naturalattributes] of [person]". From the plurality of binary templates obtained by splitting, that is, from the splitting result, the binary templates that include a main word and the predicate corresponding to the main word are selected as general query sentence patterns, so that at least one general query sentence pattern is obtained. Each general query sentence pattern obtained in this way forms one general query sentence pattern in the query statement template.
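The splitting step described above can be sketched as follows. This is only an illustrative sketch and not the implementation of the embodiment: the template syntax (parenthesised groups of alternatives separated by "|", with an empty alternative meaning the group is optional), the function names split_template and general_query_patterns, the slot spelling [naturalattributes] and the sample template string are all assumptions made for the example.

```python
import itertools
import re

# Assumed syntax: "(a|b|)" is a group of alternatives (an empty alternative
# makes the group optional); "[slot]" marks the main word or the predicate.
GROUP = re.compile(r"\(([^()]*)\)")


def split_template(template):
    """Expand every group of alternatives and return the simple templates."""
    parts = []  # one list of alternatives per segment of the template
    pos = 0
    for match in GROUP.finditer(template):
        parts.append([template[pos:match.start()]])  # literal text before the group
        parts.append(match.group(1).split("|"))      # alternatives inside the group
        pos = match.end()
    parts.append([template[pos:]])                   # trailing literal text
    expanded = ("".join(combo) for combo in itertools.product(*parts))
    # Collapse the extra whitespace left behind by empty alternatives.
    return [" ".join(text.split()) for text in expanded]


def general_query_patterns(template, main_slot="[person]",
                           predicate_slot="[naturalattributes]"):
    """Keep only the split results that contain both the main word and the predicate."""
    return [t for t in split_template(template)
            if main_slot in t and predicate_slot in t]


# Hypothetical replaced regularization template, simplified for illustration.
template = "(find|I want to know|) [naturalattributes] of [person] (is how much|) (o|)"
patterns = general_query_patterns(template)
```

With three, two and two alternatives in the groups, the sketch yields twelve simple templates, ranging from "find [naturalattributes] of [person] is how much o" down to the bare "[naturalattributes] of [person]".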
Optionally, when the query statement template of the target field is screened out from the corpus templates of the target field in the embodiment of the present invention, step 201 may be performed according to the flow shown in fig. 5.
The flow shown in fig. 5 includes:
Step 501: obtaining a corpus template of the target field;
Step 502: screening out, from the corpus templates, a query statement template that includes the main word and the predicate corresponding to the main word.
In the flow shown in fig. 5, when a corpus of the target field needs to be generated, the corpus generation server may obtain the corpus templates of the target field. A corpus template of the target field is a template generated by template-labeling a corpus when that corpus is put into the corpus base of the target field. For example, when the target field is the person knowledge question-answering field and a corpus asking how much the height of a particular person is gets stored in the corpus base of that field, the person named in the corpus is a person and "height" is a naturalattributes, so the corpus can be template-labeled to generate the corpus template "how much is the [naturalattributes] of [person]".
Generally, one field includes a plurality of corpus templates, so a plurality of corpus templates of the target field are obtained. At least one corpus template that includes a main word and the predicate corresponding to the main word can be screened out from the obtained corpus templates. Each corpus template screened out in this way can be used as a general query sentence pattern and forms one general query sentence pattern in the query statement template, so that the query statement template of the target field is screened out from the corpus templates of the target field.
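The template labeling and screening of fig. 5 can be illustrated with a minimal sketch. The dictionary of slot words, the function names label_corpus and screen_templates, and the sample corpora are hypothetical and only serve the example; a real corpus base would need proper entity recognition rather than plain string replacement.

```python
def label_corpus(corpus, slot_words):
    """Replace every known word with its slot name, e.g. "height" -> "[naturalattributes]"."""
    template = corpus
    for slot, words in slot_words.items():
        # Longest words first, so that a longer word is not partially replaced.
        for word in sorted(words, key=len, reverse=True):
            template = template.replace(word, f"[{slot}]")
    return template


def screen_templates(templates, main_slot="person", predicate_slot="naturalattributes"):
    """Keep the corpus templates that contain both the main word and the predicate."""
    required = (f"[{main_slot}]", f"[{predicate_slot}]")
    return [t for t in templates if all(slot in t for slot in required)]


# Hypothetical slot vocabulary and stored corpora of the target field.
slot_words = {"person": {"Zhang San"}, "naturalattributes": {"height", "blood type"}}
corpora = [
    "what is the height of Zhang San",
    "who is Zhang San",
    "what is the blood type of Zhang San",
]
templates = sorted({label_corpus(c, slot_words) for c in corpora})
query_templates = screen_templates(templates)
```

In this sketch the two attribute questions collapse into the single template "what is the [naturalattributes] of [person]", which is kept, while "who is [person]" is discarded because it has no predicate.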
Optionally, in the embodiment of the present invention, the query statement template of the target field may be obtained from both the regularization template of the target field and the corpus templates of the target field. In that case, the obtained regularization template of the target field is processed according to the flow shown in fig. 4: the parts of speech other than the main word and the predicate are replaced, the replaced regularization template is split, and a plurality of binary templates that include the main word and the predicate corresponding to the main word are obtained from the splitting result. The corpus templates of the target field are also obtained according to the flow shown in fig. 5, and at least one corpus template that includes the main word and the predicate corresponding to the main word is screened out from them. The plurality of binary templates and the at least one screened corpus template are then used together as the general query sentence patterns in the query statement template.
Optionally, step 202 in the embodiment of the present invention may be executed according to the flow shown in fig. 6, which includes:
Step 601: determining the type identification (ID) of the main words belonging to the target field in the first knowledge graph;
Step 602: extracting, from the first knowledge graph according to the determined ID, a preset number of main words with the highest heat values and the predicates corresponding to those main words.
In the embodiment of the present invention, each field in the first knowledge graph may include hundreds of thousands or even millions of main words. To distinguish the main words of different fields, field type labeling may be performed on the main words in the first knowledge graph, that is, a field type identifier (hereinafter abbreviated as ID) is set for each main word in the first knowledge graph. For example, the IDs of all main words in the first knowledge graph belonging to the person knowledge question-answering field are set to 15, the IDs of all main words belonging to the physical knowledge question-answering field are set to 20, and the IDs of all main words belonging to the financial knowledge question-answering field are set to 33.
Therefore, the type identification ID of the main word belonging to the target field in the first knowledge graph can be determined, so that the main word belonging to the target field and the predicate corresponding to the main word can be accurately extracted from the first knowledge graph, and then the corpus with high field discrimination can be generated, and the recognition accuracy of the model trained based on the corpus can be improved.
In the embodiment of the present invention, it is further considered that each field in the first knowledge graph includes a large number of main words, that the main words people frequently query are generally those with a high heat value, and that main words with a low heat value are rarely queried. To increase the speed of corpus generation, a suitable number of main words to extract, that is, the preset number, may be set in advance, for example 1000 or 10000 main words, so that only that number of main words with the highest heat values are extracted.
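Steps 601 and 602 can be sketched as a filter on the field ID followed by a top-N selection on the heat value. The MainWord record, the in-memory list standing in for the first knowledge graph, the heat values and the function name extract_main_words are assumptions for illustration; the field IDs 15 and 20 follow the example above.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class MainWord:
    name: str
    domain_id: int          # field type identifier (ID) of the main word
    heat: float             # heat value; higher means queried more often
    predicates: list = field(default_factory=list)


def extract_main_words(graph, domain_id, preset_number):
    """Return the preset number of main words of one field with the highest heat values."""
    in_domain = (w for w in graph if w.domain_id == domain_id)
    return heapq.nlargest(preset_number, in_domain, key=lambda w: w.heat)


# Hypothetical excerpt of a first knowledge graph.
graph = [
    MainWord("Zhang San", 15, 9.5, ["height", "blood type"]),
    MainWord("Li Si", 15, 3.2, ["height"]),
    MainWord("Wang Wu", 15, 7.1, ["weight"]),
    MainWord("gravity", 20, 8.8, ["definition"]),
]
top = extract_main_words(graph, domain_id=15, preset_number=2)
```

Here heapq.nlargest avoids sorting the whole field when the preset number is much smaller than the number of main words, which matches the stated aim of speeding up corpus generation.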
Optionally, in the embodiment of the present invention, after the corpus of the target field is generated, the flow shown in fig. 7 may further be executed, which includes:
Step 701: obtaining answers corresponding to the generated corpora from a second knowledge graph, wherein the second knowledge graph is different from the first knowledge graph;
Step 702: determining whether an answer corresponding to the generated corpus exists in the first knowledge graph; if so, executing step 703, otherwise executing step 704;
Step 703: verifying the answer corresponding to the generated corpus in the first knowledge graph by using the answer obtained from the second knowledge graph;
Step 704: storing the answer obtained from the second knowledge graph in the first knowledge graph.
In the embodiment of the present invention, in order to verify whether an answer corresponding to the generated corpus exists in the first knowledge graph and, when such an answer exists, whether it is accurate, the answer corresponding to the generated corpus may be acquired from the second knowledge graph through an interface communicating with the second knowledge graph. There may be one or more second knowledge graphs.
In the embodiment of the present invention, if it is determined that an answer corresponding to the generated corpus exists in the first knowledge graph, the accuracy of that answer may be verified by using the answer corresponding to the generated corpus obtained from the second knowledge graph. For example, suppose "what is Zhang San's blood type" is a corpus generated in the embodiment of the present invention, and the first knowledge graph contains an answer corresponding to this corpus, namely type O. If the answer obtained from the second knowledge graph is type A, the answer in the first knowledge graph may be changed from type O to type A. If the answer obtained from the second knowledge graph is also type O, the answer in the first knowledge graph is considered correct and no processing is needed.
If it is determined that no answer corresponding to the generated corpus exists in the first knowledge graph, the answer obtained from the second knowledge graph may be stored in the first knowledge graph as the answer corresponding to that corpus. For example, suppose "what is Zhang San's blood type" is a corpus generated in the embodiment of the present invention and the answer obtained from the second knowledge graph is type O. If no answer corresponding to the corpus can be obtained from the first knowledge graph, the answer type O obtained from the second knowledge graph may be added to the first knowledge graph, thereby completing the answers to the generated corpora in the first knowledge graph.
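The verification and backfilling of steps 701 to 704 can be sketched with two dictionaries standing in for the knowledge graphs. The function name reconcile_answer, the returned status strings and the handling of a corpus for which the second knowledge graph has no answer are assumptions of this sketch; the embodiment does not specify them.

```python
def reconcile_answer(question, first_graph, second_graph):
    """Verify or backfill the first graph's answer using the second graph."""
    reference = second_graph.get(question)        # step 701
    if reference is None:
        # Assumption: without a reference answer nothing can be verified.
        return "no_reference"
    if question not in first_graph:               # step 702, "no" branch
        first_graph[question] = reference         # step 704: store the missing answer
        return "stored"
    if first_graph[question] != reference:
        first_graph[question] = reference         # step 703: answers differ, correct it
        return "corrected"
    return "confirmed"                            # step 703: answers agree


# Hypothetical graph contents, following the blood type example above.
first_graph = {"what is Zhang San's blood type": "O"}
second_graph = {
    "what is Zhang San's blood type": "A",
    "what is Li Si's blood type": "O",
}
results = [reconcile_answer(q, first_graph, second_graph) for q in second_graph]
```

The sketch follows the example in treating the second knowledge graph as authoritative when the two answers differ. With several second knowledge graphs, a voting rule would be one possible extension.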
Optionally, in the embodiment of the present invention, after the corpus of the target field is generated, the generated corpus may be added to the corpus base of the target field, and the corpus base to which the generated corpus has been added is used to train a user intention recognition model of the target field, for example a deep classification model, so as to generate a new model.
In practical application, the corpus generating method in the embodiment of the present invention may be applied to corpus mining scenarios in various fields, to improve the quality of the models trained on the mined corpora in those fields. The method may be implemented in a programming language such as C, C++ or Java.
Based on the same inventive concept, an embodiment of the present invention provides a corpus generating device. For the specific implementation of the corpus generating method of the device, reference may be made to the description of the foregoing method embodiment, and repeated details are not described again. As shown in fig. 8, the device includes:
an obtaining unit 80, configured to obtain a query statement template of a target field, where the query statement template of each field includes at least one general query sentence pattern with an expansion sample set for the field, and each general query sentence pattern includes a main word describing the field and a predicate describing an attribute of the main word;
an extracting unit 81, configured to extract main words belonging to the target field and the predicates corresponding to the main words from the first knowledge graph;
and a replacing unit 82, configured to replace at least one general query sentence pattern in the query statement template with the extracted main words and the predicates corresponding to the main words, so as to generate a corpus of the target field.
Optionally, the obtaining unit 80 is further configured to:
obtaining the query statement template of the target field by splitting the regularization template of the target field; and/or
screening out the query statement template of the target field from the corpus templates of the target field.
Optionally, the obtaining unit 80 is further configured to:
acquiring a regularization template of a target field;
for any part of speech in the regularization template other than the main word and the predicate, selecting a word from the plurality of words represented by that part of speech to replace it, so as to obtain the regularization template after replacement;
splitting the replaced regularization template, and acquiring at least one general query sentence pattern from a split result, wherein the at least one general query sentence pattern comprises main words and predicates corresponding to the main words; wherein each of the obtained at least one general query sentence pattern forms one general query sentence pattern in the query sentence template.
Optionally, the obtaining unit 80 is further configured to:
obtaining a corpus template of the target field;
and screening out a query statement template comprising the main words and predicates corresponding to the main words from the corpus templates.
Optionally, the extracting unit 81 is further configured to:
determining type Identification (ID) of a main word belonging to the target field in the first knowledge graph;
and extracting, from the first knowledge graph according to the ID, a preset number of main words with the highest heat values and the predicates corresponding to those main words.
Optionally, the replacing unit 82 is further configured to:
for any main word among the extracted preset number of main words, replacing the main word and the predicate in each of the at least one general query sentence pattern with that main word and the predicate corresponding to it, so as to generate the corpus of the target field.
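The replacement performed by the replacing unit can be sketched as a nested loop over main words, their predicates and the general query sentence patterns. The function name generate_corpus, the slot markers and the sample data are hypothetical and chosen only for illustration.

```python
def generate_corpus(patterns, main_words,
                    main_slot="[person]", predicate_slot="[naturalattributes]"):
    """Fill every general query sentence pattern with every (main word, predicate) pair."""
    corpus = []
    for name, predicates in main_words.items():
        for predicate in predicates:
            for pattern in patterns:
                sentence = pattern.replace(main_slot, name)
                sentence = sentence.replace(predicate_slot, predicate)
                corpus.append(sentence)
    return corpus


# Hypothetical general query sentence patterns and extracted main words.
patterns = [
    "what is the [naturalattributes] of [person]",
    "[naturalattributes] of [person]",
]
main_words = {"Zhang San": ["height", "blood type"]}
corpus = generate_corpus(patterns, main_words)
```

The number of generated corpora is the number of (main word, predicate) pairs multiplied by the number of patterns, which is four in this example.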
Optionally, the obtaining unit 80 is further configured to:
obtaining answers corresponding to the generated corpora from a second knowledge graph, wherein the second knowledge graph is different from the first knowledge graph;
if the answer corresponding to the generated corpus exists in the first knowledge graph, checking the answer corresponding to the generated corpus in the first knowledge graph by adopting the answer obtained from the second knowledge graph;
and if the answer corresponding to the generated corpus does not exist in the first knowledge graph, storing the answer acquired from the second knowledge graph in the first knowledge graph.
Based on the same inventive concept, the embodiment of the present invention provides a model generation method for recognizing a user intention. For the specific implementation of the method, reference may be made to the description, in the above method embodiment, of model training based on the generated corpus, and repeated details are not described again. The method includes:
obtaining the corpus generated by the corpus generating method provided by the embodiment of the present invention, and performing model training for user intention recognition using the obtained corpus, so as to obtain the trained model.
Based on the same inventive concept, an embodiment of the present invention provides a model generation apparatus for recognizing a user intention, including:
an obtaining unit, configured to obtain the corpus generated by the corpus generating method provided by the embodiment of the present invention;
and a model training unit, configured to perform model training for user intention recognition using the obtained corpus, so as to obtain the trained model.
Based on the same inventive concept, the embodiment of the present invention provides a computing device, as shown in fig. 9, including at least one processor 90 and at least one memory 91, where the memory 91 stores a computer program, and when the program is executed by the processor 90, the processor 90 executes the steps of the corpus generating method provided in the embodiment of the present invention.
Based on the same inventive concept, an embodiment of the present invention provides a storage medium, where the storage medium stores computer instructions, and when the computer instructions are run on a computer, the computer is caused to execute the steps of the corpus generating method according to the embodiment of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (15)

1. A corpus generating method, comprising:
acquiring query statement templates of a target field, wherein the query statement template of each field comprises at least one general query sentence pattern with an expansion sample, which is set for the field, and each general query sentence pattern comprises a main word describing the field and a predicate describing the attribute of the main word; the obtaining of the query statement template in the target field specifically includes: splitting a query statement template of the target field according to the regularization template of the target field, wherein the regularization template is obtained by selecting a word from a plurality of words represented by any part of speech except a main word and a predicate in the regularization template to replace the part of speech;
extracting main words belonging to the target field and predicates corresponding to the main words from the first knowledge graph;
and replacing at least one general query sentence pattern in the query sentence template by the extracted main words and the predicates corresponding to the main words to generate the corpus of the target field.
2. The method of claim 1, wherein the obtaining of the query statement template of the target field further comprises:
and screening out the query statement template of the target field from the corpus templates of the target field.
3. The method according to claim 1, wherein the splitting the query statement template of the target domain according to the regularization template of the target domain specifically includes:
acquiring a regularization template of the target field;
for any part of speech in the regularization template other than the main word and the predicate, selecting a word from the plurality of words represented by that part of speech to replace it, so as to obtain the regularization template after replacement;
splitting the replaced regularization template, and acquiring at least one general query sentence pattern from a split result, wherein the at least one general query sentence pattern comprises main words and predicates corresponding to the main words;
wherein each of the obtained at least one general query sentence pattern forms one general query sentence pattern in the query sentence template.
4. The method according to claim 2, wherein the step of screening the query sentence template of the target field from the corpus templates of the target field specifically comprises:
obtaining a corpus template of the target field;
and screening out a query statement template comprising the main words and predicates corresponding to the main words from the corpus templates.
5. The method according to any one of claims 1 to 4, wherein the extracting, from the first knowledge-graph, the main word belonging to the target domain and the predicate corresponding to the main word specifically comprises:
determining type Identification (ID) of a main word belonging to the target field in the first knowledge graph;
and extracting, from the first knowledge graph according to the ID, a preset number of main words with the highest heat values and the predicates corresponding to those main words.
6. The method according to claim 5, wherein the generating the corpus of the target domain by replacing at least one general query sentence pattern in the query sentence template with the extracted main word and the predicate corresponding to the main word comprises:
for any main word among the extracted preset number of main words, replacing the main word and the predicate in each of the at least one general query sentence pattern with that main word and the predicate corresponding to it, so as to generate the corpus of the target field.
7. The method of claim 1, wherein after generating the corpus of the target domain, the method further comprises:
obtaining answers corresponding to the generated corpora from a second knowledge graph, wherein the second knowledge graph is different from the first knowledge graph;
if the answer corresponding to the generated corpus exists in the first knowledge graph, checking the answer corresponding to the generated corpus in the first knowledge graph by adopting the answer obtained from the second knowledge graph;
and if the answer corresponding to the generated corpus does not exist in the first knowledge graph, storing the answer obtained from the second knowledge graph in the first knowledge graph.
8. A model generation method for recognizing a user's intention, comprising:
obtaining the corpus generated by the method according to any one of claims 1 to 7, and performing model training for user intention recognition using the obtained corpus, so as to obtain the trained model.
9. A corpus generating device, comprising:
an obtaining unit, configured to obtain query statement templates of a target field, wherein the query statement template of each field comprises at least one general query sentence pattern with an expansion sample, which is set for the field, and each general query sentence pattern comprises a main word describing the field and a predicate describing the attribute of the main word; the obtaining of the query statement template of the target field specifically includes: splitting a query statement template of the target field according to the regularization template of the target field, wherein the regularization template is obtained by selecting a word from a plurality of words represented by any part of speech except a main word and a predicate in the regularization template to replace the part of speech;
the extraction unit is used for extracting the main words belonging to the target field and predicates corresponding to the main words from the first knowledge graph;
and the replacing unit is used for replacing at least one general query sentence pattern in the query sentence template by adopting the extracted main words and the predicates corresponding to the main words to generate the corpus of the target field.
10. The apparatus of claim 9, wherein the obtaining unit is further configured to:
and screening out the query statement template of the target field from the corpus templates of the target field.
11. The apparatus of claim 9, wherein the obtaining unit is further configured to:
acquiring a regularization template of the target field;
for any part of speech in the regularization template other than the main word and the predicate, selecting a word from the plurality of words represented by that part of speech to replace it, so as to obtain the regularization template after replacement;
splitting the replaced regularization template, and acquiring at least one general query sentence pattern from a split result, wherein the at least one general query sentence pattern comprises main words and predicates corresponding to the main words;
wherein each of the obtained at least one general query sentence pattern forms one general query sentence pattern in the query sentence template.
12. The apparatus of claim 10, wherein the obtaining unit is further configured to:
obtaining a corpus template of the target field;
and screening out a query statement template comprising the main words and predicates corresponding to the main words from the corpus templates.
13. A model generation apparatus that recognizes a user's intention, comprising:
an obtaining unit, configured to obtain the corpus generated by the method according to any one of claims 1 to 7;
and a model training unit, configured to perform model training for user intention recognition using the obtained corpus, so as to obtain the trained model.
14. A computing device comprising at least one processor and at least one memory, wherein the memory stores a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 7.
15. A storage medium storing computer instructions which, when executed on a computer, cause the computer to perform the steps of the method according to any one of claims 1 to 7.
CN201811232263.XA 2018-10-22 2018-10-22 Corpus generation method and device, computing equipment and storage medium Active CN109408821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811232263.XA CN109408821B (en) 2018-10-22 2018-10-22 Corpus generation method and device, computing equipment and storage medium


Publications (2)

Publication Number Publication Date
CN109408821A CN109408821A (en) 2019-03-01
CN109408821B true CN109408821B (en) 2020-09-04

Family

ID=65468810


Country Status (1)

Country Link
CN (1) CN109408821B (en)



Also Published As

Publication number Publication date
CN109408821A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109408821B (en) Corpus generation method and device, computing equipment and storage medium
CN110837550B (en) Knowledge graph-based question answering method and device, electronic equipment and storage medium
JP6634515B2 (en) Question clustering processing method and apparatus in automatic question answering system
CN106874279B (en) Method and device for generating application category label
US9239875B2 (en) Method for disambiguated features in unstructured text
CN108280114B (en) Deep learning-based user literature reading interest analysis method
US20210097089A1 (en) Knowledge graph building method, electronic apparatus and non-transitory computer readable storage medium
KR20190142287A (en) Method for recommending related problem based on meta data
CN108345686B (en) Data analysis method and system based on search engine technology
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN109783631B (en) Community question-answer data verification method and device, computer equipment and storage medium
CN108305180B (en) Friend recommendation method and device
CN104915426B (en) Information sorting method, and method and device for generating an information sorting model
CN109508458B (en) Legal entity identification method and device
CN109977291B (en) Retrieval method, device and equipment based on physical knowledge graph and storage medium
CN109522397B (en) Information processing method and device
CN114647713A (en) Knowledge graph question-answering method, device and storage medium based on virtual confrontation
CN106407316B (en) Software question and answer recommendation method and device based on topic model
CN109697676B (en) User analysis and application method and device based on social group
CN110968664A (en) Document retrieval method, device, equipment and medium
CN116049376B (en) Method, device and system for retrieving and replying information and creating knowledge
CN110209780A (en) Question template generation method, device, server and storage medium
CN111597336A (en) Processing method and device of training text, electronic equipment and readable storage medium
CN110929526A (en) Sample generation method and device and electronic equipment
CN112949305B (en) Negative feedback information acquisition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant