CN102368260A - Method and device of producing domain required template - Google Patents

Info

Publication number
CN102368260A
Authority
CN
China
Prior art keywords
template
candidate
segment
query
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103088307A
Other languages
Chinese (zh)
Other versions
CN102368260B (en)
Inventor
柴春光
黄际洲
时迎超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110308830.7A priority Critical patent/CN102368260B/en
Priority claimed from CN201110308830.7A external-priority patent/CN102368260B/en
Publication of CN102368260A publication Critical patent/CN102368260A/en
Application granted granted Critical
Publication of CN102368260B publication Critical patent/CN102368260B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device of producing a domain required template, wherein the method comprises the following steps of: A, obtaining candidate required templates of a special domain; B, extracting the characteristics of the candidate required templates; C, sorting the candidate required templates according to the extracted characteristics; and D, selecting the final required template as the template required in the special domain from the candidate required templates. With above mode, a universal method for producing the high-quality domain required template is realized, which provides a guarantee for a search engine to understand the purpose of acts of users.

Description

Method and device for generating domain demand template
[ technical field ]
The invention relates to a natural language processing technology, in particular to a method and a device for generating a domain requirement template.
[ background of the invention ]
Search engines make it much easier for people to find the information they need. Conventionally, a search engine looks up the index entries containing the user's search keywords and returns the pages matching those keywords. For example, if the user's search request (query) is "Beijing automobile 4S store recruitment sales leader", the search engine returns a result page of a recruitment website; the user clicks through to that website, fills in the relevant information there, and searches again within the site to obtain the information actually needed. If the search engine could better understand the real purpose behind a search, it could return information that meets the user's needs more accurately. Natural language processing is therefore very important for search engines. In natural language processing, domain-based requirement templates can be used to identify the purpose of a user's search. For example, if the user's query is "how to go from temple to western bill" and it matches a requirement template of the traffic field, the user evidently has a need in the traffic field, so an application related to the traffic field can be returned to the user directly. Whether high-quality domain requirement templates can be generated is therefore very important for a search engine to understand the user's search intention correctly.
In the past, domain requirement templates were generally generated with a different mining method for each application, which wastes a large amount of manpower and material resources; such methods also adapt poorly and are difficult to change as the applications change.
[ summary of the invention ]
The technical problem to be solved by the invention is to provide a method and a device for generating a domain demand template, so as to overcome the poor adaptability of prior-art approaches to generating domain demand templates.
The technical scheme adopted by the invention to solve this technical problem is a method for generating a domain demand template, comprising the following steps: A. acquiring candidate demand templates of a specific field; B. extracting features of the candidate demand templates, wherein the features comprise at least one of: a similarity feature representing the closeness between a candidate demand template and the specific field, a generalization capability feature representing the ability of a candidate demand template to cover users' search request queries, and a boundary word feature representing the influence of the non-generalized words in a candidate demand template on its correctness; C. sorting the candidate demand templates using the extracted features; D. selecting final demand templates from the candidate demand templates according to the sorting result, to serve as the demand templates of the specific field.
According to a preferred embodiment of the present invention, the step a comprises: A1. selecting a query matched with a preset limiting word in the specific field from user queries from the search logs; A2. and replacing the part matched with the preset slot keyword in the specific field in the selected query with a wildcard character to obtain a candidate demand template.
According to a preferred embodiment of the present invention, after the step a2, the method further includes: and according to the preset requirement on the number of the slots in the specific field, filtering out a candidate requirement template which does not meet the requirement on the number of the slots from the candidate requirement templates obtained in the step A2.
According to a preferred embodiment of the present invention, the step of extracting the similarity feature of the candidate requirement template W includes: acquiring the core word vector of the W and the core word vector of the specific field; and calculating the similarity between the core word vector of the W and the core word vector of the specific field, and taking the similarity as the similarity characteristic of the W.
According to a preferred embodiment of the present invention, the step of obtaining the core word vector of W includes: selecting, from the queries covered by W in the search log, the N1 queries with the most query times, and determining the core words and the weights of the core words from the search results returned by the search engine for the N1 queries to form the core word vector of W, wherein N1 is a positive integer.
According to a preferred embodiment of the present invention, the step of obtaining the domain-specific core word vector comprises: and acquiring a search result returned by a search engine by using the seed query in the specific field, and determining the core words and the weights of the core words in the search result to form the core word vector in the specific field.
According to a preferred embodiment of the present invention, the seed queries of the specific field are obtained in one of the following modes: mode one, selecting, from all candidate requirement templates contained in the specific field, the N2 candidate requirement templates that cover the most queries in the search log, and, for each of the N2 candidate requirement templates, selecting the M1 queries with the most query times from the queries it covers as seed queries, wherein N2 and M1 are positive integers; or, mode two, combining the preset slot keywords of the specific field with the preset limiting words of the specific field to generate the seed queries of the specific field; or, mode three, after part of the seed queries are selected using mode one, replacing the slot keywords in those seed queries with other slot keywords from a preset slot keyword dictionary of the specific field to obtain expanded seed queries, the part of the seed queries and the expanded seed queries together constituting the seed queries of the specific field.
According to a preferred embodiment of the present invention, the step of extracting the generalization capability feature of the candidate requirement template W comprises: determining the slot keyword sequences corresponding to W, counting the number of mutually different slot keyword sequences among them, and calculating the generalization capability feature of W from that number, wherein one slot keyword sequence corresponding to W is the sequence formed by the slot keywords in one query covered by W in the search log.
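The counting step described above can be illustrated with a minimal Python sketch. The text does not say how the feature is computed from the count, so this sketch simply returns the count itself; the function name `generalization_feature` and the sample data are hypothetical.

```python
def generalization_feature(slot_sequences):
    """Count the mutually different slot keyword sequences of one template.

    slot_sequences holds one sequence of slot keywords per query that the
    template covers in the search log. Returning the raw count is an
    assumption; the text only says the feature is computed from it.
    """
    return len({tuple(seq) for seq in slot_sequences})


# Hypothetical slot keyword sequences for one candidate template.
sequences = [
    ("Beijing", "15"),
    ("Shanghai", "14"),
    ("Beijing", "15"),  # same sequence as the first query
]
distinct_count = generalization_feature(sequences)
print(distinct_count)
```

A template that covers many distinct slot keyword sequences generalizes over more user queries than one whose covered queries all share the same slot values.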
According to a preferred embodiment of the present invention, the step of extracting the boundary word feature of the candidate requirement template W includes: segmenting all candidate requirement templates contained in the specific field into segments, selecting positive segments from the obtained segmentation segments and determining the weight of each positive segment to generate a positive vector of the specific field, and selecting negative segments from the obtained segmentation segments and determining the weight of each negative segment to generate a negative vector of the specific field; determining the weights of the segmentation segments of W and forming the vector of W from the segmentation segments of W and their weights; calculating the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtaining the boundary word feature of W from the difference between S1 and S2.
According to a preferred embodiment of the present invention, the process of generating the positive vector and the negative vector of the specific field specifically includes: determining the slot keyword sequences corresponding to each segmentation segment, wherein one slot keyword sequence corresponding to a segmentation segment is the sequence of slot keywords in one query covered by one candidate requirement template containing the segmentation segment; T1, if all the slot keyword sequences corresponding to a segmentation segment are the same, taking the segmentation segment as a negative segment with a weight of 1; T2, if the slot keyword sequences corresponding to a segmentation segment are not all the same, but the proportion P of one slot keyword sequence among all the slot keyword sequences of the segmentation segment is greater than a preset first threshold, taking the segmentation segment as a negative segment with a weight equal to the proportion P; T3, determining, for each candidate requirement template contained in the specific field, the number of mutually different slot keyword sequences corresponding to it, and obtaining the maximum value Z1 among these numbers; if a segmentation segment does not satisfy the conditions in T1 and T2, and the ratio of the number Z2 of mutually different slot keyword sequences corresponding to the segmentation segment to Z1 is greater than a preset second threshold, taking the segmentation segment as a positive segment whose weight is the ratio of Z2 to Z1.
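Rules T1 to T3 can be read as a small classification procedure. The following Python sketch is one possible reading under stated assumptions: the function name `classify_segments`, the input layout (dictionaries mapping segments and templates to lists of slot keyword sequences), and the threshold values are all hypothetical.

```python
from collections import Counter


def classify_segments(segment_sequences, template_sequences,
                      first_threshold, second_threshold):
    """Split segments into positive and negative ones with weights (T1-T3).

    segment_sequences:  segment  -> list of slot keyword sequences (tuples)
    template_sequences: template -> list of slot keyword sequences (tuples)
    """
    # Z1: the largest number of distinct sequences over all templates.
    z1 = max((len(set(seqs)) for seqs in template_sequences.values()), default=0)
    positive, negative = {}, {}
    for segment, seqs in segment_sequences.items():
        if not seqs:
            continue
        counts = Counter(seqs)
        top_share = counts.most_common(1)[0][1] / len(seqs)
        if len(counts) == 1:                 # T1: every sequence identical
            negative[segment] = 1.0
        elif top_share > first_threshold:    # T2: one sequence dominates
            negative[segment] = top_share
        elif z1 > 0:                         # T3: compare Z2 / Z1
            ratio = len(counts) / z1
            if ratio > second_threshold:
                positive[segment] = ratio
    return positive, negative


# Hypothetical data for the public transportation field.
template_sequences = {
    "[city] [route] bus route": [("Beijing", "15"), ("Shanghai", "14"),
                                 ("Tianjin", "8"), ("Beijing", "300")],
    "[city] bus fare": [("Beijing",), ("Beijing",)],
}
segment_sequences = {
    "bus route": [("Beijing", "15"), ("Shanghai", "14"),
                  ("Tianjin", "8"), ("Beijing", "300")],
    "bus fare": [("Beijing",), ("Beijing",)],
    "timetable": [("Beijing",), ("Beijing",), ("Beijing",), ("Shanghai",)],
}
positive, negative = classify_segments(segment_sequences, template_sequences,
                                       first_threshold=0.6, second_threshold=0.5)
print(positive, negative)
```

Checking the largest share against the first threshold is equivalent to asking whether any single sequence exceeds it, which is how T2 is worded.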
According to a preferred embodiment of the present invention, the step of determining the weight of the sliced piece of W includes: and counting the occurrence times of the segmentation segments of the W in the W and taking the times as the weight of the corresponding segmentation segments.
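As a rough illustration of the boundary word feature, the sketch below builds the vector of W from segment counts and takes the difference S1 - S2. The text does not name the similarity measure used here, so cosine similarity is an assumption, as are the function names and the sample vectors; how a template is cut into segments is also left open, so segments are passed in ready-made.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def boundary_word_feature(template_segments, positive_vector, negative_vector):
    """Return S1 - S2 for one template.

    The weight of each segment of the template is the number of times it
    occurs in the template, as described above. Cosine is assumed for S1, S2.
    """
    template_vector = Counter(template_segments)
    s1 = cosine(template_vector, positive_vector)
    s2 = cosine(template_vector, negative_vector)
    return s1 - s2


# Hypothetical positive and negative vectors of a field.
positive_vector = {"bus route": 1.0}
negative_vector = {"bus fare": 1.0, "timetable": 0.75}

good = boundary_word_feature(["bus route"], positive_vector, negative_vector)
bad = boundary_word_feature(["bus fare"], positive_vector, negative_vector)
print(good, bad)
```

A template whose segments look like the positive segments of the field scores high, and one dominated by negative segments scores below zero.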
According to a preferred embodiment of the present invention, the step C comprises: selecting a standard template set from the candidate demand templates; training the parameters corresponding to each extracted feature using the standard template set, and taking as the weights of the corresponding features the parameter values at which the ranking of the templates in the standard template set among all candidate requirement templates can no longer be advanced during training; and calculating the score of each candidate demand template using the extracted features and the weights of the features, and sorting the candidate demand templates according to the scores.
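The final scoring and sorting step can be sketched as follows. The text does not state how features and weights are combined into a score, so a weighted sum is assumed here; the weights are taken as already trained, and all names and numbers are hypothetical.

```python
def rank_templates(features, weights):
    """Score each template and return the templates best-first with scores.

    features: template -> {feature name: value}
    weights:  feature name -> trained weight
    The weighted sum is an assumption, not something the text prescribes.
    """
    scores = {
        template: sum(weights[name] * value for name, value in values.items())
        for template, values in features.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores


# Hypothetical feature values and weights.
features = {
    "A": {"similarity": 0.9, "generalization": 0.2, "boundary": 0.5},
    "B": {"similarity": 0.4, "generalization": 0.9, "boundary": 0.1},
}
weights = {"similarity": 0.5, "generalization": 0.3, "boundary": 0.2}

ranked, scores = rank_templates(features, weights)
print(ranked)
```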
According to a preferred embodiment of the present invention, the step of selecting a standard template set from the candidate requirement templates comprises: sorting the candidate demand templates by feature value separately for each extracted feature, and taking the top N3 candidate requirement templates for each feature as the template set of that feature, wherein N3 is a positive integer; and taking the intersection of the template sets of the features as the standard template set.
According to a preferred embodiment of the present invention, the step D comprises: selecting the candidate demand templates ranked in the top N4 as final demand templates, wherein N4 is a positive integer; obtaining a keyword set from the boundary words of the candidate demand templates ranked in the top M2, and selecting, from the candidate demand templates ranked after the top N4, those whose boundary words all belong to the keyword set as final demand templates, wherein the boundary words are the non-generalized words in a candidate demand template, the keywords are words synonymous with the boundary words or words whose mutual information with the boundary words meets the requirement, and M2 is a positive integer less than or equal to N4.
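The selection of the standard template set as an intersection of per-feature top lists might be sketched like this in Python. The function name `standard_template_set` and the sample values are hypothetical, and higher feature values are assumed to be better.

```python
def standard_template_set(features, n3):
    """Intersect the top-n3 templates of every feature.

    features: feature name -> {template: feature value}
    Sorting in descending order assumes a higher value is better.
    """
    top_sets = []
    for values in features.values():
        ranked = sorted(values, key=values.get, reverse=True)
        top_sets.append(set(ranked[:n3]))
    return set.intersection(*top_sets) if top_sets else set()


# Hypothetical feature values for three candidate templates.
features = {
    "similarity": {"A": 0.9, "B": 0.8, "C": 0.1},
    "generalization": {"A": 5, "B": 1, "C": 7},
}
standard = standard_template_set(features, n3=2)
print(standard)
```

Only templates that rank near the top on every feature survive the intersection, which is why the result can serve as a small trusted set for training.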
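The two-part selection of step D can be sketched as below. This is an illustration under assumptions: `related_words` stands in for the synonym and mutual-information lookup, and the sketch also puts the boundary words themselves into the keyword set, which the text does not state explicitly.

```python
def select_final_templates(ranked, boundary_words, n4, m2, related_words):
    """Pick the top-n4 templates plus lower-ranked ones with known boundary words.

    ranked:         templates, best first
    boundary_words: template -> set of its non-generalized words
    related_words:  word -> set of synonyms or high mutual-information words
                    (a hypothetical lookup table)
    """
    if m2 > n4:
        raise ValueError("m2 must be less than or equal to n4")
    final = list(ranked[:n4])
    keyword_set = set()
    for template in ranked[:m2]:
        for word in boundary_words[template]:
            keyword_set.add(word)  # assumption: boundary words count as keywords
            keyword_set |= related_words.get(word, set())
    for template in ranked[n4:]:
        words = boundary_words[template]
        if words and words <= keyword_set:
            final.append(template)
    return final


# Hypothetical ranking and boundary words.
ranked = ["[city] [route] bus route", "[city] [route] bus line",
          "[city] [route] coach route", "[city] [route] bus recipe"]
boundary_words = {
    ranked[0]: {"bus", "route"},
    ranked[1]: {"bus", "line"},
    ranked[2]: {"coach", "route"},
    ranked[3]: {"bus", "recipe"},
}
related_words = {"bus": {"coach"}, "route": {"line"}}

final = select_final_templates(ranked, boundary_words, n4=2, m2=1,
                               related_words=related_words)
print(final)
```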
The invention also provides a device for generating the domain requirement template, which comprises: a candidate template obtaining unit, configured to obtain a candidate demand template in a specific field; the feature extraction unit is used for extracting features of the candidate demand template, wherein the feature extraction unit at least comprises one of a similarity feature extraction unit, a generalization capability feature extraction unit or a boundary word feature extraction unit, the similarity feature extraction unit is used for extracting similarity features representing the closeness between the candidate demand template and the specific field, the generalization capability feature extraction unit is used for extracting generalization capability features representing the ability of the candidate demand template covering the search request query of the user, and the boundary word feature extraction unit is used for extracting boundary word features representing the influence of non-generalized words in the candidate demand template on the correctness of the candidate demand template; the sorting unit is used for sorting the candidate demand templates by using the features extracted by the feature extraction unit; and the selecting unit is used for selecting the final requirement template from the candidate requirement templates as the requirement template in the specific field according to the sorting result of the sorting unit.
According to a preferred embodiment of the present invention, the candidate template obtaining unit includes: the restriction unit is used for selecting the query matched with the preset restriction words in the specific field from the user queries from the search logs; and the generalization unit is used for replacing the part matched with the preset slot keyword in the specific field in the query selected by the limitation unit with a wildcard character to obtain a candidate demand template.
According to a preferred embodiment of the present invention, the candidate template obtaining unit further includes a filtering unit, configured to filter, according to a preset slot number requirement for the specific field, a candidate requirement template that does not meet the slot number requirement from the candidate requirement templates obtained by the generalization unit.
According to a preferred embodiment of the present invention, the similarity extracting unit includes: the template word vector generating unit is used for acquiring a core word vector of the candidate requirement template W when the similarity characteristic of the W is extracted; the domain word vector generating unit is used for acquiring a core word vector of the specific domain; and the calculating unit is used for calculating the similarity between the core word vector of the W and the core word vector of the specific field, and taking the similarity as the similarity characteristic of the W.
According to a preferred embodiment of the present invention, the template word vector generating unit selects, from the queries covered by W in the search log, the N1 queries with the most query times, and determines the core words and the weights of the core words from the search results returned by the search engine for the N1 queries to form the core word vector of W, wherein N1 is a positive integer.
According to a preferred embodiment of the present invention, the domain word vector generating unit obtains the search result returned by the search engine by using the seed query of the specific domain, and determines the core words and the weights of the core words in the search result to form the core word vector of the specific domain.
According to a preferred embodiment of the present invention, the domain word vector generating unit obtains the seed queries of the specific field in one of the following modes: mode one, selecting, from all candidate requirement templates contained in the specific field, the N2 candidate requirement templates that cover the most queries in the search log, and, for each of the N2 candidate requirement templates, selecting the M1 queries with the most query times from the queries it covers as seed queries, wherein N2 and M1 are positive integers; or, mode two, combining the preset slot keywords of the specific field with the preset limiting words of the specific field to generate the seed queries of the specific field; or, mode three, after part of the seed queries are selected using mode one, replacing the slot keywords in those seed queries with other slot keywords from a preset slot keyword dictionary of the specific field to obtain expanded seed queries, the part of the seed queries and the expanded seed queries together constituting the seed queries of the specific field.
According to a preferred embodiment of the present invention, when extracting the generalization capability feature of the candidate requirement template W, the generalization capability feature extraction unit determines the slot keyword sequence corresponding to W, counts the number of different slot keyword sequences in the slot keyword sequence corresponding to W, and calculates the generalization capability feature of W according to the number, where one slot keyword sequence of W is a sequence composed of slot keywords in one query covered by W in a search log.
According to a preferred embodiment of the present invention, the boundary word feature extraction unit includes: a segmentation unit for segmenting all candidate requirement templates contained in the specific field into segments; a positive and negative vector generating unit for selecting positive segments from the segmentation segments obtained by the segmentation unit and determining the weight of each positive segment to generate a positive vector of the specific field, and for selecting negative segments from the segmentation segments and determining the weight of each negative segment to generate a negative vector of the specific field; a template vector generating unit for, when the boundary word feature of a candidate requirement template W is extracted, determining the weights of the segmentation segments of W and forming the vector of W from the segmentation segments of W and their weights; and a similarity calculation unit for calculating the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtaining the boundary word feature of W from the difference between S1 and S2.
According to a preferred embodiment of the present invention, the positive and negative vector generating unit includes: a slot keyword sequence determining unit for determining the slot keyword sequences corresponding to each segmentation segment, wherein one slot keyword sequence corresponding to a segmentation segment is the sequence of slot keywords in one query covered by one candidate requirement template containing the segmentation segment; and a positive and negative segment selecting unit for selecting positive segments and negative segments from the segmentation segments and determining their weights as follows: T1, if all the slot keyword sequences corresponding to a segmentation segment are the same, taking the segmentation segment as a negative segment with a weight of 1; T2, if the slot keyword sequences corresponding to a segmentation segment are not all the same, but the proportion P of one slot keyword sequence among all the slot keyword sequences of the segmentation segment is greater than a preset first threshold, taking the segmentation segment as a negative segment with a weight equal to the proportion P; T3, determining, for each candidate requirement template contained in the specific field, the number of mutually different slot keyword sequences corresponding to it, and obtaining the maximum value Z1 among these numbers; if a segmentation segment does not satisfy the conditions in T1 and T2, and the ratio of the number Z2 of mutually different slot keyword sequences corresponding to the segmentation segment to Z1 is greater than a preset second threshold, taking the segmentation segment as a positive segment whose weight is the ratio of Z2 to Z1.
According to a preferred embodiment of the present invention, when determining the weight of the sliced segment of W, the template vector feature generation unit counts the number of times that the sliced segment of W appears in W and takes the number of times as the weight of the corresponding sliced segment.
According to a preferred embodiment of the present invention, the sorting unit includes: the standard template set selecting unit is used for selecting a standard template set from the candidate demand templates; the training unit is used for training the extracted parameters corresponding to the features by using the standard template set, and taking the parameter values of the templates in the standard template set when the ranking of the templates in all candidate requirement templates cannot be advanced in the training as the weights of the corresponding features; and the calculating and sorting unit is used for calculating the score of the candidate demand template by using the features extracted by the feature extraction unit and the weight of each feature obtained by the training unit and sorting the candidate demand template according to the score.
According to a preferred embodiment of the present invention, the standard template set selecting unit includes: a template set determining unit for sorting the candidate requirement templates by feature value separately for each extracted feature and taking the top N3 candidate requirement templates for each feature as the template set of that feature, wherein N3 is a positive integer; and an intersection unit for taking the intersection of the template sets of the features as the standard template set.
According to a preferred embodiment of the present invention, the selecting unit includes: a first selection unit for selecting the candidate demand templates ranked in the top N4 as final demand templates, wherein N4 is a positive integer; and a second selection unit for obtaining a keyword set from the boundary words of the candidate demand templates ranked in the top M2, and selecting, from the candidate demand templates ranked after the top N4, those whose boundary words all belong to the keyword set as final demand templates, wherein the boundary words are the non-generalized words in a candidate demand template, the keywords are words synonymous with the boundary words or words whose mutual information with the boundary words meets the requirement, and M2 is a positive integer less than or equal to N4.
According to the technical scheme, the invention provides the universal field demand template generation method, and the candidate demand templates can be automatically mined and the characteristics of the candidate demand templates are extracted to evaluate the quality of the candidate demand templates aiming at different fields, so that the high-quality demand templates can be obtained from the candidate demand templates. The high-quality requirement templates of various fields obtained by the invention provide guarantee for the search engine to understand the behavior of the user.
[ description of the drawings ]
FIG. 1 is a schematic flow chart of a method for generating a domain requirement template according to the present invention;
FIG. 2 is a flowchart illustrating an embodiment of obtaining a candidate requirement template according to the present invention;
FIG. 3 is a schematic diagram of the present invention using seed query to obtain search engine return data;
FIG. 4 is a block diagram schematically illustrating the structure of an embodiment of the apparatus for generating a domain requirement template according to the present invention;
FIG. 5 is a block diagram schematically illustrating the structure of an embodiment of the similarity feature extraction unit according to the present invention;
FIG. 6 is a block diagram illustrating the structure of an embodiment of a boundary word feature extraction unit according to the present invention;
FIG. 7 is a block diagram illustrating the structure of an embodiment of a standard template set selection unit according to the present invention.
[ detailed description of the embodiments ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Referring to fig. 1, fig. 1 is a flow chart illustrating a method for generating a domain requirement template according to the present invention. As shown in fig. 1, the method includes:
Step S101: acquire candidate demand templates of a specific field.
Step S102: extract features of the candidate demand templates.
Step S103: sort the candidate demand templates using the extracted features.
Step S104: select final demand templates from the candidate demand templates according to the sorting result, to serve as the demand templates of the specific field.
The above method is described in detail below by way of specific examples.
In the invention, the specific field is a range reflecting the search purpose of the user, such as the public transportation field, the weather field and the like, and the fields reflect the search purpose when the user searches information.
Referring to fig. 2, fig. 2 is a flowchart illustrating an embodiment of obtaining a candidate requirement template according to the present invention. In this embodiment, a domain qualifier dictionary and a slot keyword dictionary are used to process a user search request query in a user search log (querylog), so as to generate a candidate requirement template.
The domain qualifier dictionary contains words related to each domain, wherein the qualifier of a specific domain is a word related to a specific domain, and in this embodiment, the qualifier of a specific domain is used for filtering a query when the query is selected. Only the query containing the qualifier of the specific field can be generalized, and the candidate demand template generated by the generalization belongs to the candidate demand template of the specific field. The words in the domain qualifier dictionary can be collected by the following ways:
Firstly, domain seed words can be mined from user queries to serve as domain qualifiers; the domain seed words can be configured manually or labeled manually in the search log.
Then, a synonym dictionary is searched to obtain words synonymous with the domain seed words as domain qualifiers, and mutual information, which measures how closely two words are associated, is used to select words in the search log that are highly associated with the seed words as domain qualifiers. Mutual information between words can be obtained from statistics over a large-scale corpus; as this belongs to the prior art, it is not described further here. Taking the public transportation field as an example, Table 1 gives some examples of domain qualifiers:
TABLE 1
[Table 1: examples of domain qualifiers for the public transportation field; table image not reproduced]
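The mutual information between a seed word and a candidate qualifier, which the text leaves to the prior art, is often estimated as pointwise mutual information from corpus counts. The sketch below assumes that estimate; the function name and the counts are hypothetical and do not come from the patent.

```python
import math


def pointwise_mutual_information(count_xy, count_x, count_y, total):
    """PMI of two words from co-occurrence counts over `total` text units.

    count_xy: units containing both words
    count_x, count_y: units containing each word
    Returns negative infinity when the words never co-occur.
    """
    if count_xy == 0 or count_x == 0 or count_y == 0 or total == 0:
        return float("-inf")
    return math.log2((count_xy * total) / (count_x * count_y))


# Hypothetical counts: two words co-occur in 10 of 1000 queries.
pmi = pointwise_mutual_information(count_xy=10, count_x=20, count_y=50, total=1000)
print(round(pmi, 4))
```

Words whose score with a seed word exceeds some chosen cutoff could then be added to the qualifier dictionary; the cutoff itself is not specified in the text.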
The process of generating the candidate requirement template is a process of generalizing the query, and the generalization refers to replacing a part, matched with the slot keyword in the specific field, in the query of the user with a wildcard. The slot key is a word used for generalization, and is determined by looking up a slot key dictionary, which can be obtained by collecting various proper nouns.
For example, the query "Beijing 15-way bus route" can be generalized into the demand template "[city name] [bus route] bus route". Each "[ ]" symbol represents a slot of the template, which can be filled by any word that satisfies the attribute of the wildcard; for example, the template above also matches "Shanghai suburb 14-way bus route".
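The generalization of a query into a template can be sketched in a few lines of Python. The slot keyword dictionary below is a hypothetical stand-in, and the single-pass regular expression with longest-keyword-first matching is an implementation choice of this sketch, not something the text prescribes.

```python
import re


def generalize(query, slot_dictionary):
    """Replace every slot keyword in the query with its "[slot name]" wildcard.

    slot_dictionary: slot name -> list of keywords (hypothetical contents).
    Longer keywords are tried first so that a long proper noun is not
    split by a shorter keyword contained in it.
    """
    keyword_to_slot = {
        keyword: slot
        for slot, keywords in slot_dictionary.items()
        for keyword in keywords
    }
    if not keyword_to_slot:
        return query
    alternatives = sorted(keyword_to_slot, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in alternatives))
    return pattern.sub(lambda m: f"[{keyword_to_slot[m.group(0)]}]", query)


slot_dictionary = {
    "city name": ["Beijing", "Shanghai"],
    "bus route": ["15-way", "suburb 14-way"],
}
template_a = generalize("Beijing 15-way bus route", slot_dictionary)
template_b = generalize("Shanghai suburb 14-way bus route", slot_dictionary)
print(template_a)
print(template_b)
```

Both sample queries collapse to the same template, which is the sense in which one template covers many queries.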
After the candidate demand templates are obtained, whether the candidate demand templates are subjected to filtering processing can be determined according to the requirement on the number of slots preset in the specific field to which the candidate demand templates belong. For example, in the field of train information query, the variable information in the query generally only relates to a starting point and an end point, so that the number of the template preset slots in the field of train information query can be set to be 2, and any template which does not meet the requirement of the preset slot number can be filtered out, so that the complexity of processing the candidate requirement template is reduced.
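The slot-number filter can be illustrated as follows. The sketch assumes slots are written as "[name]" and that a template must contain exactly the preset number of slots; whether the requirement is an exact count or an upper bound is not stated, so the equality test is an assumption.

```python
import re

# A slot is any bracketed wildcard such as "[city]".
SLOT_PATTERN = re.compile(r"\[[^\[\]]+\]")


def filter_by_slot_count(templates, required_slots):
    """Keep only templates whose number of slots equals required_slots."""
    return [
        template for template in templates
        if len(SLOT_PATTERN.findall(template)) == required_slots
    ]


# Hypothetical candidates for the train information field (2 slots expected).
candidates = [
    "[city] to [city] train",
    "[city] train station",
    "[city] to [city] via [city] train",
]
kept = filter_by_slot_count(candidates, required_slots=2)
print(kept)
```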
In this embodiment, the features extracted in step S102 include at least one of the following:
the similarity feature, which describes how closely the candidate requirement template is related to the specific field;
the generalization capability feature, which describes the ability of the candidate requirement template to cover user search request queries; and
the boundary word feature, which describes the influence of the non-generalized words in the candidate requirement template on the correctness of the template.
The following describes an embodiment of a calculation method of the above three features.
1. Similarity feature
The similarity feature of a candidate requirement template W can be obtained by calculating the cosine similarity between the core word vector of W and the core word vector of the specific field to which W belongs. Specifically, the following formula (1) can be used:
sim_score = CosSimilarity(pattern_vector, seed_query_centroid) (1)
where sim_score is the similarity feature value of the candidate requirement template W, pattern_vector is the core word vector of W, seed_query_centroid is the core word vector of the specific field, and CosSimilarity is the cosine similarity function.
A core word vector is a vector whose features are core words. Therefore, to calculate the similarity feature, the core words must first be selected.
To determine the core words of the specific field, seed queries of the specific field can be submitted to a search engine, and the core words are determined from the returned data. Referring to Fig. 3, Fig. 3 is a schematic diagram of obtaining data returned by a search engine using a seed query according to the present invention. As shown in Fig. 3, the seed query is "Beijing 15-way bus route", and a plurality of search results can be obtained from the search engine with it. The titles and body text of these search results are preprocessed (including sentence segmentation, word segmentation, stop word removal and the like) to obtain a statistical corpus. For each word in the statistical corpus, the number of sentences in which the word and a search term co-occur is counted, as is the number of sentences containing the search term, where the search terms are obtained by segmenting the seed query into words.
With this information, the weight of each word can be calculated using the following formula (2). Words whose weight exceeds a set threshold are taken as core words, and their weights serve as the weights of the corresponding vector features.
$$\mathrm{Centrality}_{sch\_term}(w) = \frac{\log(\mathrm{Co}(w, sch\_term) + 1)}{\log(\mathrm{sf}(w) + 1) + \log(\mathrm{sf}(sch\_term) + 1)} \times \log(\mathrm{idf}(w) + 1) \qquad (2)$$
where Centrality_{sch_term}(w) is the weight of the word w; Co(w, sch_term) is the number of sentences in which w and the search term sch_term co-occur; sf(sch_term) is the number of sentences containing the search term sch_term; sf(w) is the number of sentences containing the word w; and idf(w) is the inverse document frequency of w, which can be looked up in an inverse document frequency table compiled from large-scale corpus statistics.
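As a minimal sketch, formula (2) can be evaluated as shown below. The function and parameter names are illustrative, and returning 0 when the denominator is zero is an assumption, since this description does not cover that case.

```python
import math


def centrality(co, sf_w, sf_term, idf_w):
    """Weight of word w relative to one search term, following formula (2).

    co      -- number of sentences in which w and the search term co-occur
    sf_w    -- number of sentences containing w
    sf_term -- number of sentences containing the search term
    idf_w   -- inverse document frequency of w
    """
    denominator = math.log(sf_w + 1) + math.log(sf_term + 1)
    if denominator == 0:
        # Assumption: neither word occurs in any sentence, so the weight is 0.
        return 0.0
    return math.log(co + 1) / denominator * math.log(idf_w + 1)
```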
The seed queries of the specific field can be obtained by any of the following embodiments:
First embodiment:
From the candidate requirement templates contained in the specific field, select the N_2 templates that cover the largest number of queries in the search log, and, from the queries covered by each of these N_2 templates, select the M_1 queries with the highest query count as seed queries, where N_2 and M_1 are positive integers and M_1 is preferably equal to 1. For example, Table 2 below lists candidate requirement templates for the public transportation field:
TABLE 2
Figure BDA0000098048170000121
Suppose N_2 = 2 and M_1 = 1. Table 3 shows the seed queries obtained by the first embodiment from the candidate requirement templates in Table 2.
TABLE 3
Seed query | Corresponding template
Beijing 15-way bus route | [city name][bus route] bus route
Beijing public transport 23 routes | [city name] bus [bus route]
In this embodiment, the seed query is derived from the user's real query, and can better represent the user's habits.
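The first embodiment can be sketched as follows. The data layout (a mapping from each template to a counter of its covered queries) and the function name are assumptions made for illustration, and "number of covered queries" is read here as the number of distinct covered queries.

```python
from collections import Counter


def select_seed_queries(coverage, n2, m1=1):
    """Pick seed queries as in the first embodiment.

    coverage maps each candidate template to a Counter of
    {covered query: number of times the query was issued}.
    """
    # Rank templates by how many distinct queries they cover (assumption).
    top_templates = sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)[:n2]
    seeds = []
    for template in top_templates:
        # For each selected template, take its M_1 most frequent queries.
        for query, _count in coverage[template].most_common(m1):
            seeds.append((query, template))
    return seeds
```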
Second embodiment:
A seed query is generated by combining slot keywords of the specific field with domain qualifiers of the specific field.
Taking the generation of seed queries in the public transportation field as an example, see Table 4:
TABLE 4
Generated seed query | Corresponding slot keywords | Corresponding domain qualifier
Beijing 15-way bus route | Beijing, 15-way | bus route
Shanghai public transport | Shanghai | public transport
Seed queries generated in this way have a simple structure.
Preferably, the seed queries are obtained using the third embodiment.
Third embodiment:
Some seed queries are selected by the method of the first embodiment, and the slot keywords in the selected seed queries are replaced with other slot keywords of the specific field from the slot keyword dictionary to obtain expanded seed queries.
For example, table 5 shows the seed query obtained in the third embodiment.
TABLE 5
Selected seed query | Expanded seed query
Beijing 15-way bus route | Shenyang 15-way bus route
Beijing public transport 23 routes | Bus 12 routes in south China
The above process yields the core word vector of the specific field. The process of obtaining the core word vector of a candidate requirement template is described next.
As with the core word vector of the specific field, a statistical corpus must be obtained first. To obtain it, the N_1 queries with the highest query count are selected, from the queries covered by the candidate requirement template in the search log, as the queries to be searched. These queries are submitted to a search engine, and the titles and body text of the search results are preprocessed to obtain the statistical corpus, where N_1 is a positive integer.
In the statistical corpus, the frequency of each word is counted and the weight of each word is calculated according to the following formula (3). Words whose weight exceeds a set threshold are taken as core words of the candidate requirement template, and the weight of each core word is the weight of the corresponding vector feature.
Weight(w) = log(tf(w) + 1) × log(idf(w) + 1) (3)
where Weight(w) is the weight of the word w, tf(w) is the frequency of w, and idf(w) is the inverse document frequency of w, which can be looked up in an inverse document frequency table compiled from large-scale corpus statistics.
After the core word vectors of the candidate demand templates and the core word vectors of the specific field are obtained, the similarity characteristics of the candidate demand templates can be calculated according to the formula (1).
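Formulas (3) and (1) can be sketched together as follows, with each core word vector held as a sparse dictionary from word to weight. The representation and the names are illustrative assumptions, as is returning 0 for an empty vector.

```python
import math


def core_word_vector(tf, idf, threshold):
    """Build a template's core word vector using formula (3).

    tf maps word -> frequency in the statistical corpus;
    idf maps word -> inverse document frequency.
    """
    vector = {}
    for word, freq in tf.items():
        weight = math.log(freq + 1) * math.log(idf.get(word, 0.0) + 1)
        if weight > threshold:  # keep only words above the set threshold
            vector[word] = weight
    return vector


def cos_similarity(a, b):
    """Cosine similarity of two sparse vectors, as used in formula (1)."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)
```

With these helpers, sim_score is `cos_similarity(pattern_vector, seed_query_centroid)`.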
2. Generalization capability feature
The generalization capability feature can be measured by the number of distinct slot keyword sequences among the slot keyword sequences corresponding to the candidate requirement template, where a slot keyword sequence corresponding to a candidate requirement template is the sequence of slot keywords in one query that the template covers in the search log.
For example, for the template "[city name][bus route] bus route", the covered queries include "Beijing 15-way bus route", "Shanghai suburb 14-way bus route", "Shenyang ferri-xi 2-way bus route" and "Beijing 15-way bus route map query". The slot keyword sequences are "Beijing 15-way", "Shanghai suburb 14-way", "Shenyang ferri-xi 2-way" and "Beijing 15-way", of which the distinct ones are "Beijing 15-way", "Shanghai suburb 14-way" and "Shenyang ferri-xi 2-way". The generalization capability of the template "[city name][bus route] bus route" is therefore 3.
Preferably, the generalization capability feature is calculated as follows. First, the number of distinct slot keyword sequences corresponding to each candidate requirement template contained in the specific field is determined, together with the maximum of these numbers. Then the generalization capability feature value of each candidate requirement template is calculated according to the following formula (4):
general_score_i = log(pattern_dif_query_i + 1) / log(max_dif_query + 1) (4)
where general_score_i is the generalization capability feature value of candidate requirement template i, pattern_dif_query_i is the number of distinct slot keyword sequences corresponding to candidate requirement template i, and max_dif_query is the maximum number of distinct slot keyword sequences corresponding to any candidate requirement template contained in the specific field to which candidate requirement template i belongs.
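Formula (4) can be sketched as follows. Representing each slot keyword sequence as a tuple, and the function name, are assumptions made for illustration.

```python
import math


def generalization_scores(slot_sequences):
    """Compute formula (4) for every template of one field.

    slot_sequences maps template -> list of slot keyword sequences,
    one tuple per query covered by the template.
    """
    distinct = {t: len(set(seqs)) for t, seqs in slot_sequences.items()}
    max_dif_query = max(distinct.values(), default=0)
    if max_dif_query == 0:
        # Assumption: with no covered queries at all, every score is 0.
        return {t: 0.0 for t in distinct}
    return {
        t: math.log(n + 1) / math.log(max_dif_query + 1)
        for t, n in distinct.items()
    }
```

The logarithms compress the raw counts, and dividing by the field-wide maximum keeps every score in the range 0 to 1.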
3. Boundary word features
Boundary words are the words in a candidate requirement template that are not generalized. These words affect the correctness of the finally generated template. For example, in the public transportation field, the requirement template "[city name][bus route] bus route" clearly reflects the needs of the field better than the template "bus card disconnected how to do [city name]".
In the present invention, the boundary word feature of a candidate requirement template W is calculated by the following formula (5):
boundary_word_score = CosSimilarity(pattern_centroid, positive_centroid) - CosSimilarity(pattern_centroid, negative_centroid) (5)
where boundary_word_score is the boundary word feature of the candidate requirement template W, CosSimilarity is the cosine similarity function, pattern_centroid is the vector formed from W, positive_centroid is the positive vector of the specific field, and negative_centroid is the negative vector of the specific field.
How to obtain the variable values in the formula is described below.
The process of generating the domain-specific positive and negative vectors includes:
All candidate requirement templates contained in the specific field are segmented into n-grams (n is a preset positive integer, preferably 2) to obtain the segments, where an n-gram is a combination of n consecutive words, each word being the smallest-granularity unit capable of semantic expression. For example, for the template "[city name][bus route] bus route", assuming the smallest-granularity words are "[city name]", "[bus route]" and "bus route", the 2-gram segments of the template are "[city name][bus route]" and "[bus route] bus route". Likewise, for the template "bus card disconnected how to do [city name]", assuming the smallest-granularity words are "bus card", "disconnected", "how to do" and "[city name]", the 2-gram segments of the template are "bus card disconnected", "disconnected how to do" and "how to do [city name]".
Positive segments and negative segments are then selected from the segments, where each positive segment is a vector feature of the positive vector and each negative segment is a vector feature of the negative vector, and the weight of each vector feature is determined. The process includes the following steps:
A. Determine the slot keyword sequences corresponding to each segment, where a slot keyword sequence of a segment is the sequence of slot keywords in one query covered by a candidate requirement template that contains the segment.
For example, for the segment "[city name] public transport", the candidate requirement templates containing the segment and the queries covered by those templates are shown in Table 6:
TABLE 6
Figure BDA0000098048170000161
Then, for the segment "[city name] public transport", the slot keyword sequences are "Beijing 15 routes", "Shanghai 36 routes", "Beijing 15 routes" and "Hangzhou".
B. Positive vector features and negative vector features are selected from the segments, and the weight of each vector feature is determined, as follows:
(1) If all the slot keyword sequences of a segment are the same, the segment is taken as a negative vector feature with a weight of 1.
(2) If the slot keyword sequences of a segment are not all the same, but the proportion P of one slot keyword sequence among all the slot keyword sequences of the segment is greater than a preset first threshold, the segment is taken as a negative vector feature whose weight is the proportion P. Preferably, the first threshold is 90%.
(3) The number of distinct slot keyword sequences corresponding to each candidate requirement template contained in the specific field is determined, and the maximum Z_1 of these numbers is obtained. If a segment satisfies neither of the two conditions above, and the ratio of the number Z_2 of its distinct slot keyword sequences to Z_1 is greater than a preset second threshold, the segment is taken as a positive vector feature whose weight is the ratio of Z_2 to Z_1. Preferably, the second threshold is 1%.
For example, the segment "[city name] public transport" has the distinct slot keyword sequences "Beijing 15 routes", "Shanghai 36 routes" and "Hangzhou", so its number of distinct slot keyword sequences is 3. "Beijing 15 routes" accounts for 2/4 of all the slot keyword sequences, "Shanghai 36 routes" for 1/4 and "Hangzhou" for 1/4, so the segment satisfies neither condition (1) nor condition (2) and is not a negative vector feature. Assuming that the maximum number of distinct slot keyword sequences corresponding to any candidate requirement template contained in the specific field is 10 and that the second threshold is 1%, the segment is taken as a positive vector feature, since 3/10 is greater than 1%.
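Rules (1) to (3) can be sketched as a single classification function. The function name and return convention are illustrative assumptions; the default thresholds are the preferred values stated above, and Z_1 is assumed to be a positive number supplied by the caller.

```python
from collections import Counter


def classify_segment(sequences, z1, first_threshold=0.9, second_threshold=0.01):
    """Classify one segment as a positive or negative vector feature.

    sequences -- slot keyword sequences of the segment, one per covered query
    z1        -- maximum number of distinct slot keyword sequences over all
                 candidate templates of the field (assumed to be > 0)
    Returns (label, weight), where label is "negative", "positive" or None.
    """
    total = len(sequences)
    if total == 0:
        return None, 0.0
    counts = Counter(sequences)
    # Rule (1): all slot keyword sequences are identical.
    if len(counts) == 1:
        return "negative", 1.0
    # Rule (2): one sequence dominates beyond the first threshold.
    p = counts.most_common(1)[0][1] / total
    if p > first_threshold:
        return "negative", p
    # Rule (3): enough distinct sequences relative to Z_1.
    ratio = len(counts) / z1
    if ratio > second_threshold:
        return "positive", ratio
    return None, 0.0
```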
Taking the template shown in table 2 as an example, the positive vector and the negative vector obtained in the above manner are shown in tables 7 and 8, respectively:
TABLE 7
Vector feature in positive vector | Feature weight
[city name][bus route] | 1.000000
[bus route] bus route | 1.000000
[city name] public transport | 0.666667
bus [bus route] | 0.666667
[place name] to | 0.666667
to [place name] | 1.000000
[place name] is/are as follows | 0.666667
Bus (2) | 0.666667
TABLE 8
Vector feature in negative vector | Feature weight
[place name] bus route | 1.000000
public transit monthly ticket | 1.000000
monthly ticket [city name] | 1.000000
bus card [place name] | 1.000000
[place name] rechargeable point | 1.000000
public transport [city name] | 1.000000
[city name] telephone set | 1.000000
public transport [place name] | 1.000000
[place name] catching thief | 1.000000
bus is disconnected | 1.000000
how to break | 1.000000
how to do [city name] | 1.000000
The vector features of the vector formed from a candidate requirement template W are the segments of W, obtained by the same segmentation as for the positive and negative vectors, and the weight of each feature can be determined by the number of times the corresponding segment appears in W.
For example, the template "[city name][bus route] bus route" contains the segments "[city name][bus route]" and "[bus route] bus route". Since each of these segments appears once in the template, the feature weights of the vector features "[city name][bus route]" and "[bus route] bus route" for this template are both 1. If a template is "[city name][bus route][city name][bus route]", the feature weight of its vector feature "[city name][bus route]" is 2.
The feature weights of a candidate requirement template's vector features can be determined in more than one way: besides using the number of times a segment appears in the template, a Boolean value can be used. The method of calculating the feature weights is not limited here.
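Putting the pieces together, formula (5) can be sketched as follows, using occurrence counts as feature weights. The tokenised input (a list of smallest-granularity words), the tuple representation of segments and all names are assumptions made for the example.

```python
import math
from collections import Counter


def ngram_vector(tokens, n=2):
    """Count each n-gram segment of a tokenised template."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return dict(Counter(grams))


def cos_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def boundary_word_score(tokens, positive, negative):
    """Formula (5): similarity to the positive vector minus similarity
    to the negative vector of the field."""
    vector = ngram_vector(tokens)
    return cos_similarity(vector, positive) - cos_similarity(vector, negative)
```

A template that shares segments mostly with the positive vector scores close to 1, and one that shares segments mostly with the negative vector scores close to -1.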
Taking the candidate requirement templates shown in table 2 as an example, the boundary word features of each candidate requirement template are shown in table 9:
TABLE 9
Figure BDA0000098048170000191
In step S103, the sorting process includes:
1. Selecting a standard template set from the candidate requirement templates, which includes:
for each extracted feature, sorting the candidate requirement templates by the value of that feature and taking the top N_3 candidate requirement templates as the template set of that feature, where N_3 is a positive integer; and
taking the intersection of the template sets of all the features as the standard template set.
For example: the candidate requirement templates S1-S10 are ranked against features 1, 2, 3, resulting in table 10:
TABLE 10
Figure BDA0000098048170000201
If N_3 = 5, the template set of feature 1 is {S5, S6, S4, S2, S1}, the template set of feature 2 is {S4, S5, S2, S8, S1}, the template set of feature 3 is {S2, S10, S5, S6, S1}, and the intersection of the template sets of the features is {S1, S2, S5}.
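The selection of the standard template set can be sketched as a set intersection over the per-feature top-N_3 lists. The input layout (a mapping from feature name to a best-first ranking) and the function name are assumptions for illustration.

```python
def standard_template_set(rankings, n3):
    """Intersect the top-N_3 templates of every feature.

    rankings maps feature name -> list of templates sorted best-first
    by that feature's value.
    """
    top_sets = [set(order[:n3]) for order in rankings.values()]
    if not top_sets:
        return set()
    return set.intersection(*top_sets)
```

Applied to the example above with N_3 = 5, the three top-5 lists intersect in S1, S2 and S5.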
2. Training the parameters corresponding to the extracted features using the standard template set, and taking as the weights of the corresponding features the parameter values at which the ranking of the templates of the standard template set among all candidate requirement templates can no longer be advanced.
Formula (6) gives the score of each candidate requirement template when all candidate requirement templates are ranked on all extracted features; the higher a template's score, the better its quality and the higher its rank.
total_score = λ_1 × sim_score + λ_2 × general_score + λ_3 × boundary_word_score (6)
where sim_score, general_score and boundary_word_score are the values of the similarity feature, the generalization capability feature and the boundary word feature respectively, and λ_1, λ_2 and λ_3 are the parameters to be trained, which represent the weights of the features.
The parameters are trained by gradient descent: their values are adjusted iteratively so that the templates in the standard template set rank as high as possible, until the ranking of those templates among all candidate requirement templates no longer advances. At that point, each parameter value is the weight of the corresponding feature.
3. Calculating the score of each candidate requirement template using the extracted features and their weights, and sorting the candidate requirement templates by score. That is, the score of each candidate requirement template is calculated using formula (6) above, where λ_1, λ_2 and λ_3 are the feature weights obtained by training.
With the scores calculated in this way, the candidate requirement templates can be sorted in descending order of score.
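The scoring and sorting step can be sketched as follows. The weights are taken as given; the training procedure itself is not sketched here, and the data layout and function name are assumptions.

```python
def rank_templates(features, weights):
    """Sort templates by the weighted sum of formula (6), best first.

    features maps template -> (sim_score, general_score, boundary_word_score);
    weights is the trained triple (lambda_1, lambda_2, lambda_3).
    """
    def total_score(values):
        return sum(w * v for w, v in zip(weights, values))

    return sorted(features, key=lambda t: total_score(features[t]), reverse=True)
```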
When the final requirement templates are selected in step S104, in addition to taking the candidate requirement templates ranked in the top N_4 as final requirement templates, the boundary words of the candidate requirement templates ranked in the top M_2 can be used to select further final requirement templates from the candidate requirement templates ranked after the top N_4, where M_2 and N_4 are positive integers and M_2 ≤ N_4.
The specific method is as follows:
using a keyword dictionary, obtaining the keyword set corresponding to the boundary words of the candidate requirement templates ranked in the top M_2, where a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word satisfies a requirement; and
taking as final requirement templates those candidate requirement templates ranked after the top N_4 whose boundary words all belong to the keyword set.
Assume the templates ranked in the top M_2 are "[city name][bus route] bus route", "[place name] to [place name] bus" and "[city name] bus [bus route]". Their boundary words include "bus route", "to", "bus" and so on, and the keyword set corresponding to these boundary words can be obtained from the keyword dictionary as "bus/business transaction/bus/public transport line/bus combination/bus line/bus/city bus/bus line/arrival". Then, for a template ranked after the top N_4 such as "to [place name] bus route", since its boundary words "to" and "bus route" are both in the keyword set, the template can also be selected as a final template. The keywords in the keyword dictionary can be obtained by various existing techniques, such as synonym mining or mutual information calculation, which are not described in detail here.
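The final selection can be sketched as follows. The data layout and names are illustrative. Two points are assumptions rather than statements of this description: the boundary words of the top-M_2 templates are themselves placed in the keyword set, and a template with no boundary words is not promoted.

```python
def select_final_templates(ranked, boundary_words, keyword_dict, n4, m2):
    """Pick the final requirement templates.

    ranked         -- candidate templates sorted best-first by total score
    boundary_words -- template -> list of its non-generalized words
    keyword_dict   -- boundary word -> related keywords (synonyms or
                      words with sufficient mutual information)
    """
    if m2 > n4:
        raise ValueError("M_2 must not exceed N_4")
    # Build the keyword set from the boundary words of the top-M_2 templates.
    keyword_set = set()
    for template in ranked[:m2]:
        for word in boundary_words[template]:
            keyword_set.add(word)  # assumption: the word itself is included
            keyword_set.update(keyword_dict.get(word, []))
    # The top-N_4 templates are always kept.
    final = list(ranked[:n4])
    # A lower-ranked template is kept if all its boundary words are keywords.
    for template in ranked[n4:]:
        words = boundary_words[template]
        if words and all(word in keyword_set for word in words):
            final.append(template)
    return final
```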
Referring to fig. 4, fig. 4 is a block diagram illustrating a structure of an apparatus for generating a domain template according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes: a candidate requirement template obtaining unit 201, a feature extracting unit 202, a sorting unit 203 and a selecting unit 204.
Wherein the candidate requirement template obtaining unit 201 is used for obtaining a candidate requirement template of a specific field. Preferably, the candidate requirement template obtaining unit 201 includes a definition unit 2011 and a generalization unit 2012.
The definition unit 2011 is configured to select, from the search log, the user search request queries that match a preset domain qualifier of the specific field, where a domain qualifier is a word related to the specific field. The generalization unit 2012 is configured to replace the part of a selected query that matches a preset slot keyword of the specific field with a wildcard to obtain a candidate requirement template, where a slot keyword of the specific field is a word used for generalization in that field.
Further, the candidate requirement template obtaining unit 201 may further include a filtering unit, configured to filter, according to a preset requirement for the number of slots in the specific field, a candidate requirement template that does not meet the requirement for the number of slots from the candidate requirement templates obtained by the generalization unit.
The feature extraction unit 202 is configured to extract features of the candidate requirement template. Preferably, the feature extraction unit 202 includes at least one of a similarity feature extraction unit 2021, a generalization capability feature extraction unit 2022, and a boundary word feature extraction unit 2023.
The similarity feature extraction unit 2021 is configured to extract the similarity feature of a candidate requirement template, where the similarity feature describes how closely the candidate requirement template is related to the specific field. Referring to Fig. 5, Fig. 5 is a block diagram of the structure of the similarity feature extraction unit according to an embodiment of the present invention. As shown in Fig. 5, the similarity feature extraction unit 2021 includes a template word vector generation unit 2021_1, a domain word vector generation unit 2021_2 and a calculation unit 2021_3.
The template word vector generation unit 2021_1 is configured to obtain the core word vector of W when the similarity feature of a candidate requirement template W is extracted.
The domain word vector generation unit 2021_2 is configured to obtain the core word vector of the specific field.
The calculation unit 2021_3 is configured to calculate the similarity between the core word vector of W and the core word vector of the specific field, and to use this similarity as the similarity feature of W.
Preferably, when obtaining the core word vector of W, the template word vector generation unit 2021_1 selects, from the queries covered by W in the search log, the N_1 queries with the highest query count, and determines the core words and their weights from the search results returned by a search engine for these N_1 queries to form the core word vector of W, where N_1 is a positive integer.
The domain word vector generation unit 2021_2 may obtain the seed queries of the specific field in any of the following modes:
Mode one: from all candidate requirement templates contained in the specific field, select the N_2 templates that cover the largest number of queries in the search log, and, from the queries covered by each of these N_2 templates, select the M_1 queries with the highest query count as seed queries, where N_2 and M_1 are positive integers.
Mode two: combine preset slot keywords of the specific field with preset domain qualifiers of the specific field to generate the seed queries of the specific field.
Mode three: after some seed queries are selected using mode one, replace the slot keywords in the selected seed queries with other slot keywords from a preset slot keyword dictionary of the specific field to obtain expanded seed queries; the selected seed queries and the expanded seed queries together constitute the seed queries of the specific field.
Preferably, the domain word vector generation unit 2021_2 obtains the seed queries of the specific field using mode three.
Please continue to refer to fig. 4. A generalization capability feature extraction unit 2022, configured to extract a generalization capability feature of the candidate requirement template. The generalization capability feature is used for describing the capability of the candidate requirement template covering the search request query of the user.
Preferably, when extracting the generalization capability feature of the candidate requirement template W, the generalization capability feature extraction unit 2022 determines a slot keyword sequence corresponding to W, counts the number of different slot keyword sequences in the slot keyword sequence corresponding to W, and calculates the generalization capability feature of W according to the number, where one slot keyword sequence corresponding to W is a sequence composed of slot keywords in one query covered by W in the search log.
The boundary word feature extraction unit 2023 is configured to extract boundary word features of the candidate requirement templates. The boundary word features are used for describing the influence of the non-generalized words in the candidate requirement template on the correctness of the candidate requirement template.
Referring to Fig. 6, Fig. 6 is a block diagram of the structure of the boundary word feature extraction unit according to an embodiment of the present invention. As shown in Fig. 6, the unit includes a slicing unit 2023_1, a positive/negative vector generation unit 2023_2, a template vector generation unit 2023_3 and a similarity calculation unit 2023_4.
The slicing unit 2023_1 is configured to slice all candidate requirement templates contained in the specific field into segments.
The positive/negative vector generation unit 2023_2 is configured to select positive segments from the segments obtained by the slicing unit 2023_1 and determine their weights to generate the positive vector of the specific field, and to select negative segments from those segments and determine their weights to generate the negative vector of the specific field. Preferably, the positive/negative vector generation unit 2023_2 includes a slot keyword sequence determination unit 2023_21 and a positive/negative segment selection unit 2023_22.
The slot keyword sequence determination unit 2023_21 is configured to determine the slot keyword sequences corresponding to each segment, where a slot keyword sequence corresponding to a segment is the sequence of slot keywords in one query covered by a candidate requirement template that contains the segment.
The positive/negative segment selection unit 2023_22 is configured to select positive segments and negative segments from the segments and determine their weights as follows:
(1) if all the slot keyword sequences corresponding to a segment are the same, the segment is taken as a negative segment with a weight of 1;
(2) if the slot keyword sequences corresponding to a segment are not all the same, but the proportion P of one slot keyword sequence among all the slot keyword sequences of the segment is greater than a preset first threshold, the segment is taken as a negative segment whose weight is the proportion P;
(3) the number of distinct slot keyword sequences corresponding to each candidate requirement template contained in the specific field is determined and the maximum Z_1 of these numbers is obtained; if a segment satisfies neither condition (1) nor condition (2), and the ratio of the number Z_2 of distinct slot keyword sequences corresponding to the segment to Z_1 is greater than a preset second threshold, the segment is taken as a positive segment whose weight is the ratio of Z_2 to Z_1.
The template vector generation unit 2023_3 is configured to, when extracting the boundary word feature of the candidate requirement template W, determine a weight of the sliced segment of W and construct a vector of W using the sliced segment of W and the weight of the sliced segment. Preferably, the template vector generation unit 2023_3, when determining the weight of the sliced piece of W, counts the number of times the sliced piece of W appears in W, and takes the number of times as the weight of the corresponding sliced piece.
The similarity calculation unit 2023_4 is configured to calculate the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and to obtain the boundary word feature of W from the difference between S1 and S2.
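The computation performed by units 2023_3 and 2023_4 can be sketched as follows. Cosine similarity is used here as an assumption, since this passage does not name the similarity measure; the function names and the dictionary representation of the vectors are likewise hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def boundary_word_feature(segments_of_w, positive_vec, negative_vec):
    """Return S1 - S2 for a candidate requirement template W.

    segments_of_w: list of the sliced segments of W (repeats allowed).
    positive_vec / negative_vec: {segment: weight} for the specific field.
    """
    # The weight of each segment of W is the number of times it appears in W.
    w_vec = Counter(segments_of_w)
    s1 = cosine(w_vec, positive_vec)  # similarity to the positive vector
    s2 = cosine(w_vec, negative_vec)  # similarity to the negative vector
    return s1 - s2
```

Under this sketch, a template whose segments resemble the positive segments gets a feature value close to 1, and one that resembles the negative segments gets a value close to -1.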
Please continue to refer to fig. 4. The sorting unit 203 is configured to sort the candidate requirement templates by using the features extracted by the feature extraction unit 202. The sorting unit 203 includes a standard template set selecting unit 2031, a training unit 2032, and a calculating and sorting unit 2033.
The standard template set selecting unit 2031 is configured to select a standard template set from the candidate requirement templates. Referring to fig. 7, fig. 7 is a block diagram of the standard template set selecting unit according to an embodiment of the present invention. As shown in fig. 7, the standard template set selecting unit 2031 includes a template set determining unit 2031_1 and an intersection unit 2031_2. The template set determining unit 2031_1 is configured to rank the candidate requirement templates by feature value separately for each extracted feature, and to take the candidate requirement templates ranked in the top N3 for each feature as the template set of the corresponding feature, where N3 is a positive integer. The intersection unit 2031_2 is configured to take the intersection of the template sets of the features as the standard template set.
Please continue to refer to fig. 4. The training unit 2032 is configured to train the parameters corresponding to the extracted features using the standard template set, and to take, as the weights of the corresponding features, the parameter values reached when the ranking of the templates in the standard template set among all candidate requirement templates can no longer be advanced in training.
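A minimal sketch of the top-N3 intersection follows. It assumes that a larger feature value ranks earlier and that the feature values are held in nested dictionaries; both are assumptions of this example, as is the function name.

```python
def standard_template_set(feature_values, n3):
    """Intersect the per-feature top-N3 template sets.

    feature_values: {feature_name: {template: value}}.
    n3: number of top-ranked templates kept per feature (N3 in the text).
    """
    top_sets = []
    for values in feature_values.values():
        # Rank templates by this feature, highest value first (assumed order).
        ranked = sorted(values, key=values.get, reverse=True)
        top_sets.append(set(ranked[:n3]))
    return set.intersection(*top_sets) if top_sets else set()
```

Only templates that rank well on every feature survive the intersection, which is why the result can serve as a small, high-confidence reference set.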
The calculating and ranking unit 2033 is configured to calculate scores of the candidate requirement templates using the features extracted by the feature extraction unit 202 and the weights of the features obtained by the training unit 2032, and rank the candidate requirement templates according to the scores. Preferably, each candidate demand template is ranked by score from high to low.
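The scoring step can be illustrated as below. The passage does not state how features and weights are combined, so a weighted sum is assumed here; `rank_templates` and the data layout are hypothetical.

```python
def rank_templates(features, weights):
    """Score each candidate requirement template and sort from high to low.

    features: {template: {feature_name: value}}.
    weights: {feature_name: weight}, e.g. as produced by the training step.
    A linear combination is an assumption of this sketch.
    """
    scores = {
        template: sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for template, feats in features.items()
    }
    # Highest score first, matching the preferred ordering in the text.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```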
The selecting unit 204 is configured to select final requirement templates from the candidate requirement templates as the requirement templates of the specific field according to the sorting result of the sorting unit 203. Preferably, the selecting unit 204 includes a first selecting unit 2041 and a second selecting unit 2042. The first selecting unit 2041 is configured to select the candidate requirement templates ranked in the top N4 as final requirement templates, where N4 is a positive integer. The second selecting unit 2042 is configured to obtain a keyword set from the boundary words of the candidate requirement templates ranked in the top M2, and to select, from the candidate requirement templates ranked after the top N4, those whose boundary words all belong to the keyword set as final requirement templates, where the boundary words are the words that are not generalized in a candidate requirement template, the keywords are words synonymous with the boundary words or words whose mutual information with the boundary words meets a requirement, M2 is a positive integer, and M2 is less than or equal to N4.
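The two selection steps can be sketched together as follows. The `expand` callback, which maps a boundary word to its keywords (synonyms or words with sufficient mutual information), is a hypothetical stand-in for resources the description does not detail; whether a boundary word counts as its own keyword is left to that callback.

```python
def select_final_templates(ranked, boundary_words, expand, n4, m2):
    """Pick final requirement templates from a ranked candidate list.

    ranked: candidate templates, best first.
    boundary_words: {template: set of non-generalized words}.
    expand: function(word) -> set of keywords for that boundary word.
    n4, m2: positive integers with m2 <= n4 (N4 and M2 in the text).
    """
    if not 0 < m2 <= n4:
        raise ValueError("require 0 < m2 <= n4")
    # First selection: keep the top-N4 templates outright.
    final = list(ranked[:n4])
    # Build the keyword set from the boundary words of the top-M2 templates.
    keyword_set = set()
    for template in ranked[:m2]:
        for word in boundary_words[template]:
            keyword_set |= expand(word)
    # Second selection: lower-ranked templates whose boundary words
    # all belong to the keyword set.
    for template in ranked[n4:]:
        if boundary_words[template] <= keyword_set:
            final.append(template)
    return final
```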
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (28)

1. A method for generating a domain requirements template, the method comprising:
A. acquiring a candidate demand template in a specific field;
B. extracting features of the candidate requirement template, wherein the features at least comprise: at least one of a similarity characteristic for representing the closeness between the candidate demand template and the specific field, a generalization capability characteristic for representing the capability of the candidate demand template covering the query of the user search request, and a boundary word characteristic for representing the influence of the non-generalization words in the candidate demand template on the correctness of the candidate demand template;
C. sorting the candidate demand templates by using the extracted features;
D. and selecting the final demand template from the candidate demand templates according to the sequencing result to serve as the demand template in the specific field.
2. The method of claim 1, wherein step A comprises:
A1. selecting a query matched with a preset limiting word in the specific field from user queries from the search logs;
A2. and replacing the part matched with the preset slot keyword in the specific field in the selected query with a wildcard character to obtain a candidate demand template.
3. The method of claim 2, further comprising, after said step A2: filtering out, according to a preset requirement on the number of slots in the specific field, the candidate requirement templates that do not meet the requirement on the number of slots from the candidate requirement templates obtained in step A2.
4. The method of claim 1, wherein the step of extracting similarity features of the candidate requirement templates W comprises:
acquiring the core word vector of the W and the core word vector of the specific field;
and calculating the similarity between the core word vector of the W and the core word vector of the specific field, and taking the similarity as the similarity characteristic of the W.
5. The method of claim 4, wherein the step of obtaining the core word vector of W comprises:
selecting, from the queries covered by W in the search log, the N1 queries with the most query times, and determining core words and the weights of the core words from the search results returned by a search engine for the N1 queries to form the core word vector of W, where N1 is a positive integer.
6. The method of claim 4, wherein the step of obtaining the domain-specific core word vector comprises:
and acquiring a search result returned by a search engine by using the seed query in the specific field, and determining the core words and the weights of the core words in the search result to form the core word vector in the specific field.
7. The method of claim 6, wherein the obtaining manner of the domain-specific seed query comprises:
mode one: selecting, from all candidate requirement templates contained in the specific field, the N2 candidate requirement templates that cover the most queries in the search log, and for each of the N2 candidate requirement templates, selecting the M1 queries with the most query times from the queries covered by that template as seed queries, where N2 and M1 are positive integers; or,
mode two: combining a preset slot keyword of the specific field with a preset limiting word of the specific field to generate a seed query of the specific field; or,
mode three: after some seed queries are selected using mode one, replacing the slot keywords in the seed queries selected by mode one with other slot keywords from a preset slot keyword dictionary of the specific field to obtain expanded seed queries; the seed queries selected by mode one and the expanded seed queries together constitute the seed queries of the specific field.
8. The method of claim 1, wherein the step of extracting generalization capability features of the candidate requirement template W comprises:
determining the slot keyword sequences corresponding to W, counting the number of mutually different slot keyword sequences among the slot keyword sequences corresponding to W, and calculating the generalization capability feature of W from that number, wherein one slot keyword sequence corresponding to W is the sequence formed by the slot keywords in one query covered by W in a search log.
9. The method of claim 1, wherein the step of extracting the boundary word feature of the candidate requirement template W comprises:
segmenting all candidate requirement templates contained in the specific field into segments, selecting positive segments from the obtained segmented segments and determining the weight of each positive segment to generate a positive vector of the specific field, and selecting negative segments from the obtained segmented segments and determining the weight of each negative segment to generate a negative vector of the specific field;
determining the weight of the segmentation segment of the W and using the segmentation segment of the W and the weight of the segmentation segment to form a vector of the W;
calculating the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtaining the boundary word feature of W from the difference between S1 and S2.
10. The method according to claim 9, wherein the domain-specific positive and negative vectors are generated by a process comprising:
determining a slot keyword sequence corresponding to each segmentation segment, wherein one slot keyword sequence corresponding to one segmentation segment is a sequence consisting of slot keywords in one query covered by one candidate requirement template of the segmentation segment;
T1, if all the slot keyword sequences corresponding to a segmentation segment are the same, taking the segmentation segment as a negative segment, wherein the weight of the negative segment is 1;
T2, if the slot keyword sequences corresponding to a segmentation segment are not all the same, but the proportion P of one slot keyword sequence among all the slot keyword sequences of the segmentation segment is greater than a preset first threshold, taking the segmentation segment as a negative segment, and taking the proportion P as the weight of the negative segment;
T3, determining the number of mutually different slot keyword sequences corresponding to each candidate demand template contained in the specific field, and obtaining the maximum value Z1 among these numbers; if a segmentation segment satisfies neither the condition in T1 nor the condition in T2, and the ratio of the number Z2 of mutually different slot keyword sequences corresponding to the segmentation segment to Z1 is greater than a preset second threshold, taking the segmentation segment as a positive segment, wherein the weight of the positive segment is the ratio of Z2 to Z1.
11. The method of claim 9, wherein the step of determining the weight of the sliced segment of W comprises:
and counting the occurrence times of the segmentation segments of the W in the W and taking the times as the weight of the corresponding segmentation segments.
12. The method of claim 1, wherein step C comprises:
selecting a standard template set from the candidate demand templates;
training the parameters corresponding to each extracted feature using the standard template set, and taking, as the weights of the corresponding features, the parameter values reached when the ranking of the templates in the standard template set among all candidate demand templates can no longer be advanced in training;
and calculating the score of the candidate demand templates by using the extracted features and the weight of the features, and sequencing the candidate demand templates according to the score.
13. The method of claim 12, wherein the step of selecting a set of standard templates from the candidate requirement templates comprises:
sorting the candidate demand templates by feature value separately for each extracted feature, and taking the candidate demand templates ranked in the top N3 for each feature as the template set of the corresponding feature, wherein N3 is a positive integer;
and taking the intersection set among the template sets of the features as a standard template set.
14. The method of claim 1, wherein step D comprises:
selecting the candidate demand templates ranked in the top N4 as final demand templates, wherein N4 is a positive integer;
obtaining a keyword set from the boundary words of the candidate demand templates ranked in the top M2, and selecting, from the candidate demand templates ranked after the top N4, those whose boundary words all belong to the keyword set as final demand templates, wherein the boundary words are the words that are not generalized in a candidate demand template, the keywords are words synonymous with the boundary words or words whose mutual information with the boundary words meets a requirement, M2 is a positive integer, and M2 is less than or equal to N4.
15. An apparatus for generating a domain requirements template, the apparatus comprising:
a candidate template obtaining unit, configured to obtain a candidate demand template in a specific field;
the feature extraction unit is used for extracting features of the candidate demand template, wherein the feature extraction unit at least comprises one of a similarity feature extraction unit, a generalization capability feature extraction unit or a boundary word feature extraction unit, the similarity feature extraction unit is used for extracting similarity features representing the closeness between the candidate demand template and the specific field, the generalization capability feature extraction unit is used for extracting generalization capability features representing the ability of the candidate demand template covering the search request query of the user, and the boundary word feature extraction unit is used for extracting boundary word features representing the influence of non-generalized words in the candidate demand template on the correctness of the candidate demand template;
the sorting unit is used for sorting the candidate demand templates by using the features extracted by the feature extraction unit;
and the selecting unit is used for selecting the final requirement template from the candidate requirement templates as the requirement template in the specific field according to the sorting result of the sorting unit.
16. The apparatus of claim 15, wherein the candidate template retrieving unit comprises:
the restriction unit is used for selecting the query matched with the preset restriction words in the specific field from the user queries from the search logs;
and the generalization unit is used for replacing the part matched with the preset slot keyword in the specific field in the query selected by the limitation unit with a wildcard character to obtain a candidate demand template.
17. The apparatus according to claim 16, wherein the candidate template obtaining unit further includes a filtering unit, configured to filter, according to a preset slot number requirement for the specific field, a candidate requirement template that does not meet the slot number requirement from the candidate requirement templates obtained by the generalization unit.
18. The apparatus according to claim 15, wherein the similarity feature extraction unit includes:
the template word vector generating unit is used for acquiring a core word vector of the candidate requirement template W when the similarity characteristic of the W is extracted;
the domain word vector generating unit is used for acquiring a core word vector of the specific domain;
and the calculating unit is used for calculating the similarity between the core word vector of the W and the core word vector of the specific field, and taking the similarity as the similarity characteristic of the W.
19. The apparatus of claim 18, wherein the template word vector generation unit is configured to select, from the queries covered by W in a search log, the N1 queries with the most query times, and to determine core words and the weights of the core words from the search results returned by a search engine for the N1 queries to form the core word vector of W, wherein N1 is a positive integer.
20. The apparatus of claim 18, wherein the domain word vector generating unit obtains the search result returned by the search engine by using the domain-specific seed query, and determines the core words and the weights of the core words in the search result to form the domain-specific core word vector.
21. The apparatus of claim 20, wherein the manner of obtaining the domain-specific seed query by the domain-word-vector generating unit comprises:
mode one: selecting, from all candidate requirement templates contained in the specific field, the N2 candidate requirement templates that cover the most queries in the search log, and for each of the N2 candidate requirement templates, selecting the M1 queries with the most query times from the queries covered by that template as seed queries, where N2 and M1 are positive integers; or,
mode two: combining a preset slot keyword of the specific field with a preset limiting word of the specific field to generate a seed query of the specific field; or,
mode three: after some seed queries are selected using mode one, replacing the slot keywords in the seed queries selected by mode one with other slot keywords from a preset slot keyword dictionary of the specific field to obtain expanded seed queries; the seed queries selected by mode one and the expanded seed queries together constitute the seed queries of the specific field.
22. The apparatus according to claim 15, wherein the generalization capability feature extraction unit determines the slot keyword sequence corresponding to W when extracting the generalization capability feature of the candidate requirement template W, counts the number of different slot keyword sequences in the slot keyword sequence corresponding to W, and calculates the generalization capability feature of W according to the number, wherein one slot keyword sequence of W is a sequence consisting of slot keywords in one query covered by W in a search log.
23. The apparatus of claim 15, wherein the boundary word feature extraction unit comprises:
the segmentation unit is used for segmenting all candidate requirement templates contained in the specific field into segments;
the positive and negative vector generating unit is used for selecting positive segments from all the segmentation segments obtained by the segmentation unit and determining the weight of the positive segments to generate a positive vector of the specific field, and selecting negative segments from all the segmentation segments and determining the weight of each negative segment to generate a negative vector of the specific field;
the template vector generating unit is used for determining the weight of the segmentation segment of the W and forming the vector of the W by using the segmentation segment of the W and the weight of the segmentation segment when the boundary word feature of the candidate requirement template W is extracted;
a similarity calculation unit, configured to calculate the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and to obtain the boundary word feature of W from the difference between S1 and S2.
24. The apparatus of claim 23, wherein the positive-negative vector generation unit comprises:
a slot keyword sequence determining unit, configured to determine the slot keyword sequences corresponding to each segmentation segment, wherein one slot keyword sequence corresponding to a segmentation segment is the sequence formed by the slot keywords in one query covered by one candidate requirement template containing the segmentation segment;
the positive and negative segment selecting unit is used for selecting a positive segment and a negative segment from the segmentation segments and determining the weight of the positive segment and the negative segment according to the following modes:
T1, if all the slot keyword sequences corresponding to a segmentation segment are the same, taking the segmentation segment as a negative segment, wherein the weight of the negative segment is 1;
T2, if the slot keyword sequences corresponding to a segmentation segment are not all the same, but the proportion P of one slot keyword sequence among all the slot keyword sequences of the segmentation segment is greater than a preset first threshold, taking the segmentation segment as a negative segment, and taking the proportion P as the weight of the negative segment;
T3, determining the number of mutually different slot keyword sequences corresponding to each candidate demand template contained in the specific field, and obtaining the maximum value Z1 among these numbers; if a segmentation segment satisfies neither the condition in T1 nor the condition in T2, and the ratio of the number Z2 of mutually different slot keyword sequences corresponding to the segmentation segment to Z1 is greater than a preset second threshold, taking the segmentation segment as a positive segment, wherein the weight of the positive segment is the ratio of Z2 to Z1.
25. The apparatus according to claim 23, wherein the template vector generation unit, when determining the weight of a segmentation segment of W, counts the number of times the segmentation segment appears in W and takes that number as the weight of the segmentation segment.
26. The apparatus of claim 15, wherein the sorting unit comprises:
the standard template set selecting unit is used for selecting a standard template set from the candidate demand templates;
the training unit is used for training the parameters corresponding to the extracted features by using the standard template set, and taking, as the weights of the corresponding features, the parameter values reached when the ranking of the templates in the standard template set among all candidate requirement templates can no longer be advanced in training;
and the calculating and sorting unit is used for calculating the score of the candidate demand template by using the features extracted by the feature extraction unit and the weight of each feature obtained by the training unit and sorting the candidate demand template according to the score.
27. The apparatus of claim 26, wherein the standard template set selection unit comprises:
a template set determining unit, configured to sort the candidate requirement templates by feature value separately for each extracted feature, and to take the candidate requirement templates ranked in the top N3 for each feature as the template set of the corresponding feature, wherein N3 is a positive integer;
and the intersection unit is used for taking the intersection among the template sets of the features as a standard template set.
28. The apparatus of claim 15, wherein the selecting unit comprises:
a first selection unit, configured to select the candidate demand templates ranked in the top N4 as final demand templates, wherein N4 is a positive integer;
a second selection unit, configured to obtain a keyword set from the boundary words of the candidate demand templates ranked in the top M2, and to select, from the candidate demand templates ranked after the top N4, those whose boundary words all belong to the keyword set as final demand templates, wherein the boundary words are the words that are not generalized in a candidate demand template, the keywords are words synonymous with the boundary words or words whose mutual information with the boundary words meets a requirement, M2 is a positive integer, and M2 is less than or equal to N4.
CN201110308830.7A 2011-10-12 A kind of method generating domain requirement masterplate and device thereof Active CN102368260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110308830.7A CN102368260B (en) 2011-10-12 A kind of method generating domain requirement masterplate and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110308830.7A CN102368260B (en) 2011-10-12 A kind of method generating domain requirement masterplate and device thereof

Publications (2)

Publication Number Publication Date
CN102368260A true CN102368260A (en) 2012-03-07
CN102368260B CN102368260B (en) 2016-12-14

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136221A (en) * 2011-11-24 2013-06-05 北京百度网讯科技有限公司 Method capable of generating requirement template and requirement identification method and device
CN103823809A (en) * 2012-11-16 2014-05-28 百度在线网络技术(北京)有限公司 Query phrase classification method and device, and classification optimization method and device
CN105183721A (en) * 2015-08-13 2015-12-23 小米科技有限责任公司 Template construction method, and information extraction method and device
CN106971728A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of quick identification vocal print method and system
CN107480139A (en) * 2017-08-16 2017-12-15 深圳市空谷幽兰人工智能科技有限公司 The bulk composition extracting method and device of medical field
CN108228637A (en) * 2016-12-21 2018-06-29 中国电信股份有限公司 Natural language client auto-answer method and system
WO2020019565A1 (en) * 2018-07-27 2020-01-30 天津字节跳动科技有限公司 Search sorting method and apparatus, and electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6516312B1 (en) * 2000-04-04 2003-02-04 International Business Machine Corporation System and method for dynamically associating keywords with domain-specific search engine queries
CN1514387A (en) * 2002-12-31 2004-07-21 中国科学院计算技术研究所 Sound distinguishing method in speech sound inquiry
CN101216853A (en) * 2008-01-11 2008-07-09 孟小峰 Intelligent web enquiry interface system and its method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘亮亮等: "基于查询模板的特定领域中文问答系统的研究与实现", 《江苏科技大学学报(自然科学版)》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136221A (en) * 2011-11-24 2013-06-05 北京百度网讯科技有限公司 Method capable of generating requirement template and requirement identification method and device
CN103823809A (en) * 2012-11-16 2014-05-28 百度在线网络技术(北京)有限公司 Query phrase classification method and device, and classification optimization method and device
CN103823809B (en) * 2012-11-16 2018-06-08 百度在线网络技术(北京)有限公司 A kind of method, the method for Classified optimization and its device to query phrase classification
CN105183721A (en) * 2015-08-13 2015-12-23 小米科技有限责任公司 Template construction method, and information extraction method and device
CN106971728A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of quick identification vocal print method and system
CN108228637A (en) * 2016-12-21 2018-06-29 中国电信股份有限公司 Natural language client auto-answer method and system
CN108228637B (en) * 2016-12-21 2020-09-04 中国电信股份有限公司 Automatic response method and system for natural language client
CN107480139A (en) * 2017-08-16 2017-12-15 深圳市空谷幽兰人工智能科技有限公司 The bulk composition extracting method and device of medical field
WO2020019565A1 (en) * 2018-07-27 2020-01-30 天津字节跳动科技有限公司 Search sorting method and apparatus, and electronic device and storage medium
US11481402B2 (en) 2018-07-27 2022-10-25 Tianjin Bytedance Technology Co., Ltd. Search ranking method and apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant