
CN102368260A - Method and device of producing domain required template



Publication number
CN102368260A
Authority
CN
Grant status
Application
Patent type
Prior art keywords
required
template
domain
candidate
templates
Prior art date
Application number
CN 201110308830
Other languages
Chinese (zh)
Other versions
CN102368260B (en)
Inventor
时迎超
柴春光
黄际洲
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Abstract

The invention provides a method and a device for generating domain requirement templates. The method comprises the following steps: A, obtaining candidate requirement templates of a specific domain; B, extracting features of the candidate requirement templates; C, ranking the candidate requirement templates according to the extracted features; and D, selecting final requirement templates from the candidate requirement templates as the requirement templates of the specific domain. This yields a general method for generating high-quality domain requirement templates, which helps a search engine understand the purpose behind user behavior.

Description

Method and apparatus for generating domain requirement templates

TECHNICAL FIELD

[0001] The present invention relates to natural language processing technology, and in particular to a method and apparatus for generating domain requirement templates.

BACKGROUND

[0002] Search engines make it far easier for people to find the information they need. A traditional search engine serves users by looking up an index that contains the user's search keywords and returning the pages that match those keywords. For example, if a user's search request (query) is "北京汽车4S店招聘销售主管" ("Beijing auto 4S dealership hiring sales manager"), the results include a search results page of a recruitment website; the user can click through to the recruitment website, fill in the relevant information there, and search within the site to get the information actually needed. If the search engine could better understand the user's real purpose when searching, it could more accurately return information that truly meets the user's needs. Natural language processing is therefore very important to search engines. In natural language processing, domain-based requirement templates can be used to identify the purpose of a user's search. For example, if the user's query is "大钟寺到西单怎么走" ("how to get from Dazhongsi to Xidan") and the query matches a requirement template of the transportation domain, the engine can infer that the user has a transportation-domain requirement and can directly return transportation-related applications to the user. Whether high-quality domain requirement templates can be produced is thus very important for a search engine to correctly understand users' search intent.

[0003] In the past, domain requirement templates were usually generated with a different mining method for each application. This not only wastes considerable manpower and resources; such methods also adapt poorly and are hard to change as the application changes.

SUMMARY

[0004] The technical problem to be solved by the present invention is to provide a method and apparatus for generating domain requirement templates, so as to overcome the poor adaptability of domain requirement templates generated with the prior art.

[0005] To solve this technical problem, the present invention provides a method for generating domain requirement templates, comprising: A. obtaining candidate requirement templates of a specific domain; B. extracting features of the candidate requirement templates, the features including at least one of: a similarity feature characterizing how closely a candidate requirement template is tied to the specific domain, a generalization-ability feature characterizing a candidate requirement template's ability to cover user search requests (queries), and a boundary-word feature characterizing the effect that the non-generalized words in a candidate requirement template have on the template's correctness; C. ranking the candidate requirement templates using the extracted features; D. selecting, according to the ranking result, final requirement templates from the candidate requirement templates as the requirement templates of the specific domain.

[0006] According to a preferred embodiment of the present invention, step A comprises: A1. selecting, from the user queries in a search log, the queries that match preset qualifiers of the specific domain; A2. replacing the parts of the selected queries that match preset slot keywords of the specific domain with wildcards, to obtain candidate requirement templates.
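Steps A1 and A2 can be illustrated with a minimal Python sketch. The function name `generate_candidates`, the `[slot]` wildcard token, substring matching, and the English sample queries are all assumptions made for illustration; the source does not specify how matching or the wildcard is implemented.

```python
def generate_candidates(query_log, qualifiers, slot_keywords, wildcard="[slot]"):
    """Return a dict mapping each candidate template to the queries it covers.

    Hypothetical helper: matching is plain substring matching (an assumption).
    """
    candidates = {}
    # Replace longer slot keywords first so a short keyword cannot split a long one.
    ordered = sorted(slot_keywords, key=len, reverse=True)
    for query in query_log:
        # A1: keep only queries that contain a qualifier of the domain.
        if not any(q in query for q in qualifiers):
            continue
        # A2: replace every slot-keyword occurrence with the wildcard.
        template = query
        for keyword in ordered:
            template = template.replace(keyword, wildcard)
        # Assumption: a query without any slot keyword yields no template.
        if template != query:
            candidates.setdefault(template, []).append(query)
    return candidates


log = [
    "Dazhongsi to Xidan how to get there",
    "Xidan to Wangfujing how to get there",
    "Xidan weather today",   # no qualifier, dropped in A1
    "how to get rich",       # qualifier but no slot keyword
]
templates = generate_candidates(
    log, ["how to get"], ["Dazhongsi", "Xidan", "Wangfujing"]
)
```

Keeping the covered queries next to each template is a convenience of this sketch: later feature extraction needs to know which log queries a template covers.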

[0007] According to a preferred embodiment of the present invention, after step A2 the method further comprises: according to a preset requirement on the number of slots for the specific domain, filtering out, from the candidate requirement templates obtained in step A2, those that do not meet the slot-count requirement.

[0008] According to a preferred embodiment of the present invention, extracting the similarity feature of a candidate requirement template W comprises: obtaining the core-word vector of W and the core-word vector of the specific domain; computing the similarity between the core-word vector of W and the core-word vector of the specific domain, and taking this similarity as the similarity feature of W.

[0009] According to a preferred embodiment of the present invention, obtaining the core-word vector of W comprises: selecting, from the queries covered by W in the search log, the N1 queries with the highest query counts, and determining core words and their weights from the search results that the search engine returns for those N1 queries, to form the core-word vector of W, where N1 is a positive integer.

[0010] According to a preferred embodiment of the present invention, obtaining the core-word vector of the specific domain comprises: using seed queries of the specific domain to obtain the search results returned by the search engine, and determining core words and their weights in those search results, to form the core-word vector of the specific domain.
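The similarity feature compares two weighted core-word vectors. The sketch below is one possible reading, not the patent's stated implementation: it assumes core words are the most frequent whitespace-separated tokens of the search-result text, that weights are relative frequencies, and that the similarity is cosine similarity. The names `core_word_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def core_word_vector(result_texts, top_k=20):
    """Build a {core_word: weight} vector from search-result texts.

    Assumption: core words are the top_k most frequent tokens and weights
    are relative frequencies; the source leaves both choices open.
    """
    counts = Counter(tok.lower() for text in result_texts for tok in text.split())
    top = counts.most_common(top_k)
    total = sum(count for _, count in top)
    return {tok: count / total for tok, count in top} if total else {}


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(tok, 0.0) for tok, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up result snippets, for illustration only.
domain_vec = core_word_vector(["bus route bus stop", "subway route"])
template_vec = core_word_vector(["bus route timetable"])
unrelated_vec = core_word_vector(["song lyrics download"])
similarity_feature = cosine_similarity(template_vec, domain_vec)
```

A template whose top queries retrieve pages sharing vocabulary with the domain's seed-query results scores close to 1; an unrelated template scores 0.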

[0011] According to a preferred embodiment of the present invention, the seed queries of the specific domain are obtained in one of the following ways. Way 1: from all candidate requirement templates of the specific domain, select the N2 candidate requirement templates that cover the most queries in the search log, and for each of these N2 templates select, from the queries it covers, the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers. Way 2: combine preset slot keywords of the specific domain with preset qualifiers of the specific domain to generate the seed queries of the specific domain. Way 3: after selecting part of the seed queries using Way 1, use a preset slot-keyword dictionary of the specific domain to replace the slot keywords in the seed queries selected by Way 1 with other slot keywords from the dictionary, obtaining expanded seed queries; the partial seed queries and the expanded seed queries together constitute the seed queries of the specific domain.
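Ways 1 and 2 of seed-query selection can be sketched as follows. The data layout (`coverage` mapping a template to its covered queries, `query_counts` mapping a query to its log frequency) and the space-separated concatenation used in Way 2 are assumptions for illustration; the source only says the slot keywords and qualifiers are "combined".

```python
from itertools import product


def seed_queries_way1(coverage, query_counts, n2, m1):
    """Way 1: top-N2 templates by coverage, then top-M1 queries per template."""
    top_templates = sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)[:n2]
    seeds = []
    for template in top_templates:
        ranked = sorted(
            coverage[template], key=lambda q: query_counts.get(q, 0), reverse=True
        )
        for query in ranked[:m1]:
            if query not in seeds:  # avoid duplicate seeds
                seeds.append(query)
    return seeds


def seed_queries_way2(slot_keywords, qualifiers):
    """Way 2: combine slot keywords with qualifiers (assumed: simple concatenation)."""
    return [f"{keyword} {qualifier}" for keyword, qualifier in product(slot_keywords, qualifiers)]


coverage = {"A": ["q1", "q2", "q3"], "B": ["q4"], "C": ["q5", "q6"]}
query_counts = {"q1": 5, "q2": 9, "q3": 1, "q4": 100, "q5": 2, "q6": 3}
seeds = seed_queries_way1(coverage, query_counts, n2=2, m1=1)
```

Note that template B is skipped even though its single query is very frequent: Way 1 first ranks templates by how many queries they cover, and only then ranks queries by frequency.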

[0012] According to a preferred embodiment of the present invention, extracting the generalization-ability feature of a candidate requirement template W comprises: determining the slot-keyword sequences corresponding to W, counting the number of distinct slot-keyword sequences among them, and computing the generalization-ability feature of W from that number, where one slot-keyword sequence corresponding to W is the sequence formed by the slot keywords in one query covered by W in the search log.
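The counting step behind the generalization-ability feature can be sketched in a few lines. This is an illustration under assumptions: slot keywords are located by regular-expression matching in order of appearance, and the feature is taken to be the raw count of distinct sequences (the source only says the feature is computed "from that number"). The function names are hypothetical.

```python
import re


def slot_keyword_sequence(query, slot_keywords):
    """Return the slot keywords of a query, in order of appearance, as a tuple."""
    if not slot_keywords:
        return ()
    # Longest alternatives first so longer keywords win over their prefixes.
    ordered = sorted(slot_keywords, key=len, reverse=True)
    pattern = "|".join(re.escape(keyword) for keyword in ordered)
    return tuple(re.findall(pattern, query))


def generalization_feature(covered_queries, slot_keywords):
    """Number of distinct slot-keyword sequences among the covered queries."""
    sequences = {slot_keyword_sequence(q, slot_keywords) for q in covered_queries}
    return len(sequences)


covered = [
    "Dazhongsi to Xidan how to get there",
    "Xidan to Dazhongsi how to get there",
    "Dazhongsi to Xidan how to get there",  # repeats the first sequence
]
feature = generalization_feature(covered, ["Dazhongsi", "Xidan"])
```

Order matters here: (Dazhongsi, Xidan) and (Xidan, Dazhongsi) count as two different sequences, so a template filled with many different slot values scores higher than one that always sees the same values.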

[0013] According to a preferred embodiment of the present invention, extracting the boundary-word feature of a candidate requirement template W comprises: splitting all candidate requirement templates of the specific domain into fragments; selecting positive fragments from the resulting fragments and determining the weight of each positive fragment to generate the positive vector of the specific domain, and selecting negative fragments from the resulting fragments and determining the weight of each negative fragment to generate the negative vector of the specific domain; determining the weights of the fragments of W and forming the vector of W from the fragments of W and their weights; computing the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtaining the boundary-word feature of W from the difference between S1 and S2.

[0014] According to a preferred embodiment of the present invention, generating the positive and negative vectors of the specific domain comprises: determining the slot-keyword sequences corresponding to each fragment, where one slot-keyword sequence corresponding to a fragment is the sequence formed by the slot keywords in one query covered by one candidate requirement template that contains the fragment; T1. if all slot-keyword sequences corresponding to a fragment are identical, taking the fragment as a negative fragment with weight 1; T2. if the slot-keyword sequences corresponding to a fragment are not all identical, but there is one slot-keyword sequence whose proportion P among all slot-keyword sequences of the fragment is greater than a preset first threshold, taking the fragment as a negative fragment with weight P; T3. determining, for each candidate requirement template of the specific domain, the number of distinct slot-keyword sequences corresponding to it, and taking the maximum of these numbers as Z1; if a fragment satisfies neither the condition in T1 nor the condition in T2, and the ratio of the number Z2 of distinct slot-keyword sequences corresponding to the fragment to Z1 is greater than a preset second threshold, taking the fragment as a positive fragment whose weight is the ratio of Z2 to Z1.
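Rules T1 to T3 amount to a small decision procedure per fragment. The following Python sketch follows the rules as stated; the function name `classify_fragment`, the return convention, and the threshold values in the example are assumptions for illustration.

```python
from collections import Counter


def classify_fragment(sequences, z1, first_threshold, second_threshold):
    """Classify one fragment from its slot-keyword sequences.

    sequences: list of tuples, one per query covered by a template containing
    the fragment. z1: maximum number of distinct sequences over all templates.
    Returns (label, weight), where label is "negative", "positive", or None.
    """
    if not sequences or z1 <= 0:
        return None, 0.0
    counts = Counter(sequences)
    distinct = len(counts)
    # T1: every sequence is the same, so the fragment never generalizes.
    if distinct == 1:
        return "negative", 1.0
    # T2: one sequence dominates with proportion P above the first threshold.
    p = max(counts.values()) / len(sequences)
    if p > first_threshold:
        return "negative", p
    # T3: many distinct sequences relative to the best template (Z2 / Z1).
    ratio = distinct / z1
    if ratio > second_threshold:
        return "positive", ratio
    return None, 0.0


always_same = classify_fragment([("a", "b")] * 3, z1=4,
                                first_threshold=0.8, second_threshold=0.5)
dominated = classify_fragment([("a",)] * 9 + [("b",)], z1=4,
                              first_threshold=0.8, second_threshold=0.5)
varied = classify_fragment([("a",), ("b",), ("c",), ("d",)], z1=4,
                           first_threshold=0.8, second_threshold=0.5)
```

Fragments that satisfy none of the three rules are left out of both vectors in this sketch, which matches the text's silence about them.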

[0015] According to a preferred embodiment of the present invention, determining the weights of the fragments of W comprises: counting the number of times each fragment of W occurs in W and taking that count as the weight of the corresponding fragment.
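Putting the fragment weights together with the domain's positive and negative vectors, the boundary-word feature can be sketched as S1 minus S2. The sketch assumes cosine similarity and takes the fragments of W as a ready-made list, since the source does not specify the similarity measure or the splitting procedure here; `boundary_word_feature` and the sample vectors are hypothetical.

```python
import math
from collections import Counter


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as dict-like objects."""
    dot = sum(w * vec_b.get(key, 0.0) for key, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def boundary_word_feature(template_fragments, positive_vector, negative_vector):
    """S1 - S2, where the weight of each fragment of W is its count in W."""
    w_vector = Counter(template_fragments)
    s1 = cosine(w_vector, positive_vector)
    s2 = cosine(w_vector, negative_vector)
    return s1 - s2


# Made-up domain vectors, for illustration only.
positive = {"how to get": 0.9}
negative = {"lyrics": 1.0}
good = boundary_word_feature(["how to get"], positive, negative)
bad = boundary_word_feature(["lyrics"], positive, negative)
```

The feature is positive when the non-generalized text of W resembles fragments that accompany many different slot fillers, and negative when it resembles fragments tied to a single fixed filler.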

[0016] According to a preferred embodiment of the present invention, step C comprises: selecting a standard template set from the candidate requirement templates; using the standard template set to train the parameter corresponding to each extracted feature, and taking as the weight of each feature the parameter value at which, during training, the rank of the standard-set templates among all candidate requirement templates cannot be improved further; computing a score for each candidate requirement template from the extracted features and their weights, and ranking the candidate requirement templates by score.

[0017] According to a preferred embodiment of the present invention, selecting the standard template set from the candidate requirement templates comprises: for each extracted feature, ranking the candidate requirement templates by feature value and taking the top N3 candidate requirement templates as the template set of that feature, where N3 is a positive integer; taking the intersection of the template sets of all features as the standard template set.
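The standard-set selection and the score-based ranking of step C can be sketched as below. The sketch assumes a linear weighted sum as the scoring function and takes the feature weights as given; the training procedure that produces the weights is not shown. Function names, feature names, and sample values are hypothetical.

```python
def standard_template_set(features, n3):
    """Intersect the per-feature top-N3 template sets.

    features: {template: {feature_name: value}}.
    """
    names = sorted({name for values in features.values() for name in values})
    top_sets = []
    for name in names:
        ranked = sorted(
            features, key=lambda t: features[t].get(name, 0.0), reverse=True
        )
        top_sets.append(set(ranked[:n3]))
    return set.intersection(*top_sets) if top_sets else set()


def rank_templates(features, weights):
    """Rank templates by a weighted sum of feature values (assumed linear score)."""
    def score(template):
        return sum(weights.get(n, 0.0) * v for n, v in features[template].items())
    return sorted(features, key=score, reverse=True)


features = {
    "T1": {"sim": 0.9, "gen": 0.8, "bound": 0.7},
    "T2": {"sim": 0.8, "gen": 0.9, "bound": 0.1},
    "T3": {"sim": 0.1, "gen": 0.2, "bound": 0.9},
    "T4": {"sim": 0.2, "gen": 0.1, "bound": 0.2},
}
standard = standard_template_set(features, n3=2)
ranking = rank_templates(features, {"sim": 1.0, "gen": 1.0, "bound": 1.0})
```

Only T1 is in the top two for every feature, so it alone forms the standard set in this example; a trainer would then look for weights that push T1 as high as possible in the overall ranking.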

[0018] According to a preferred embodiment of the present invention, step D comprises: selecting the candidate requirement templates ranked in the top N4 as final requirement templates, where N4 is a positive integer; obtaining a keyword set from the boundary words of the candidate requirement templates ranked in the top M2, and also selecting as final requirement templates those candidate requirement templates ranked after the top N4 whose boundary words all belong to the keyword set, where a boundary word is a word in a candidate requirement template that has not been generalized, a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word meets a requirement, and M2 is a positive integer less than or equal to N4.
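Step D can be sketched as a two-pass selection over the ranked list. Several details are assumptions: boundary words are taken to be the whitespace-separated tokens other than the wildcard, the keyword set is built from a hypothetical synonym dictionary (mutual information is not modeled), and the boundary words of the top-M2 templates are themselves included in the keyword set.

```python
def boundary_words(template, wildcard="[slot]"):
    """Non-generalized words of a template (assumed: whitespace tokenization)."""
    return [token for token in template.split() if token != wildcard]


def select_final(ranked, n4, m2, synonyms):
    """Select the top N4 templates plus later ones whose boundary words are all keywords.

    synonyms: hypothetical {word: [synonym, ...]} dictionary.
    """
    keyword_set = set()
    for template in ranked[:m2]:
        for word in boundary_words(template):
            keyword_set.add(word)  # assumption: boundary words count as keywords
            keyword_set.update(synonyms.get(word, ()))
    final = list(ranked[:n4])
    for template in ranked[n4:]:
        words = boundary_words(template)
        if words and all(word in keyword_set for word in words):
            final.append(template)
    return final


ranked = [
    "[slot] to [slot] route",
    "[slot] weather",
    "[slot] to [slot] path",
    "[slot] lyrics",
]
final = select_final(ranked, n4=1, m2=1, synonyms={"route": ["path"]})
```

The second pass lets a lower-ranked template such as "[slot] to [slot] path" through because it differs from a trusted top template only by a synonym.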

[0019] The present invention further provides an apparatus for generating domain requirement templates, comprising: a candidate template obtaining unit, configured to obtain candidate requirement templates of a specific domain; a feature extraction unit, configured to extract features of the candidate requirement templates, where the feature extraction unit includes at least one of a similarity feature extraction unit, a generalization-ability feature extraction unit, and a boundary-word feature extraction unit, the similarity feature extraction unit being configured to extract a similarity feature characterizing how closely a candidate requirement template is tied to the specific domain, the generalization-ability feature extraction unit being configured to extract a generalization-ability feature characterizing a candidate requirement template's ability to cover user search requests (queries), and the boundary-word feature extraction unit being configured to extract a boundary-word feature characterizing the effect that the non-generalized words in a candidate requirement template have on the template's correctness; a ranking unit, configured to rank the candidate requirement templates using the features extracted by the feature extraction unit; and a selection unit, configured to select, according to the ranking result of the ranking unit, final requirement templates from the candidate requirement templates as the requirement templates of the specific domain.

[0020] According to a preferred embodiment of the present invention, the candidate template obtaining unit comprises: a qualifying unit, configured to select, from the user queries in a search log, the queries that match preset qualifiers of the specific domain; and a generalization unit, configured to replace the parts of the queries selected by the qualifying unit that match preset slot keywords of the specific domain with wildcards, to obtain candidate requirement templates.

[0021] According to a preferred embodiment of the present invention, the candidate template obtaining unit further comprises a filtering unit, configured to filter out, according to a preset requirement on the number of slots for the specific domain, those candidate requirement templates obtained by the generalization unit that do not meet the slot-count requirement.

[0022] According to a preferred embodiment of the present invention, the similarity feature extraction unit comprises: a template word vector generation unit, configured to obtain the core-word vector of a candidate requirement template W when the similarity feature of W is extracted; a domain word vector generation unit, configured to obtain the core-word vector of the specific domain; and a computation unit, configured to compute the similarity between the core-word vector of W and the core-word vector of the specific domain and to take this similarity as the similarity feature of W.

[0023] According to a preferred embodiment of the present invention, the template word vector generation unit selects, from the queries covered by W in the search log, the N1 queries with the highest query counts, and determines core words and their weights from the search results that the search engine returns for those N1 queries, to form the core-word vector of W, where N1 is a positive integer.

[0024] According to a preferred embodiment of the present invention, the domain word vector generation unit uses seed queries of the specific domain to obtain the search results returned by the search engine, and determines core words and their weights in those search results, to form the core-word vector of the specific domain.

[0025] According to a preferred embodiment of the present invention, the domain word vector generation unit obtains the seed queries of the specific domain in one of the following ways. Way 1: from all candidate requirement templates of the specific domain, select the N2 candidate requirement templates that cover the most queries in the search log, and for each of these N2 templates select, from the queries it covers, the M1 queries with the highest query counts as seed queries, where N2 and M1 are positive integers. Way 2: combine preset slot keywords of the specific domain with preset qualifiers of the specific domain to generate the seed queries of the specific domain. Way 3: after selecting part of the seed queries using Way 1, use a preset slot-keyword dictionary of the specific domain to replace the slot keywords in the seed queries selected by Way 1 with other slot keywords from the dictionary, obtaining expanded seed queries; the partial seed queries and the expanded seed queries together constitute the seed queries of the specific domain.

[0026] According to a preferred embodiment of the present invention, when extracting the generalization-ability feature of a candidate requirement template W, the generalization-ability feature extraction unit determines the slot-keyword sequences corresponding to W, counts the number of distinct slot-keyword sequences among them, and computes the generalization-ability feature of W from that number, where one slot-keyword sequence of W is the sequence formed by the slot keywords in one query covered by W in the search log.

[0027] According to a preferred embodiment of the present invention, the boundary-word feature extraction unit comprises: a splitting unit, configured to split all candidate requirement templates of the specific domain into fragments; a positive/negative vector generation unit, configured to select positive fragments from the fragments produced by the splitting unit and determine the weight of each positive fragment to generate the positive vector of the specific domain, and to select negative fragments from the fragments and determine the weight of each negative fragment to generate the negative vector of the specific domain; a template vector generation unit, configured to determine, when the boundary-word feature of a candidate requirement template W is extracted, the weights of the fragments of W and to form the vector of W from the fragments of W and their weights; and a similarity computation unit, configured to compute the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and to obtain the boundary-word feature of W from the difference between S1 and S2.

[0028] According to a preferred embodiment of the present invention, the positive/negative vector generation unit comprises: a slot-keyword sequence determination unit, configured to determine the slot-keyword sequences corresponding to each fragment, where one slot-keyword sequence corresponding to a fragment is the sequence formed by the slot keywords in one query covered by one candidate requirement template that contains the fragment; and a positive/negative fragment selection unit, configured to select positive and negative fragments from the fragments and determine their weights as follows: T1. if all slot-keyword sequences corresponding to a fragment are identical, take the fragment as a negative fragment with weight 1; T2. if the slot-keyword sequences corresponding to a fragment are not all identical, but there is one slot-keyword sequence whose proportion P among all slot-keyword sequences of the fragment is greater than a preset first threshold, take the fragment as a negative fragment with weight P; T3. determine, for each candidate requirement template of the specific domain, the number of distinct slot-keyword sequences corresponding to it, and take the maximum of these numbers as Z1; if a fragment satisfies neither the condition in T1 nor the condition in T2, and the ratio of the number Z2 of distinct slot-keyword sequences corresponding to the fragment to Z1 is greater than a preset second threshold, take the fragment as a positive fragment whose weight is the ratio of Z2 to Z1.

[0029] According to a preferred embodiment of the present invention, when determining the weights of the fragments of W, the template vector generation unit counts the number of times each fragment of W occurs in W and takes that count as the weight of the corresponding fragment.

[0030] According to a preferred embodiment of the present invention, the ranking unit comprises: a standard template set selection unit, configured to select a standard template set from the candidate requirement templates; a training unit, configured to use the standard template set to train the parameter corresponding to each extracted feature, and to take as the weight of each feature the parameter value at which, during training, the rank of the standard-set templates among all candidate requirement templates cannot be improved further; and a computation and ranking unit, configured to compute a score for each candidate requirement template from the features extracted by the feature extraction unit and the feature weights obtained by the training unit, and to rank the candidate requirement templates by score.

[0031] According to a preferred embodiment of the present invention, the standard template set selection unit comprises: a template set determination unit, configured to rank, for each extracted feature, the candidate requirement templates by feature value and to take the top N3 candidate requirement templates as the template set of that feature, where N3 is a positive integer; and an intersection unit, configured to take the intersection of the template sets of all features as the standard template set.

[0032] According to a preferred embodiment of the present invention, the selection unit comprises: a first selection unit, configured to select the candidate requirement templates ranked in the top N4 as final requirement templates, where N4 is a positive integer; and a second selection unit, configured to obtain a keyword set from the boundary words of the candidate requirement templates ranked in the top M2, and to select as final requirement templates those candidate requirement templates ranked after the top N4 whose boundary words all belong to the keyword set, where a boundary word is a word in a candidate requirement template that has not been generalized, a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word meets a requirement, and M2 is a positive integer less than or equal to N4.

[0033] As the above technical solutions show, the present invention provides a general method for generating domain requirement templates. For any domain, the method can automatically mine candidate requirement templates and extract their features to assess their quality, so that high-quality requirement templates can be obtained from the candidates. The high-quality requirement templates obtained for the various domains help a search engine understand the purpose behind user behavior.

BRIEF DESCRIPTION OF THE DRAWINGS

[0034] FIG. 1 is a schematic flowchart of the method for generating domain requirement templates according to the present invention;

[0035] FIG. 2 is a schematic flowchart of an embodiment of obtaining candidate requirement templates according to the present invention;

[0036] FIG. 3 is a schematic diagram of using seed queries to obtain the data returned by a search engine according to the present invention;

[0037] FIG. 4 is a schematic structural block diagram of an embodiment of the apparatus for generating domain requirement templates according to the present invention;

[0038] FIG. 5 is a schematic structural block diagram of an embodiment of the similarity feature extraction unit according to the present invention;

[0039] FIG. 6 is a schematic structural block diagram of an embodiment of the boundary-word feature extraction unit according to the present invention;

[0040] FIG. 7 is a schematic structural block diagram of an embodiment of the standard template set selection unit according to the present invention.

DETAILED DESCRIPTION

[0041] To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments.

[0042] Refer to FIG. 1, which is a schematic flowchart of the method for generating domain requirement templates according to the present invention. As shown in FIG. 1, the method comprises:

[0043] Step S101: obtain candidate requirement templates of a specific domain.

[0044] Step S102: extract features of the candidate requirement templates.

[0045] Step S103: rank the candidate requirement templates using the extracted features.

[0046] Step S104: according to the ranking result, select final requirement templates from the candidate requirement templates as the requirement templates of the specific domain.

[0047] The above method is described in detail below through specific embodiments.

[0048] In the present invention, a specific domain is a scope that reflects a user's search purpose, such as the public transit domain or the weather domain. These domains reflect the purpose a user has when searching for information.

[0049] Refer to FIG. 2, which is a schematic flowchart of an embodiment of obtaining candidate requirement templates according to the present invention. In this embodiment, a domain qualifier dictionary and a slot keyword dictionary are used to process the user search requests (queries) in the user search log (query log), thereby generating candidate requirement templates.

[0050] The domain qualifier dictionary contains words related to the various domains; the qualifiers of a specific domain are the words related to that domain. In this embodiment, the qualifiers of the specific domain are used to filter queries when they are selected. Only queries that contain a qualifier of the specific domain are generalized, and the candidate requirement templates produced by that generalization belong to the specific domain. The words in the domain qualifier dictionary can be collected in the following ways:

[0051] First, domain seed words can be mined from user queries as domain qualifiers. The domain seed words can be configured manually or annotated manually in the search log.

[0052] Then, a thesaurus is consulted to obtain words synonymous with the domain seed words as domain qualifiers. In addition, mutual information, which measures how closely two words are associated, can be used to select words in the search log that are strongly associated with the seed words; these are also taken as domain qualifiers. Mutual information between words can be obtained from statistics over a large-scale corpus; as this is prior art, it is not described further here. Taking the public transit domain as an example, Table 1 gives some examples of domain qualifiers:

[0053] Table 1

[0054]

Figure CN102368260AD00121

[0055] Generating candidate requirement templates is a process of generalizing queries. Generalization means replacing the part of a user query that matches a slot keyword of the specific domain with a wildcard. Slot keywords are the words used for generalization; they are determined by looking up the slot keyword dictionary, which can be built by collecting various proper nouns.

[0056] For example, a query such as "Beijing Route 15 bus route" yields, after generalization, a requirement template such as "[city name] [bus line] bus route". Each "[]" symbol represents one slot of the template, meaning that the content at that position may be replaced as long as the wildcard's attribute requirement is satisfied. For example, the template above also matches "Shanghai Suburban Route 14 bus route".
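
The filtering and generalization steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary contents, the names `SLOT_KEYWORDS`, `QUALIFIERS`, and `generalize`, and the use of plain substring replacement are all assumptions made for the example.

```python
# Hypothetical slot keyword dictionary: surface form -> slot (wildcard) name.
SLOT_KEYWORDS = {
    "Beijing": "[city name]",
    "Shanghai": "[city name]",
    "Route 15": "[bus line]",
    "Suburban Route 14": "[bus line]",
}
# Hypothetical qualifiers for the public transit domain.
QUALIFIERS = {"bus", "transit"}


def generalize(query):
    """Return a candidate template, or None if no domain qualifier is present."""
    # Only queries containing a qualifier of the domain are generalized.
    if not any(q in query.lower() for q in QUALIFIERS):
        return None
    # Replace longer keywords first so that longer matches take precedence.
    for keyword in sorted(SLOT_KEYWORDS, key=len, reverse=True):
        query = query.replace(keyword, SLOT_KEYWORDS[keyword])
    return query
```

Under these assumptions, both example queries from the text generalize to the same template, while a query without a transit qualifier is discarded.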

[0057] After the candidate requirement templates are obtained, whether to filter them can be decided according to a slot-count requirement preset for the specific domain to which they belong. For example, in the train information query domain, the variable information in a query generally involves only an origin and a destination, so the predetermined slot count for templates in that domain can be set to 2. Any template that does not meet the predetermined slot count is filtered out, which reduces the complexity of subsequent processing of the candidate requirement templates.
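
The slot-count filter can be sketched as below. The sketch assumes that slots are written as bracketed names and that the requirement is an exact match on the slot count; the function name `filter_by_slot_count` is hypothetical.

```python
import re

# A slot is assumed to be written as a bracketed name, e.g. "[place name]".
SLOT_PATTERN = re.compile(r"\[[^\[\]]+\]")


def filter_by_slot_count(templates, required_slots):
    """Keep only templates whose number of slots equals the preset count."""
    return [t for t in templates if len(SLOT_PATTERN.findall(t)) == required_slots]
```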

[0058] In this embodiment, the features extracted in step S102 include at least one of the following:

[0059] a similarity feature, which describes how closely a candidate requirement template is tied to the specific domain; a generalization ability feature, which describes the ability of a candidate requirement template to cover user search queries; and a boundary word feature, which describes the influence that the non-generalized words in a candidate requirement template have on the template's correctness.

[0060] Embodiments of how these three features are computed are described below.

[0061] 1. Similarity feature

[0062] The similarity feature of a candidate requirement template W can be obtained by computing the cosine similarity between the core word vector of W and the core word vector of the specific domain to which W belongs, using formula (1):

[0063] sim_score = CosSimilarity(pattern_vector, seed_query_centroid) (1)

[0064] where sim_score denotes the similarity feature value of candidate requirement template W, pattern_vector denotes the core word vector of W, seed_query_centroid denotes the core word vector of the specific domain, and CosSimilarity denotes the cosine similarity function.
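
Formula (1) can be illustrated with the standard cosine similarity over sparse vectors. The sketch assumes each core word vector is stored as a dictionary mapping a core word to its weight, and that a zero vector has similarity 0; neither choice is stated in the source.

```python
import math


def cos_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {word: weight} dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: similarity with an empty vector is 0
    return dot / (norm_u * norm_v)
```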

[0065] A core word vector is a vector whose features are core words. Therefore, to compute the similarity feature, it must first be determined how core words are selected.

[0066] To determine the core words of a specific domain, seed queries of that domain can be used to obtain data returned by a search engine, and the core words are determined from that data. Refer to FIG. 3, which is a schematic diagram of using a seed query to obtain data returned by a search engine. As shown in FIG. 3, the seed query is "Beijing Route 15 bus route", and multiple search results can be obtained from the search engine for this seed query. The titles and text of these search results are preprocessed (sentence splitting, word segmentation, stop-word removal, and so on) to obtain a statistical corpus. For each word in the corpus, count the number of sentences in which the word appears and the number of sentences in which the word co-occurs with a search term, and also count the number of sentences that contain the search term. The search terms are the words obtained by segmenting the seed query.

[0067] With this information, the weight of each word can be computed using formula (2) below. Words whose weight exceeds a set threshold are taken as core words, and the weights of these core words are the weights of the corresponding vector features.

[0068]

$$\mathrm{Centrality}_{sch\_term}(w) = \frac{\log(Co(w, sch\_term) + 1) \times \log(idf(w) + 1)}{\log(sf(w) + 1) + \log(sf(sch\_term) + 1)} \qquad (2)$$

[0069] where Centrality_{sch_term}(w) denotes the weight of word w; Co(w, sch_term) denotes the number of sentences in which word w and search term sch_term co-occur; sf(sch_term) denotes the number of sentences containing search term sch_term; sf(w) denotes the number of sentences containing word w; and idf(w) denotes the inverse document frequency of word w, which can be obtained by looking up an inverse document frequency table compiled from a large-scale corpus.

[0070] The seed queries of a specific domain can be obtained in any of the following ways:

[0071] Implementation 1:

[0072] From the candidate requirement templates of the specific domain, select the N2 templates that cover the most queries in the search log. Then, for each of these N2 templates, select from the queries it covers the M1 most frequently issued queries as seed queries, where N2 and M1 are positive integers and, preferably, M1 equals 1. For example, Table 2 below lists candidate requirement templates of the public transit domain:

[0073] Table 2

Figure CN102368260AD00141

[0075] Assuming N2 = 2 and M1 = 1, Table 3 shows the seed queries obtained by Implementation 1 for the candidate requirement templates in Table 2, together with their corresponding candidate requirement templates.

[0076] Table 3

Figure CN102368260AD00142

[0078] In this implementation, the seed queries come from real user queries and therefore better represent user habits.
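
Implementation 1 can be sketched as follows. The input layout (a mapping from each template to the queries it covers and their query counts) and the function name `select_seed_queries` are assumptions for illustration; "covers the most queries" is interpreted here as the number of distinct covered queries.

```python
def select_seed_queries(template_to_queries, n2, m1=1):
    """template_to_queries maps template -> {query: number of times issued}."""
    # Top N2 templates by number of distinct covered queries.
    top_templates = sorted(
        template_to_queries,
        key=lambda t: len(template_to_queries[t]),
        reverse=True,
    )[:n2]
    seeds = []
    for template in top_templates:
        counts = template_to_queries[template]
        # The M1 most frequently issued queries covered by this template.
        best = sorted(counts, key=counts.get, reverse=True)[:m1]
        seeds.extend((query, template) for query in best)
    return seeds
```

Each returned pair keeps the seed query together with the template it came from, mirroring the layout described for Table 3.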

[0079] Implementation 2:

[0080] Combine slot keywords of the specific domain with qualifiers of the specific domain to generate seed queries.

[0081] Taking the generation of seed queries for the public transit domain as an example, refer to Table 4:

[0082] Table 4

Figure CN102368260AD00151

[0084] Seed queries generated in this way have a simple structure.

[0085] Preferably, Implementation 3 is used to obtain seed queries.

[0086] Implementation 3:

[0087] Select some seed queries using the method of Implementation 1, then use the slot keyword dictionary to replace the slot keywords in the selected seed queries with other slot keywords of the specific domain, obtaining expanded seed queries.

[0088] For example, Table 5 shows seed queries obtained by Implementation 3.

[0089] Table 5

[0090]

Figure CN102368260AD00152

[0091] The above process yields the core word vector of the specific domain. The process of obtaining the core word vector of a candidate requirement template is described below.

[0092] First, as with the core word vector of the specific domain, a statistical corpus must be obtained. From the queries that the candidate requirement template covers in the search log, the N1 most frequently issued queries are selected as the queries to be searched, where N1 is a positive integer. These queries are then used to retrieve search results from the search engine, and the titles and text of the results are preprocessed to obtain the statistical corpus.

[0093] In the resulting corpus, count the frequency with which each word occurs and compute each word's weight according to formula (3) below. Words whose weight exceeds a set threshold are taken as core words of the candidate requirement template, and a core word's weight is the weight of the corresponding vector feature.

[0094] Weight(w) = log(tf(w) + 1) × log(idf(w) + 1) (3)

[0095] where Weight(w) denotes the weight of word w, tf(w) denotes the frequency of word w, and idf(w) denotes the inverse document frequency of word w, which can be obtained by looking up an inverse document frequency table compiled from a large-scale corpus.
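
A minimal sketch of formula (3) and the threshold step follows. It assumes the corpus has already been preprocessed into a flat list of tokens, that the logarithm is the natural logarithm (the source does not state the base), and that words missing from the idf table get idf 0; the name `core_word_vector` is hypothetical.

```python
import math
from collections import Counter


def core_word_vector(tokens, idf, threshold):
    """Return {core word: weight} for words whose formula (3) weight exceeds threshold."""
    tf = Counter(tokens)
    weights = {
        w: math.log(tf[w] + 1) * math.log(idf.get(w, 0.0) + 1)
        for w in tf
    }
    return {w: x for w, x in weights.items() if x > threshold}
```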

[0096] Once the core word vector of the candidate requirement template and the core word vector of the specific domain are obtained, the similarity feature of the candidate requirement template can be computed according to formula (1).

[0097] 2. Generalization ability feature

[0098] The generalization ability feature can be measured by the number of distinct slot keyword sequences among the slot keyword sequences corresponding to a candidate requirement template. A slot keyword sequence corresponding to a candidate requirement template is the sequence formed by the slot keywords in one query that the template covers in the search log.

[0099] For example, suppose the template "[city name] [bus line] bus route" covers the queries "Beijing Route 15 bus route", "Shanghai Suburban Route 14 bus route", "Shenyang Tiexi Line 2 bus route", and "Beijing Route 15 bus route map query". The slot keyword sequences are then "Beijing Route 15", "Shanghai Suburban Route 14", "Shenyang Tiexi Line 2", and "Beijing Route 15", of which the distinct ones are "Beijing Route 15", "Shanghai Suburban Route 14", and "Shenyang Tiexi Line 2". The generalization ability feature value of this template is therefore 3.

[0100] Preferably, the generalization ability feature is computed as follows. First, determine the number of distinct slot keyword sequences for each candidate requirement template of the specific domain, and the maximum of these numbers. Then compute the generalization ability feature value of each candidate requirement template according to formula (4):

[0101] general_score_i = log(pattern_dif_query_i + 1) / log(max_dif_query + 1) (4)

[0102] where general_score_i denotes the generalization ability feature value of candidate requirement template i, pattern_dif_query_i denotes the number of distinct slot keyword sequences corresponding to template i, and max_dif_query denotes the maximum, over all candidate requirement templates of the specific domain to which template i belongs, of the number of distinct slot keyword sequences.
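
Formula (4) can be sketched as below. The representation of a slot keyword sequence as a tuple of strings and the function name `generalization_scores` are assumptions for the example; the input is assumed to be non-empty.

```python
import math


def generalization_scores(template_to_sequences):
    """template_to_sequences maps template -> list of slot keyword sequences (tuples)."""
    # Number of distinct slot keyword sequences per template.
    distinct = {t: len(set(seqs)) for t, seqs in template_to_sequences.items()}
    max_distinct = max(distinct.values())
    if max_distinct == 0:
        return {t: 0.0 for t in distinct}
    return {
        t: math.log(n + 1) / math.log(max_distinct + 1)
        for t, n in distinct.items()
    }
```

With this normalization the template with the most distinct sequences scores 1, and all other templates score between 0 and 1.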

[0103] 3. Boundary word feature

[0104] Boundary words are the words in a candidate requirement template that have not been generalized. They affect the correctness of the templates that are finally generated. For example, in the public transit domain, a requirement template such as "[city name] [bus line] bus route" clearly reflects the needs of the domain better than a template such as "transit card broke what to do [city name]".

[0105] In the present invention, the boundary word feature of a candidate requirement template W is computed by formula (5) below.

[0106]-[0108] boundary_word_score = CosSimilarity(pattern_centroid, positive_centroid) - CosSimilarity(pattern_centroid, negative_centroid) (5)

[0109] where boundary_word_score is the boundary word feature of candidate requirement template W, CosSimilarity is the cosine similarity function, pattern_centroid is the vector formed from W, positive_centroid is the positive vector of the specific domain, and negative_centroid is the negative vector of the specific domain.

[0110] How to obtain the value of each variable in the formula is described below.

[0111] The process of generating the positive and negative vectors of a specific domain comprises:

[0112] Segment all candidate requirement templates of the specific domain into n-grams (n > 1), preferably with n = 2, to obtain the segments. An n-gram is a combination of n words appearing in order, each word being a smallest unit capable of semantic expression, and n is a preset positive integer. For example, for the template "[city name] [bus line] bus route", assume its smallest semantic units are "[city name]", "[bus line]", and "bus route"; the 2-gram segments of the template are then "[city name] [bus line]" and "[bus line] bus route". Likewise, for the template "transit card broke what to do [city name]", assume its smallest semantic units are "transit card", "broke", "what to do", and "[city name]"; the 2-gram segments of the template are then "transit card broke", "broke what to do", and "what to do [city name]".
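
The n-gram segmentation can be sketched as follows, assuming each template has already been split into its smallest semantic units (the word segmentation itself is outside the scope of this example). The function name `ngram_segments` is hypothetical.

```python
def ngram_segments(units, n=2):
    """Return the n-grams over a template's pre-segmented units, in order."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]
```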

[0113] Select positive segments and negative segments from the segments, where each positive segment is one vector feature of the positive vector and each negative segment is one vector feature of the negative vector, and determine the weight of each vector feature. The process comprises:

[0114] A. Determine the slot keyword sequences corresponding to each segment. A slot keyword sequence of a segment is the sequence formed by the slot keywords in one query covered by a candidate requirement template that contains the segment.

[0115] For example, for the segment "[city name] transit", the candidate requirement templates that contain the segment and the queries they cover are shown in Table 6:

[0116] Table 6

[0117]

Figure CN102368260AD00171

[0118] Then, for the segment "[city name] transit", the slot keyword sequences are "Beijing Route 15", "Shanghai Route 36", "Beijing Route 15", and "Hangzhou".

[0119] B. Select positive vector features and negative vector features from the segments, and determine the weight of each vector feature, as follows:

[0120] (1) If all slot keyword sequences of a segment are identical, the segment is taken as a negative vector feature, and the weight of that negative vector feature is 1.

[0121] (2) If the slot keyword sequences of a segment are not all identical, but one slot keyword sequence accounts for a proportion P of all the segment's slot keyword sequences that exceeds a preset first threshold, the segment is taken as a negative vector feature, and the weight of that vector feature is the proportion P. Preferably, the first threshold is 90%.

[0122] (3) Determine the number of distinct slot keyword sequences corresponding to each candidate requirement template of the specific domain, and take the maximum Z1 of these numbers. If a segment falls under neither of the two cases above, and the ratio of its number of distinct slot keyword sequences Z2 to Z1 exceeds a preset second threshold, the segment is taken as a positive vector feature, and the weight of that positive vector feature is the ratio of Z2 to Z1. Preferably, the second threshold is 1%.

[0123] For example, for the segment "[city name] transit" above, the distinct slot keyword sequences are "Beijing Route 15", "Shanghai Route 36", and "Hangzhou", so the number of distinct slot keyword sequences is 3. "Beijing Route 15" accounts for 2/4 of all slot keyword sequences, "Shanghai Route 36" for 1/4, and "Hangzhou" for 1/4, so the segment falls under neither case (1) nor case (2) and is not a negative vector feature. Assuming that the maximum number of distinct slot keyword sequences over the candidate requirement templates of the specific domain is 10 and that the second threshold is 1%, then since 3/10 is greater than 1%, the segment is taken as a positive vector feature.
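
Rules (1) to (3) can be sketched as a single classification function. The function name `classify_segment`, the return convention, and the handling of an empty input are assumptions; the default thresholds are the preferred values given in the text.

```python
from collections import Counter


def classify_segment(sequences, z1, first_threshold=0.9, second_threshold=0.01):
    """Return ("negative" | "positive" | None, weight) for one segment.

    sequences: slot keyword sequences of the segment, one per covered query.
    z1: maximum number of distinct slot keyword sequences over all templates.
    """
    if not sequences:
        return None, 0.0
    counts = Counter(sequences)
    if len(counts) == 1:                 # case (1): all sequences identical
        return "negative", 1.0
    p = max(counts.values()) / len(sequences)
    if p > first_threshold:              # case (2): one sequence dominates
        return "negative", p
    ratio = len(counts) / z1             # case (3): Z2 / Z1
    if ratio > second_threshold:
        return "positive", ratio
    return None, 0.0
```

Applied to the worked example above (four sequences, three distinct, Z1 = 10), the sketch classifies the segment as a positive feature with weight 3/10.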

[0124] Taking the templates shown in Table 2 as an example, the positive vector and negative vector obtained in the above manner are shown in Table 7 and Table 8, respectively:

[0125] Table 7

[0126]

Figure CN102368260AD00172
Figure CN102368260AD00181

[0127] Table 8

[0128]

Figure CN102368260AD00182

[0129] The vector features of the vector formed from candidate requirement template W are the segments of W, where the segmentation is performed in a manner similar to that described for the positive and negative vectors. The weight of a feature can be determined by the number of times the corresponding segment appears in W.

[0130] For example, the template "[city name] [bus line] bus route" contains the segments "[city name] [bus line]" and "[bus line] bus route". Since each of these segments appears once in the template, the feature weights of the vector features "[city name] [bus line]" and "[bus line] bus route" for this template are both 1. If a template is "[city name] [bus line] [city name] [bus line]", then the feature weight of its vector feature "[city name] [bus line]" is 2.

[0131] The way in which the feature weights of a candidate requirement template's vector features are determined is not unique. Besides using the number of times a segment appears in the template as the weight of the corresponding vector feature, Boolean values may also be used. The way feature weights are computed is not limited here.
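
Formula (5) can be sketched by combining occurrence-count weights for the template's segments with cosine similarity against the positive and negative vectors. The data layout (segments as tuples, vectors as dictionaries) and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def cos_similarity(u, v):
    """Cosine similarity between two sparse {feature: weight} vectors."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def boundary_word_score(template_segments, positive, negative):
    """Formula (5): similarity to the positive vector minus similarity to the negative one."""
    pattern = Counter(template_segments)  # feature weight = occurrence count
    return cos_similarity(pattern, positive) - cos_similarity(pattern, negative)
```

A template whose segments resemble the positive vector gets a score above 0, and one whose segments resemble the negative vector gets a score below 0.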

[0132] Taking the candidate requirement templates shown in Table 2 as an example, the boundary word feature of each candidate requirement template is shown in Table 9:

[0133] Table 9

[0134]

Figure CN102368260AD00191

[0135] In step S103, the ranking process comprises:

[0136] 1. Select a standard template set from the candidate requirement templates, comprising:

[0137] For each extracted feature, rank the candidate requirement templates by feature value, and take the candidate requirement templates ranked in the top N3 positions for that feature as the feature's template set, where N3 is a positive integer.

[0138] Take the intersection of the template sets of all the features, and use this intersection as the standard template set.

[0139] For example, ranking candidate requirement templates S1 to S10 by features 1, 2, and 3 yields Table 10:

[0140] Table 10

[0141]

Figure CN102368260AD00201

[0142] If N3 = 5, the template set of feature 1 is {S5, S6, S4, S2, S1}, the template set of feature 2 is {S4, S5, S2, S8, S1}, and the template set of feature 3 is {S2, S10, S5, S6, S1}. The intersection of the features' template sets is therefore {S1, S2, S5}.
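
Selecting the standard template set can be sketched as a top-N3 cut per feature followed by a set intersection. The input layout and the function name `standard_template_set` are assumptions for the example.

```python
def standard_template_set(feature_values, n3):
    """feature_values maps feature name -> {template: feature value}."""
    top_sets = []
    for values in feature_values.values():
        # Rank templates by this feature, highest value first, and keep the top N3.
        ranked = sorted(values, key=values.get, reverse=True)
        top_sets.append(set(ranked[:n3]))
    return set.intersection(*top_sets) if top_sets else set()
```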

[0143] 2. Use the standard template set to train the parameters corresponding to the extracted features. The parameter values reached in training when the rank of the standard-set templates among all candidate requirement templates can no longer be improved are taken as the weights of the corresponding features.

[0144] Formula (6) gives the score of each candidate requirement template when all candidate requirement templates are ranked on the basis of all extracted features. A higher score indicates a better-quality candidate requirement template and therefore a higher rank.

[0145] total_score = λ1 × sim_score + λ2 × general_score + λ3 × boundary_word_score (6)

[0146] where sim_score, general_score, and boundary_word_score are the values of the similarity feature, the generalization ability feature, and the boundary word feature, respectively, and λ1, λ2, and λ3 are the parameters to be trained, representing the weight of each feature.

[0147] The parameters are trained by gradient descent. Through successive iterations, the parameter values are adjusted continually so that the templates in the standard template set rank as high as possible, until their ranking among all candidate requirement templates no longer improves. The parameter values at that point are the weights of the corresponding features.

[0148] 3. Compute the score of each candidate requirement template using the extracted features and their weights, and rank the candidate requirement templates by this score. That is, the score of each candidate requirement template is computed using formula (6), where λ1, λ2, and λ3 in formula (6) are the trained feature weights.

[0149] Once the scores of the candidate requirement templates have been computed in this way, the candidate requirement templates can be ranked in descending order of score.
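
The scoring and ranking of formula (6) can be sketched as below, assuming the weights λ1, λ2, and λ3 have already been trained. The training procedure itself is not shown, and the function name `rank_templates` and the input layout are hypothetical.

```python
def rank_templates(features, weights):
    """Rank templates by formula (6).

    features maps template -> (sim_score, general_score, boundary_word_score).
    weights is the tuple (lambda1, lambda2, lambda3) of trained feature weights.
    """
    def total_score(template):
        return sum(w * x for w, x in zip(weights, features[template]))

    # Descending order of total_score: the best template comes first.
    return sorted(features, key=total_score, reverse=True)
```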

[0150] When final requirement templates are selected in step S104, in addition to taking the candidate requirement templates ranked in the top N4 positions as final requirement templates, the boundary words of the candidate requirement templates ranked in the top M2 positions are also used to select final requirement templates from the candidates ranked after the top N4 positions, where M2 and N4 are positive integers and M2 ≤ N4.

[0151] Specifically:

[0152] Using a keyword dictionary, obtain the keyword set corresponding to the boundary words of the candidate requirement templates ranked in the top M2 positions, where a keyword is a word that is synonymous with a boundary word or whose mutual information with a boundary word meets a requirement;

[0153] Take as final requirement templates those candidate requirement templates ranked after the top N4 positions whose boundary words all belong to the keyword set.

[0154] Suppose the templates ranked within the top M2 are: [城市名][公交路线]公交车路线 ("[city name][bus route] bus route"), [地点名]到[地点名]的公交车 ("bus from [place name] to [place name]") and [城市名]公交[公交路线] ("[city name] bus [bus route]"), whose boundary words include "公交车路线" (bus route), "到" (to), "公交车" (bus) and "的" (of). Through the keyword dictionary, the keyword set corresponding to these boundary words is obtained as "公交/工交/工交车/公车/公共交通/公共交通线路/公共汽车/公交/公交车/公交联营车/公交路线/公交汽车/公交线/公交线路/公汽/共交/市区公交/公交车线路/的/到/到达" (variants of "bus", "public transport" and "bus route", plus "of", "to" and "arrive"). Then, for the template "到[地点名]公交车路线" ("bus route to [place name]"), which is ranked below the top N4, both of its boundary words, "到" and "公交车路线", are in the keyword set, so this template can also be selected as a final template. The keywords in the keyword dictionary can be obtained by various existing techniques, such as synonym mining or mutual information computation, which are not described in detail here.
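The selection rule of step S104 can be sketched as below. This is an illustration under stated assumptions: templates are space-tokenized English strings with slots written in square brackets, the boundary words are the tokens outside the slots, and `keyword_dict` is a hypothetical dictionary mapping a boundary word to its synonyms or high-mutual-information words.

```python
import re

SLOT = re.compile(r"\[[^\]]*\]")


def boundary_words(template):
    """Tokens of the template that were not generalized into slots."""
    return SLOT.sub(" ", template).split()


def select_final(ranked, n4, m2, keyword_dict):
    """ranked: candidate templates, best first. Requires m2 <= n4."""
    if m2 > n4:
        raise ValueError("M2 must not exceed N4")
    final = list(ranked[:n4])

    # Keyword set built from the boundary words of the top-M2 templates.
    keywords = set()
    for template in ranked[:m2]:
        for word in boundary_words(template):
            keywords.add(word)
            keywords.update(keyword_dict.get(word, ()))

    # Rescue lower-ranked templates whose boundary words are all keywords.
    for template in ranked[n4:]:
        words = boundary_words(template)
        if words and all(w in keywords for w in words):
            final.append(template)
    return final


ranked = [
    "[city] [route] bus-route",
    "[place] to [place] bus",
    "[city] bus [route]",
    "to [place] bus-route",
    "[place] weather",
]
selected = select_final(ranked, n4=3, m2=3, keyword_dict={"bus": ["coach"]})
```

Here "to [place] bus-route" is rescued because both of its boundary words occur in the keyword set, while "[place] weather" is not.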

[0155] Refer to FIG. 4, a schematic block diagram of an embodiment of the apparatus for generating domain templates according to the invention. As shown in FIG. 4, the apparatus comprises a candidate requirement template acquisition unit 201, a feature extraction unit 202, a sorting unit 203 and a selection unit 204.

[0156] The candidate requirement template acquisition unit 201 obtains the candidate requirement templates of a specific domain. Preferably, it comprises a restriction unit 2011 and a generalization unit 2012.

[0157] The restriction unit 2011 selects, from the search log, the user search requests (queries) that match a preset qualifier of the specific domain, where a qualifier of a specific domain is a word related to that domain. The generalization unit 2012 replaces the parts of the selected queries that match preset slot keywords of the specific domain with wildcards to obtain candidate requirement templates, where the slot keywords of a specific domain are the words in that domain that are used for generalization.

[0158] Further, the candidate requirement template acquisition unit 201 may also comprise a filtering unit which, according to a preset requirement on the number of slots for the specific domain, filters out of the candidate requirement templates produced by the generalization unit those that do not satisfy the slot-count requirement.
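The three sub-units of the acquisition unit can be sketched as one small pipeline. The sketch assumes space-tokenized queries, single-token slot keywords, and a slot-count requirement expressed as a minimum and maximum; all names and sample data are hypothetical.

```python
def generalize(query, slot_keywords):
    """Replace each token that is a slot keyword with its wildcard, e.g. [city]."""
    return " ".join(
        f"[{slot_keywords[tok]}]" if tok in slot_keywords else tok
        for tok in query.split()
    )


def candidate_templates(queries, qualifiers, slot_keywords, min_slots=1, max_slots=2):
    """Restriction, generalization and filtering applied in sequence."""
    candidates = set()
    for query in queries:
        tokens = query.split()
        # Restriction unit: keep only queries containing a domain qualifier.
        if not any(tok in qualifiers for tok in tokens):
            continue
        # Generalization unit: slot keywords become wildcards.
        template = generalize(query, slot_keywords)
        # Filtering unit: enforce the slot-count requirement (assumed to be a range).
        n_slots = sum(1 for tok in tokens if tok in slot_keywords)
        if min_slots <= n_slots <= max_slots:
            candidates.add(template)
    return candidates


queries = ["beijing 300 bus route", "beijing weather", "bus timetable"]
qualifiers = {"bus"}
slot_keywords = {"beijing": "city", "300": "route"}
templates = candidate_templates(queries, qualifiers, slot_keywords)
```

Only the first query survives: the second lacks a qualifier, and the third yields a template with no slots.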

[0159] The feature extraction unit 202 extracts features of the candidate requirement templates. Preferably, it comprises at least one of a similarity feature extraction unit 2021, a generalization capability feature extraction unit 2022 and a boundary word feature extraction unit 2023.

[0160] The similarity feature extraction unit 2021 extracts the similarity feature of a candidate requirement template, which describes how closely the candidate requirement template is related to the specific domain. Refer to FIG. 5, a schematic block diagram of an embodiment of the similarity feature extraction unit according to the invention. As shown in FIG. 5, the similarity feature extraction unit 2021 comprises a template word vector generation unit 2021_1, a domain word vector generation unit 2021_2 and a calculation unit 2021_3.

[0161] When the similarity feature of a candidate requirement template W is extracted, the template word vector generation unit 2021_1 obtains the core word vector of W.

[0162] The domain word vector generation unit 2021_2 obtains the core word vector of the specific domain.

[0163] The calculation unit 2021_3 computes the similarity between the core word vector of the candidate requirement template and the core word vector of the specific domain, and takes that similarity as the similarity feature of W.

[0164] Preferably, to obtain the core word vector of W, the template word vector generation unit 2021_1 selects the N1 most frequently issued queries from the queries that W covers in the search log, and determines core words and their weights from the search results that the search engine returns for these N1 queries, thereby forming the core word vector of W, where N1 is any positive integer.
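The similarity between two core word vectors can be illustrated as follows. This passage does not name the similarity measure, so cosine similarity is an assumption here; each vector is represented as a hypothetical dictionary from core word to weight.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {word: weight} dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical core word vectors for a template W and for the domain.
template_vector = {"bus": 0.8, "route": 0.6}
domain_vector = {"bus": 0.6, "route": 0.8, "station": 0.1}
similarity_feature = cosine(template_vector, domain_vector)
```

A template whose core words overlap heavily with the domain's core words gets a value close to 1; a template with no shared core words gets 0.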

[0165] The ways in which the domain word vector generation unit 2021_2 obtains the seed queries of the specific domain include:

[0166] Approach 1: from all candidate requirement templates of the specific domain, select the N2 candidates that cover the most queries in the search log and, for each of these N2 candidates, select the M1 most frequently issued queries among the queries it covers as seed queries, where N2 and M1 are positive integers.

[0167] Approach 2: combine the preset slot keywords of the specific domain with the preset qualifiers of the specific domain to generate the seed queries of the specific domain.

[0168] Approach 3: after some seed queries have been selected by approach 1, use the preset slot keyword dictionary of the specific domain to replace the slot keywords in those seed queries with other slot keywords from the dictionary, giving expanded seed queries. The seed queries selected by approach 1 together with the expanded seed queries make up the seed queries of the specific domain.

[0169] Preferably, the domain word vector generation unit 2021_2 uses approach 3 to obtain the seed queries of the specific domain.
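Approach 3 can be sketched as a simple substitution over a slot keyword dictionary. The sketch assumes space-tokenized queries and replaces one slot keyword at a time; the dictionary layout (slot name to list of keywords) and the sample data are hypothetical.

```python
def expand_seed_queries(seed_queries, slot_dictionary):
    """Return the original seeds plus variants with one slot keyword swapped."""
    keyword_to_slot = {
        keyword: slot
        for slot, keywords in slot_dictionary.items()
        for keyword in keywords
    }
    expanded = set()
    for query in seed_queries:
        tokens = query.split()
        for i, token in enumerate(tokens):
            slot = keyword_to_slot.get(token)
            if slot is None:
                continue  # not a slot keyword, leave it alone
            for other in slot_dictionary[slot]:
                if other != token:
                    expanded.add(" ".join(tokens[:i] + [other] + tokens[i + 1:]))
    return set(seed_queries) | expanded


seeds = ["beijing bus route"]
slot_dictionary = {"city": ["beijing", "shanghai", "guangzhou"]}
all_seeds = expand_seed_queries(seeds, slot_dictionary)
```

The result contains the approach 1 seed and one expanded query per alternative city.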

[0170] Refer again to FIG. 4. The generalization capability feature extraction unit 2022 extracts the generalization capability feature of a candidate requirement template, which describes the ability of the candidate requirement template to cover user search requests (queries).

[0171] Preferably, to extract the generalization capability feature of a candidate requirement template W, the generalization capability feature extraction unit 2022 determines the slot keyword sequences corresponding to W, counts the number of distinct slot keyword sequences among them, and computes the generalization capability feature of W from that number. A slot keyword sequence corresponding to W is the sequence formed by the slot keywords in one query that W covers in the search log.
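The counting step can be sketched as below. The sketch assumes space-tokenized queries and a known set of slot keywords. The passage does not state how the feature is derived from the count, so the function returns the raw number of distinct sequences.

```python
def slot_keyword_sequence(query, slot_keywords):
    """Slot keywords of one covered query, in order of appearance."""
    return tuple(tok for tok in query.split() if tok in slot_keywords)


def distinct_sequence_count(covered_queries, slot_keywords):
    """Number of distinct slot keyword sequences among the queries W covers."""
    return len({slot_keyword_sequence(q, slot_keywords) for q in covered_queries})


slot_keywords = {"beijing", "shanghai", "300", "11"}
covered = [
    "beijing 300 bus route",
    "shanghai 11 bus route",
    "beijing 300 bus route",  # repeated query, same sequence
]
count = distinct_sequence_count(covered, slot_keywords)
```

A template that covers many different slot fillings yields a larger count than one that only ever matches a single filling.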

[0172] The boundary word feature extraction unit 2023 extracts the boundary word feature of a candidate requirement template, which describes the effect that the non-generalized words in the candidate requirement template have on the correctness of that template.

[0173] Refer to FIG. 6, a schematic block diagram of an embodiment of the boundary word feature extraction unit according to the invention. As shown in FIG. 6, this embodiment comprises a segmentation unit 2023_1, a positive and negative vector generation unit 2023_2, a template vector generation unit 2023_3 and a similarity calculation unit 2023_4.

[0174] The segmentation unit 2023_1 splits all candidate requirement templates of the specific domain into segments.

[0175] The positive and negative vector generation unit 2023_2 selects positive segments from the segments produced by the segmentation unit 2023_1 and determines their weights to generate the positive vector of the specific domain, and selects negative segments from those segments and determines their weights to generate the negative vector of the specific domain. Preferably, the positive and negative vector generation unit 2023_2 comprises a slot keyword sequence determination unit 2023_21 and a positive and negative segment selection unit 2023_22.

[0176] The slot keyword sequence determination unit 2023_21 determines the slot keyword sequences corresponding to each segment, where a slot keyword sequence corresponding to a segment is the sequence formed by the slot keywords in one query covered by a candidate requirement template that contains the segment.

[0177] The positive and negative segment selection unit 2023_22 selects positive and negative segments from the segments, and determines their weights, as follows:

[0178] (1) If all slot keyword sequences corresponding to a segment are identical, the segment is taken as a negative segment with weight 1;

[0179] (2) If the slot keyword sequences corresponding to a segment are not all identical, but one slot keyword sequence accounts for a proportion P of all the segment's slot keyword sequences that exceeds a preset first threshold, the segment is taken as a negative segment with weight P;

[0180] (3) Determine, for each candidate requirement template of the specific domain, the number of distinct slot keyword sequences corresponding to it, and let Z1 be the maximum of these numbers. If a segment satisfies neither condition (1) nor condition (2), and the ratio of the number Z2 of distinct slot keyword sequences corresponding to the segment to Z1 exceeds a preset second threshold, the segment is taken as a positive segment whose weight is the ratio of Z2 to Z1.
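Rules (1) to (3) can be sketched as one classification function. The input layout is an assumption: `segment_sequences` maps each segment to the list of slot keyword sequences observed for it (one per covered query), and `template_distinct_counts` holds the per-template counts from which Z1 is taken. Threshold values and sample data are hypothetical.

```python
from collections import Counter


def classify_segments(segment_sequences, template_distinct_counts,
                      first_threshold, second_threshold):
    """Return (positive, negative) dicts mapping segment -> weight."""
    z1 = max(template_distinct_counts)
    positive, negative = {}, {}
    for segment, sequences in segment_sequences.items():
        if not sequences:
            continue
        counts = Counter(sequences)
        z2 = len(counts)  # number of distinct sequences for this segment
        p = counts.most_common(1)[0][1] / len(sequences)
        if z2 == 1:
            negative[segment] = 1.0        # rule (1)
        elif p > first_threshold:
            negative[segment] = p          # rule (2)
        elif z2 / z1 > second_threshold:
            positive[segment] = z2 / z1    # rule (3)
    return positive, negative


segment_sequences = {
    "bus route": [("bj", "300"), ("sh", "11"), ("gz", "5"), ("bj", "1")],
    "hotel": [("bj",)] * 3,
    "ticket": [("bj",)] * 9 + [("sh",)],
}
positive, negative = classify_segments(
    segment_sequences, template_distinct_counts=[4, 5],
    first_threshold=0.8, second_threshold=0.5,
)
```

In this example "bus route" appears with four different slot fillings and becomes a positive segment, while "hotel" and "ticket" are dominated by a single filling and become negative segments.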

[0181] When the boundary word feature of a candidate requirement template W is extracted, the template vector generation unit 2023_3 determines the weights of the segments of W and forms the vector of W from those segments and their weights. Preferably, to determine the weights, the template vector generation unit 2023_3 counts the number of times each segment of W occurs in W and takes that count as the weight of the corresponding segment.

[0182] The similarity calculation unit 2023_4 computes the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtains the boundary word feature of W from the difference between S1 and S2.
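The template vector and the resulting boundary word feature can be sketched as follows. The similarity measure is not specified in this passage, so cosine similarity is an assumption, and the feature is taken to be S1 minus S2 directly. The segment lists and vectors are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {segment: weight}."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def boundary_word_feature(template_segments, positive_vector, negative_vector):
    """S1 - S2, where the template vector weights each segment by its count in W."""
    template_vector = Counter(template_segments)
    s1 = cosine(template_vector, positive_vector)
    s2 = cosine(template_vector, negative_vector)
    return s1 - s2


positive_vector = {"bus route": 0.8}
negative_vector = {"hotel": 1.0, "ticket": 0.9}
good = boundary_word_feature(["bus route"], positive_vector, negative_vector)
bad = boundary_word_feature(["hotel"], positive_vector, negative_vector)
```

A template built from positive segments gets a feature value near 1, and one built from negative segments gets a negative value.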

[0183] Refer again to FIG. 4. The sorting unit 203 sorts the candidate requirement templates using the features extracted by the feature extraction unit 202. The sorting unit 203 comprises a standard template set selection unit 2031, a training unit 2032 and a calculation and sorting unit 2033.

[0184] The standard template set selection unit 2031 selects a standard template set from the candidate requirement templates. Refer to FIG. 7, a schematic block diagram of an embodiment of the standard template set selection unit according to the invention. As shown in FIG. 7, the standard template set selection unit 2031 comprises a template set determination unit 2031_1 and an intersection unit 2031_2. The template set determination unit 2031_1 sorts the candidate requirement templates by feature value separately for each extracted feature and, for each feature, takes the candidates ranked in the top N3 as the template set of that feature, where N3 is a positive integer. The intersection unit 2031_2 takes the intersection of the template sets of all features as the standard template set.
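The selection of the standard template set can be sketched as a top-N3 cut per feature followed by a set intersection. The sketch assumes that a larger feature value ranks higher; the feature names and values are hypothetical.

```python
def standard_template_set(feature_values, n3):
    """feature_values maps feature name -> {template: value}.

    Returns the templates that are in the top N3 for every feature.
    """
    top_sets = []
    for values in feature_values.values():
        ranked = sorted(values, key=values.get, reverse=True)
        top_sets.append(set(ranked[:n3]))
    return set.intersection(*top_sets) if top_sets else set()


feature_values = {
    "similarity": {"t1": 0.9, "t2": 0.8, "t3": 0.1},
    "generalization": {"t1": 0.7, "t2": 0.2, "t3": 0.6},
    "boundary": {"t1": 0.5, "t2": 0.4, "t3": 0.3},
}
standard_set = standard_template_set(feature_values, n3=2)
```

Only "t1" is in the top two for all three features, so it alone forms the standard template set in this example.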

[0185] Refer again to FIG. 4. The training unit 2032 uses the standard template set to train the parameters corresponding to the extracted features, and takes as the weights of the corresponding features the parameter values reached when the templates in the standard template set can no longer be ranked any higher among all candidate requirement templates.

[0186] The calculation and sorting unit 2033 computes a score for each candidate requirement template from the features extracted by the feature extraction unit 202 and the feature weights obtained by the training unit 2032, and sorts the candidate requirement templates by that score. Preferably, the candidate requirement templates are sorted in descending order of score.

[0187] The selection unit 204 selects, according to the sorting result of the sorting unit 203, final requirement templates from the candidate requirement templates as the requirement templates of the specific domain. Preferably, the selection unit 204 comprises a first selection unit 2041 and a second selection unit 2042. The first selection unit 2041 selects the candidate requirement templates ranked in the top N4 as final requirement templates, where N4 is a positive integer. The second selection unit 2042 obtains a keyword set from the boundary words of the candidate requirement templates ranked in the top M2, and selects as final requirement templates those candidates ranked below the top N4 whose boundary words all belong to the keyword set, where a boundary word is a non-generalized word in a candidate requirement template, a keyword is a word synonymous with a boundary word or a word whose mutual information with a boundary word meets a requirement, and M2 is a positive integer less than or equal to N4.

[0188] The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (28)

1. A method for generating domain requirement templates, characterized in that the method comprises: A. obtaining candidate requirement templates of a specific domain; B. extracting features of the candidate requirement templates, the features comprising at least one of: a similarity feature characterizing how closely a candidate requirement template is related to the specific domain, a generalization capability feature characterizing the ability of a candidate requirement template to cover user search requests (queries), and a boundary word feature characterizing the effect that the non-generalized words in a candidate requirement template have on the correctness of that template; C. sorting the candidate requirement templates using the extracted features; D. selecting, according to the sorting result, final requirement templates from the candidate requirement templates as the requirement templates of the specific domain.
2. The method according to claim 1, characterized in that step A comprises: A1. selecting, from the search log, the user queries that match a preset qualifier of the specific domain; A2. replacing the parts of the selected queries that match preset slot keywords of the specific domain with wildcards to obtain candidate requirement templates.
3. The method according to claim 2, characterized in that, after step A2, the method further comprises: according to a preset requirement on the number of slots for the specific domain, filtering out of the candidate requirement templates obtained in step A2 those that do not satisfy the slot-count requirement.
4. The method according to claim 1, characterized in that the step of extracting the similarity feature of a candidate requirement template W comprises: obtaining the core word vector of W and the core word vector of the specific domain; computing the similarity between the core word vector of W and the core word vector of the specific domain, and taking that similarity as the similarity feature of W.
5. The method according to claim 4, characterized in that the step of obtaining the core word vector of W comprises: selecting the N1 most frequently issued queries from the queries that W covers in the search log, and determining core words and their weights from the search results that the search engine returns for these N1 queries to form the core word vector of W, where N1 is a positive integer.
6. The method according to claim 4, characterized in that the step of obtaining the core word vector of the specific domain comprises: using seed queries of the specific domain to obtain the search results returned by the search engine, and determining core words and their weights from those search results to form the core word vector of the specific domain.
7. The method according to claim 6, characterized in that the ways of obtaining the seed queries of the specific domain comprise: approach 1, selecting, from all candidate requirement templates of the specific domain, the N2 candidates that cover the most queries in the search log and, for each of these N2 candidates, selecting the M1 most frequently issued queries among the queries it covers as seed queries, where N2 and M1 are positive integers; or approach 2, combining the preset slot keywords of the specific domain with the preset qualifiers of the specific domain to generate the seed queries of the specific domain; or approach 3, after some seed queries have been selected by approach 1, using the preset slot keyword dictionary of the specific domain to replace the slot keywords in those seed queries with other slot keywords from the dictionary to obtain expanded seed queries, the seed queries selected by approach 1 and the expanded seed queries together making up the seed queries of the specific domain.
8. The method according to claim 1, characterized in that the step of extracting the generalization capability feature of a candidate requirement template W comprises: determining the slot keyword sequences corresponding to W, counting the number of distinct slot keyword sequences among them, and computing the generalization capability feature of W from that number, where a slot keyword sequence corresponding to W is the sequence formed by the slot keywords in one query that W covers in the search log.
9. The method according to claim 1, characterized in that the step of extracting the boundary word feature of a candidate requirement template W comprises: splitting all candidate requirement templates of the specific domain into segments, selecting positive segments from the resulting segments and determining the weight of each positive segment to generate the positive vector of the specific domain, and selecting negative segments from the resulting segments and determining the weight of each negative segment to generate the negative vector of the specific domain; determining the weights of the segments of W and forming the vector of W from the segments of W and their weights; computing the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and obtaining the boundary word feature of W from the difference between S1 and S2.
10. The method according to claim 9, characterized in that the generation of the positive vector and the negative vector of the specific domain specifically comprises: determining the slot keyword sequences corresponding to each segment, where a slot keyword sequence corresponding to a segment is the sequence formed by the slot keywords in one query covered by a candidate requirement template that contains the segment; T1. if all slot keyword sequences corresponding to a segment are identical, taking the segment as a negative segment with weight 1; T2. if the slot keyword sequences corresponding to a segment are not all identical, but one slot keyword sequence accounts for a proportion P of all the segment's slot keyword sequences that exceeds a preset first threshold, taking the segment as a negative segment with weight P; T3. determining, for each candidate requirement template of the specific domain, the number of distinct slot keyword sequences corresponding to it and obtaining the maximum Z1 of these numbers, and, if a segment satisfies neither the condition in T1 nor the condition in T2 and the ratio of the number Z2 of distinct slot keyword sequences corresponding to the segment to Z1 exceeds a preset second threshold, taking the segment as a positive segment whose weight is the ratio of Z2 to Z1.
11. The method according to claim 9, characterized in that the step of determining the weights of the segments of W comprises: counting the number of times each segment of W occurs in W and taking that count as the weight of the corresponding segment.
12. The method according to claim 1, characterized in that step C comprises: selecting a standard template set from the candidate requirement templates; using the standard template set to train the parameters corresponding to the extracted features, and taking as the weights of the corresponding features the parameter values reached when the templates in the standard template set can no longer be ranked any higher among all candidate requirement templates; computing a score for each candidate requirement template from the extracted features and their weights, and sorting the candidate requirement templates by that score.
13. The method according to claim 12, characterized in that the step of selecting the standard template set from the candidate requirement templates comprises: sorting the candidate requirement templates by feature value separately for each extracted feature and, for each feature, taking the candidates ranked in the top N3 as the template set of that feature, where N3 is a positive integer; taking the intersection of the template sets of all features as the standard template set.
14. The method according to claim 1, characterized in that step D comprises: selecting the candidate requirement templates ranked in the top N4 as final requirement templates, where N4 is a positive integer; obtaining a keyword set from the boundary words of the candidate requirement templates ranked in the top M2, and selecting as final requirement templates those candidates ranked below the top N4 whose boundary words all belong to the keyword set, where a boundary word is a non-generalized word in a candidate requirement template, a keyword is a word synonymous with a boundary word or a word whose mutual information with a boundary word meets a requirement, and M2 is a positive integer less than or equal to N4.
15. An apparatus for generating domain requirement templates, characterized in that the apparatus comprises: a candidate template acquisition unit for obtaining candidate requirement templates of a specific domain; a feature extraction unit for extracting features of the candidate requirement templates, the feature extraction unit comprising at least one of a similarity feature extraction unit, a generalization capability feature extraction unit and a boundary word feature extraction unit, where the similarity feature extraction unit extracts a similarity feature characterizing how closely a candidate requirement template is related to the specific domain, the generalization capability feature extraction unit extracts a generalization capability feature characterizing the ability of a candidate requirement template to cover user search requests (queries), and the boundary word feature extraction unit extracts a boundary word feature characterizing the effect that the non-generalized words in a candidate requirement template have on the correctness of that template; a sorting unit for sorting the candidate requirement templates using the features extracted by the feature extraction unit; a selection unit for selecting, according to the sorting result of the sorting unit, final requirement templates from the candidate requirement templates as the requirement templates of the specific domain.
16. The apparatus according to claim 15, characterized in that the candidate template acquisition unit comprises: a restriction unit for selecting, from the search log, the user queries that match a preset qualifier of the specific domain; a generalization unit for replacing the parts of the queries selected by the restriction unit that match preset slot keywords of the specific domain with wildcards to obtain candidate requirement templates.
17. The apparatus according to claim 16, characterized in that the candidate template acquisition unit further comprises a filtering unit for filtering, according to a preset requirement on the number of slots for the specific domain, out of the candidate requirement templates obtained by the generalization unit those that do not satisfy the slot-count requirement.
18.根据权利要求15所述的装置,其特征在于,所述相似度提取单元包括:模版词向量生成单元,用于在提取候选需求模版W的相似度特征时,获取所述W的核心词向量;领域词向量生成单元,用于获取所述特定领域的核心词向量;计算单元,用于计算所述W的核心词向量与所述特定领域的核心词向量之间的相似度,并将该相似度作为所述W的相似度特征。 18. The apparatus according to claim 15, characterized in that, the similarity extracting unit comprises: word template vector generation unit, for extracting the candidate when the similarity feature needs stencil W, W to obtain the core word vector; word vector field generating unit, configured to obtain a core word vector for the specific area; calculation unit for calculating a similarity between the word W core vectors and the specific area of ​​the core word vector, and the characteristic of the degree of similarity as the similarity of W.
19.根据权利要求18所述的装置,其特征在于,所述模版词向量生成单元从所述W在搜索日志中覆盖的query里选取查询次数最多的N1个query,并在所述N1个query从搜索引擎返回的搜索结果中确定核心词及核心词的权重,以形成所述W的核心词向量,其中所述N1 为正整数。 19. The apparatus according to claim 18, wherein said vector generation unit selected word template most queries from query query number N1 of the cover W in the search logs, and the N1 in the query and determining the core words from the core word search engine search results returned by weight, to form the core of the W word vector, wherein said N1 is a positive integer.
20. The apparatus according to claim 18, wherein the domain word vector generating unit uses seed queries of the specific domain to obtain search results returned by a search engine, and determines core words and their weights in those search results, to form the core word vector of the specific domain.
21. The apparatus according to claim 20, wherein the domain word vector generating unit obtains the seed queries of the specific domain in one of the following ways: Mode 1: from all candidate requirement templates of the specific domain, select the N2 candidate requirement templates that cover the most queries in the search log, and, for each of those N2 candidate requirement templates, select the M1 most frequently issued queries among the queries it covers as seed queries, where N2 and M1 are positive integers; or Mode 2: combine the preset slot keywords of the specific domain with the preset qualifiers of the specific domain to generate the seed queries of the specific domain; or Mode 3: after selecting partial seed queries using Mode 1, use a preset slot keyword dictionary of the specific domain to replace the slot keywords in the seed queries selected by Mode 1 with other slot keywords from the dictionary, yielding expanded seed queries; the partial seed queries and the expanded seed queries together constitute the seed queries of the specific domain.
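The similarity feature of claim 18 compares two weighted core word vectors. The claims do not name a similarity measure, so the sketch below assumes cosine similarity over sparse vectors stored as Python dicts (word to weight); the function name is hypothetical, and how core words and weights are extracted from search results is left out.

```python
import math


def core_word_similarity(template_vector, domain_vector):
    """Cosine similarity between two sparse core word vectors (assumed measure)."""
    # Dot product over the words of the template vector; missing words contribute 0.
    dot = sum(weight * domain_vector.get(word, 0.0)
              for word, weight in template_vector.items())
    norm_template = math.sqrt(sum(w * w for w in template_vector.values()))
    norm_domain = math.sqrt(sum(w * w for w in domain_vector.values()))
    # An empty or all-zero vector has no direction; treat its similarity as 0.
    if norm_template == 0.0 or norm_domain == 0.0:
        return 0.0
    return dot / (norm_template * norm_domain)
```

With this choice, a template whose core words overlap heavily with the domain's core words scores near 1, and a template with no shared core words scores 0.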
22. The apparatus according to claim 15, wherein, when extracting the generalization ability feature of a candidate requirement template W, the generalization ability feature extraction unit determines the slot keyword sequences corresponding to W, counts the number of mutually distinct slot keyword sequences among them, and calculates the generalization ability feature of W from that number, where a slot keyword sequence of W is the sequence of slot keywords in one query covered by W in the search log.
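The counting step of claim 22 can be sketched as follows. Claim 22 only says the feature is calculated "from that number", so this sketch simply returns the count of distinct slot keyword sequences; any further normalization would be an additional assumption. The function name and the substring-based keyword lookup (first occurrence only) are illustrative simplifications.

```python
def generalization_feature(covered_queries, slot_keywords):
    """Count the distinct slot keyword sequences among the queries a template covers."""
    distinct_sequences = set()
    for query in covered_queries:
        # Record each slot keyword found in the query with its first position.
        found = [(query.find(keyword), keyword)
                 for keyword in slot_keywords if keyword in query]
        # The slot keyword sequence is the keywords in order of appearance.
        distinct_sequences.add(tuple(keyword for _, keyword in sorted(found)))
    return len(distinct_sequences)
```

A template that covers many different slot fillings therefore receives a larger value than one that is almost always instantiated with the same keywords.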
23. The apparatus according to claim 15, wherein the boundary word feature extraction unit comprises: a segmentation unit, configured to segment all candidate requirement templates of the specific domain into segments; a positive/negative vector generating unit, configured to select positive segments from the segments obtained by the segmentation unit and determine their weights to generate a positive vector of the specific domain, and to select negative segments from the obtained segments and determine their weights to generate a negative vector of the specific domain; a template vector generating unit, configured, when extracting the boundary word feature of a candidate requirement template W, to determine the weights of the segments of W and to form the vector of W from those segments and their weights; and a similarity calculating unit, configured to calculate the similarity S1 between the vector of W and the positive vector and the similarity S2 between the vector of W and the negative vector, and to obtain the boundary word feature of W from the difference between S1 and S2.
24. The apparatus according to claim 23, wherein the positive/negative vector generating unit comprises: a slot keyword sequence determining unit, configured to determine the slot keyword sequences corresponding to each segment, where a slot keyword sequence corresponding to a segment is the sequence of slot keywords in one query covered by a candidate requirement template containing that segment; and a positive/negative segment selecting unit, configured to select positive and negative segments from the segments, and to determine their weights, as follows: T1. if all slot keyword sequences corresponding to a segment are identical, the segment is taken as a negative segment with weight 1; T2. if the slot keyword sequences corresponding to a segment are not all identical, but one slot keyword sequence accounts for a proportion P of all the segment's slot keyword sequences that is greater than a preset first threshold, the segment is taken as a negative segment with weight P; T3. determine, for each candidate requirement template of the specific domain, the number of mutually distinct slot keyword sequences, and take the maximum Z1 of those numbers; if a segment satisfies neither T1 nor T2, and the ratio of the number Z2 of mutually distinct slot keyword sequences corresponding to the segment to Z1 is greater than a preset second threshold, the segment is taken as a positive segment whose weight is the ratio of Z2 to Z1.
25. The apparatus according to claim 23, wherein, when determining the weights of the segments of W, the template vector feature generating unit counts the number of times each segment of W occurs in W and uses that count as the weight of the corresponding segment.
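Claims 23 to 25 can be read together as a small procedure: classify each segment with rules T1 to T3, build the positive and negative vectors, then score a template by S1 minus S2. The Python sketch below follows that reading under stated assumptions: cosine similarity is assumed for S1 and S2 (the claims do not name a measure), the threshold defaults are placeholders, and all function names are hypothetical. Slot keyword sequences are represented as tuples.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(weight * vec_b.get(key, 0.0) for key, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_segment(sequences, z1, first_threshold=0.8, second_threshold=0.5):
    """Apply rules T1-T3 to one segment; returns (label, weight).

    `sequences` lists the slot keyword sequences (tuples) seen with the segment;
    `z1` is the largest number of distinct sequences over all candidate templates.
    """
    if not sequences or z1 <= 0:
        return None, 0.0
    counts = Counter(sequences)
    # T1: every sequence is identical, so the segment is negative with weight 1.
    if len(counts) == 1:
        return "negative", 1.0
    # T2: one sequence dominates with proportion P above the first threshold.
    p = max(counts.values()) / len(sequences)
    if p > first_threshold:
        return "negative", p
    # T3: enough distinct sequences relative to Z1 makes the segment positive.
    ratio = len(counts) / z1
    if ratio > second_threshold:
        return "positive", ratio
    return None, 0.0


def boundary_word_feature(template_segments, positive_vector, negative_vector):
    """S1 - S2, where the template vector weights each segment by its occurrence count."""
    template_vector = Counter(template_segments)
    s1 = cosine_similarity(template_vector, positive_vector)
    s2 = cosine_similarity(template_vector, negative_vector)
    return s1 - s2
```

The positive and negative vectors passed to `boundary_word_feature` would be dicts that map each selected segment to the weight returned by `classify_segment`.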
26. The apparatus according to claim 15, wherein the ranking unit comprises: a standard template set selecting unit, configured to select a standard template set from the candidate requirement templates; a training unit, configured to use the standard template set to train the parameters corresponding to the extracted features, and to take, as the weight of each feature, the parameter value at which the templates in the standard template set can be ranked no higher among all candidate requirement templates; and a calculating and ranking unit, configured to calculate a score for each candidate requirement template using the features extracted by the feature extraction unit and the feature weights obtained by the training unit, and to rank the candidate requirement templates by that score.
27. The apparatus according to claim 26, wherein the standard template set selecting unit comprises: a template set determining unit, configured to rank the candidate requirement templates by feature value for each extracted feature, and, for each feature, to take the candidate requirement templates ranked in the top N3 positions as the template set of that feature, where N3 is a positive integer; and an intersection unit, configured to take the intersection of the template sets of all features as the standard template set.
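The standard template set of claim 27 and the scoring step of claim 26 can be sketched in Python as below. The intersection of per-feature top-N3 sets follows the claim directly; the scoring function assumes a linear weighted sum of features, which the claims do not state explicitly, and the training of the weights is omitted. The data layout (template mapped to a dict of feature values) and the function names are hypothetical.

```python
def standard_template_set(features, n3):
    """Intersect the top-n3 templates of every feature.

    `features` maps each template to a dict of feature name -> value.
    """
    feature_names = sorted({name for values in features.values() for name in values})
    top_sets = []
    for name in feature_names:
        # Rank all templates by this one feature, highest value first.
        ranked = sorted(features, key=lambda t: features[t].get(name, 0.0), reverse=True)
        top_sets.append(set(ranked[:n3]))
    return set.intersection(*top_sets) if top_sets else set()


def rank_templates(features, weights):
    """Rank templates by a weighted sum of their features (assumed scoring form)."""
    def score(template):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features[template].items())
    return sorted(features, key=score, reverse=True)
```

In the claimed apparatus the weights would come from the training unit, which tunes them until the standard templates cannot be ranked any higher; here they are simply passed in.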
28. The apparatus according to claim 15, wherein the selecting unit comprises: a first selecting unit, configured to select the candidate requirement templates ranked in the top N4 positions as final requirement templates, where N4 is a positive integer; and a second selecting unit, configured to obtain a keyword set from the boundary words of the candidate requirement templates ranked in the top M2 positions, and to select as final requirement templates those candidate requirement templates ranked after the top N4 positions whose boundary words all belong to the keyword set, where a boundary word is a word in a candidate requirement template that has not been generalized, a keyword is a word synonymous with a boundary word or a word whose mutual information with a boundary word satisfies a requirement, and M2 is a positive integer less than or equal to N4.
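The two-stage selection of claim 28 can be sketched as follows. The synonym or mutual-information lookup is abstracted into a hypothetical `related_words` dict, and the sketch also keeps the boundary words themselves in the keyword set, which is an assumption the claim does not state. Function and parameter names are illustrative only.

```python
def select_final_templates(ranked, boundary_words, n4, m2, related_words=None):
    """Pick final templates from a ranked list of candidate templates.

    `boundary_words` maps each template to its non-generalized words;
    `related_words` maps a boundary word to its synonyms or high mutual
    information words (hypothetical lookup table).
    """
    if m2 > n4:
        raise ValueError("m2 must be less than or equal to n4")
    related_words = related_words or {}
    # First selecting unit: the top n4 templates are always kept.
    final = list(ranked[:n4])
    # Build the keyword set from the boundary words of the top m2 templates.
    keyword_set = set()
    for template in ranked[:m2]:
        for word in boundary_words[template]:
            keyword_set.add(word)  # assumption: a boundary word counts as its own keyword
            keyword_set.update(related_words.get(word, ()))
    # Second selecting unit: rescue lower-ranked templates made only of known keywords.
    for template in ranked[n4:]:
        if all(word in keyword_set for word in boundary_words[template]):
            final.append(template)
    return final
```

The second stage lets a lower-ranked template such as "rail from [X] to [X]" through when "rail" is listed as related to a boundary word of a top-ranked template.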
CN 201110308830 2011-10-12 Method and device of producing domain required template CN102368260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110308830 CN102368260B (en) 2011-10-12 Method and device of producing domain required template

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110308830 CN102368260B (en) 2011-10-12 Method and device of producing domain required template

Publications (2)

Publication Number Publication Date
CN102368260A (en) 2012-03-07
CN102368260B (en) 2016-12-14

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136221A (en) * 2011-11-24 2013-06-05 北京百度网讯科技有限公司 Method capable of generating requirement template and requirement identification method and device
CN103823809A (en) * 2012-11-16 2014-05-28 百度在线网络技术(北京)有限公司 Query phrase classification method and device, and classification optimization method and device
CN105183721A (en) * 2015-08-13 2015-12-23 小米科技有限责任公司 Template construction method, and information extraction method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6516312B1 (en) * 2000-04-04 2003-02-04 International Business Machines Corporation System and method for dynamically associating keywords with domain-specific search engine queries
CN1514387A (en) * 2002-12-31 2004-07-21 中国科学院计算技术研究所 Sound distinguishing method in speech sound inquiry
CN101216853A (en) * 2008-01-11 2008-07-09 孟小峰 Intelligent web enquiry interface system and its method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
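Liu Liangliang (刘亮亮) et al.: "基于查询模板的特定领域中文问答系统的研究与实现" [Research and implementation of a domain-specific Chinese question answering system based on query templates], 《江苏科技大学学报(自然科学版)》 [Journal of Jiangsu University of Science and Technology (Natural Science Edition)], vol. 25, no. 2, 15 April 2011 (2011-04-15), pages 163 - 168 *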

Similar Documents

Publication Publication Date Title
US20070112838A1 (en) Method and system for classifying media content
Varadarajan et al. A system for query-specific document summarization
US20130254209A1 (en) Consensus search device and method
Nguyen et al. Learning to extract form labels
CN101364239A (en) Method for auto constructing classified catalogue and relevant system
CN101609450A (en) Web page classification method based on training set
CN101059806A (en) Word sense based local file searching method
CN101251841A (en) Method for establishing and searching feature matrix of Web document based on semantics
CN101373532A (en) FAQ Chinese request-answering system implementing method in tourism field
Bouros et al. Spatio-textual similarity joins
US20150088894A1 (en) Producing sentiment-aware results from a search query
CN101377777A (en) Automatic inquiring and answering method and system
CN102270234A (en) An image search method and search engines
CN101853272A (en) Search engine technology based on relevance feedback and clustering
Zhang et al. Narrative text classification for automatic key phrase extraction in web document corpora
US8515731B1 (en) Synonym verification
CN101819578A (en) Retrieval method, method and device for establishing index and retrieval system
CN101620596A (en) Multi-document auto-abstracting method facing to inquiry
CN1158460A (en) Multiple languages automatic classifying and searching method
CN101620625A (en) Method, device and search engine for sequencing searching keywords
US20130013612A1 (en) Techniques for comparing and clustering documents
CN103902652A (en) Automatic question-answering system
CN103425687A (en) Retrieval method and system based on queries
CN102087669A (en) Intelligent search engine system based on semantic association
KR20060122276A (en) Relation extraction from documents for the automatic construction of ontologies

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model