CN102081601B - Field word identification method and device - Google Patents


Info

Publication number
CN102081601B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 200910241287
Other languages
Chinese (zh)
Other versions
CN102081601A (en
Inventor
于亮
张宇峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Original Assignee
Beijing Kingsoft Software Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Software Co Ltd, Beijing Jinshan Digital Entertainment Technology Co Ltd filed Critical Beijing Kingsoft Software Co Ltd
Priority to CN 200910241287 priority Critical patent/CN102081601B/en
Publication of CN102081601A publication Critical patent/CN102081601A/en
Publication of CN102081601B publication Critical patent/CN102081601B/en

Abstract

Embodiments of the invention disclose a field word identification method and device. The scheme is based on a search engine: from the search engine's results for the word to be identified, it determines the field keywords of a field to which that word may belong; it calculates, from predetermined information about those field keywords together with the search result, the score with which the word to be identified belongs to the field; it compares the score with the field conformity threshold of the field; and it determines from the comparison result whether the word to be identified belongs to the field. By exploiting the characteristics of the search engine, the scheme obtains a corpus that is highly relevant to the word to be identified, which greatly improves the speed and accuracy of field word identification.

Description

Field word identification method and device
Technical field
The present invention relates to the field of information identification, and in particular to a field word identification method and device.
Background art
A field word is a feature word with a strong capacity to represent text: it clearly expresses the content characteristics of a text, such as its field category, topic, and central meaning. According to how much of a field a word covers, field words can be further divided into general field words and category-specific field words.
General field words are the basic words that express a field and represent its core features, such as "match" and "team" for the sports category. Category-specific field words are highly specific and highly discriminative, and they capture the detailed features of a field. Words such as "WBA" and "boxing champion", for example, not only distinguish the sports category from other categories but also distinguish the boxing sub-category within sports.
A field word is strongly representative of the field it belongs to. In Chinese information processing, field words are important for tasks such as text classification, information retrieval, and subject-term index conversion. Field words are already applied with good results in text classification, where the selection of text features and the representation of text are the most important starting points for improving performance. Experiments show that text feature selection methods based on field words and key phrases, which strongly represent the text, considerably improve classification results. In information retrieval, and in vertical search in particular, field words can also considerably improve the accuracy of the returned results.
Applying field words presupposes that they have been identified accurately. At present, field word identification (or terminology extraction) mainly uses rule-based methods, statistics-based methods, and methods that combine rules and statistics.
Through research on the prior art, the inventors found the following. Rule-based methods identify and extract terms by means of linguistic rules. Such rules are hard to find, and with the rapid development of the Internet and increasingly diverse forms of expression they are becoming harder still. Currently the rules are mainly found manually and then applied in automatic computer identification, so field words are identified slowly, identification lags seriously behind the growth of information, and accuracy is limited by human knowledge. For statistics-based methods, identification performance depends on the algorithm model and on the amount of information the text corpus provides. Optimizing the model and the algorithm can improve performance to some extent, but a text corpus often contains field words from several fields, and this overlap makes field word identification very difficult.
Summary of the invention
In view of this, embodiments of the invention provide a field word identification method and device that identify field words quickly and accurately.
To achieve the above object, embodiments of the invention provide the following technical solutions:
A field word identification method, comprising:
searching for the word to be identified in a search engine, obtaining the sub-results in the search result, and recording the position at which each sub-result appears;
determining, with reference to predetermined field keyword information, the field keywords that appear in the sub-results of the search result, the field keyword information comprising the field keywords and the weight of each field keyword in the field it belongs to;
calculating, from the parameters of the field keywords, the score with which the word to be identified belongs to the field corresponding to the field keywords, the parameters of a field keyword comprising the positions at which it appears in the sub-results and its number of occurrences;
comparing the score with a predetermined field conformity threshold, and determining from the comparison result whether the word to be identified belongs to the field corresponding to the field keywords.
Optionally, the parameters of the field keyword further comprise the length of the field keyword.
Optionally, the parameters of the field keyword further comprise the weight of the field keyword.
Calculating, from the parameters of each field keyword, the score with which the word to be identified belongs to the field corresponding to the field keywords comprises:
calculating the score according to the following formula:
$$\mathrm{Score}(P) = \frac{1}{m}\sum_{q=1}^{m}\sum_{i=1}^{k} \mathrm{Weight}(CW_i) \cdot w_{iq} \cdot f_{iq} \cdot \ln(l_i + \alpha),$$
$$w_{iq} = \begin{cases} \ln\left(\dfrac{m-q}{m} + \beta\right), & q = y,\ y \in 1 \sim m \\ \dfrac{\lambda}{f_{iq}}, & q = -1 \end{cases}$$
where: $P$ is the search result corresponding to the word to be identified; $\mathrm{Score}(P)$ is the score with which the word to be identified belongs to the field corresponding to the field keywords; $\mathrm{Weight}(CW_i)$ is the weight of the $i$-th field keyword; $m$ is the number of recorded sub-results; $k$ is the number of field keywords that appear in the sub-results; $w_{iq}$ is the weight of the $i$-th field keyword at position $q$ of the search result; $f_{iq}$ is the number of occurrences of the $i$-th field keyword at position $q$; $l_i$ is the word length of the $i$-th field keyword; $\alpha$ is the word-length adjustment constant; $\lambda$ is a weighting constant; $\beta$ is an adjustment constant. When the field keyword lies in the related-searches region of the search result, $q = -1$; otherwise $q = y$, $y \in 1 \sim m$, where $y$ is the position, within the body of the search result, of the sub-result containing the field keyword.
The process of determining the field keyword information comprises:
selecting N field words of a field and searching for the N field words in a search engine to obtain N search results;
recording the sub-results in each search result and the position at which each sub-result appears;
choosing M field keywords from the N search results, and calculating, from the parameters of each field keyword, the weight with which that keyword belongs to its corresponding field.
When the parameters of a field keyword comprise the positions at which it appears in each search result, its number of occurrences, and its length, calculating the weight with which each field keyword belongs to its corresponding field comprises:
calculating the weight according to the following formula:
$$\mathrm{Weight}(CW_i) = \frac{1}{n}\sum_{j=1}^{n}\frac{1}{m}\sum_{q=1}^{m} w_{ijq} \cdot f_{ijq} \cdot \ln(l_i + \alpha),$$
$$w_{ijq} = \begin{cases} \ln\left(\dfrac{m-q}{m} + \beta\right), & q = y,\ y \in 1 \sim m \\ \dfrac{\lambda}{f_{ijq}}, & q = -1 \end{cases}$$
where: $CW_i$ is the $i$-th field keyword; $\mathrm{Weight}(CW_i)$ is the weight of field keyword $CW_i$; $n$ is the number of search results; $m$ is the number of sub-results recorded in the body of each search result; $q$ is the position, within the search result, of the sub-result containing the field keyword; $w_{ijq}$ is the weight of the $i$-th field keyword at position $q$ of the $j$-th search result; $f_{ijq}$ is the number of occurrences of the $i$-th field keyword at position $q$ of the $j$-th search result. When the field keyword lies in the related-searches region of the search result, $q = -1$; otherwise $q = y$, $y \in 1 \sim m$, where $y$ is the position, within the body of the search result, of the sub-result containing the field keyword. $l_i$ is the word length of the $i$-th field keyword; $\alpha$ is an adjustment constant; $\lambda$ is a weighting constant; $\beta$ is an adjustment constant.
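The weight formula above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `keyword_weight`, the dictionary layout of `results`, and the default constants (α = 0.1, λ = e, β = 2.8, m = 10, borrowed from the worked example later in the description) are assumptions. The sketch also assumes that the related-searches position (q = -1) contributes to the inner sum and is divided by m like the body positions.

```python
import math

def keyword_weight(results, word_len, m=10, alpha=0.1, lam=math.e, beta=2.8):
    """Average per-result contribution of one field keyword (hypothetical helper).

    results: one dict per search result, mapping position q -> occurrence count,
    with q in 1..m for body sub-results and q == -1 for related searches.
    """
    length_factor = math.log(word_len + alpha)        # ln(l_i + alpha)
    total = 0.0
    for occurrences in results:                        # sum over j = 1..n
        inner = 0.0
        for q, f in occurrences.items():
            if f <= 0:
                continue
            # Position weight: lambda / f for related searches, log term otherwise.
            w = lam / f if q == -1 else math.log((m - q) / m + beta)
            inner += w * f * length_factor
        total += inner / m                             # the 1/m factor
    return total / len(results)                        # the 1/n factor
```

A keyword that is absent from a search result is represented by an empty dictionary, so that result still counts towards n.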
The process of determining the field conformity threshold comprises:
calculating, from the parameters of each field keyword, the score with which each search result belongs to the field corresponding to the field keywords;
determining the field conformity threshold of that field from the scores calculated for the N search results.
When the parameters of a field keyword comprise the positions at which it appears in each search result, its number of occurrences, its weight, and its length, calculating the score with which each search result belongs to the field corresponding to the field keywords comprises:
calculating the score according to the following formula:
$$\mathrm{Score}(P_x) = \frac{1}{m}\sum_{q=1}^{m}\sum_{i=1}^{k} \mathrm{Weight}(CW_i) \cdot w_{xiq} \cdot f_{xiq} \cdot \ln(l_i + \alpha),$$
$$w_{xiq} = \begin{cases} \ln\left(\dfrac{m-q}{m} + \beta\right), & q = y,\ y \in 1 \sim m \\ \dfrac{\lambda}{f_{xiq}}, & q = -1 \end{cases}$$
where: $P_x$ is the search result page corresponding to the $x$-th field word; $\mathrm{Score}(P_x)$ is the category score of the search result page corresponding to the $x$-th field word; $\mathrm{Weight}(CW_i)$ is the weight of the $i$-th field keyword; $m$ is the number of sub-results in the search result page; $k$ is the number of field keywords that appear in the sub-results; $w_{xiq}$ is the weight of the $i$-th field keyword at position $q$ of the search result page corresponding to the $x$-th field word; $f_{xiq}$ is the number of occurrences of the $i$-th field keyword at position $q$ of the search result page corresponding to the $x$-th field word. When the field keyword lies in the related-searches region of the search result, $q = -1$; otherwise $q = y$, $y \in 1 \sim m$, where $y$ is the position, within the body of the search result, of the sub-result containing the field keyword. $l_i$ is the word length of the $i$-th field keyword; $\alpha$ is an adjustment constant; $\lambda$ is a weighting constant; $\beta$ is an adjustment constant.
A field word identification device, comprising:
an acquiring unit, configured to search for the word to be identified in a search engine, obtain the sub-results in the search result, and record the position at which each sub-result appears;
an analysis unit, configured to determine, with reference to predetermined field keyword information, the field keywords that appear in the sub-results of the search result, the field keyword information comprising the field keywords and the weight of each field keyword in the field it belongs to;
a computing unit, configured to calculate, from the parameters of the field keywords that appear in the sub-results, the score with which the word to be identified belongs to the field corresponding to the field keywords, the parameters of a field keyword comprising the positions at which it appears in the sub-results and its number of occurrences;
an evaluation unit, configured to compare the score with a predetermined field conformity threshold and determine from the comparison result whether the word to be identified belongs to the field corresponding to the field keywords.
Optionally, the parameters of the field keyword further comprise the length of the field keyword.
Optionally, the parameters of the field keyword further comprise the weight of the field keyword.
The analysis unit comprises:
a search subunit, configured to select N field words of a field in advance and search for the N field words in a search engine to obtain N search results;
a recording subunit, configured to record the sub-results in each search result and the position at which each sub-result appears;
a first computation subunit, configured to choose M field keywords from the N search results and calculate, from the parameters of each field keyword, the weight with which that keyword belongs to its corresponding field;
a first determining subunit, configured to determine, with reference to the field keywords, the field keywords that appear in the sub-results of the search result.
The evaluation unit comprises:
a second computation subunit, configured to calculate, from the parameters of each field keyword, the score with which each search result belongs to the field corresponding to the field keywords;
a second determining subunit, configured to determine the field conformity threshold of that field from the scores calculated for the N search results;
a third determining subunit, configured to determine, when the score reaches the field conformity threshold, that the word to be identified belongs to the field corresponding to the field keywords.
It can be seen that, in embodiments of the invention, the word to be identified is searched for in a search engine, and the sub-results in the search result and the position at which each sub-result appears are recorded; the field keywords that appear in the sub-results are determined with reference to predetermined field keyword information, which comprises the field keywords and the weight of each field keyword in the field it belongs to; the score with which the word to be identified belongs to the field corresponding to the field keywords is calculated from the weights of the field keywords that appear in the sub-results and from the positions and numbers of occurrences of those keywords in the sub-results; and when the score reaches the predetermined field conformity threshold, the word to be identified is determined to belong to that field. The scheme is based on a search engine: from the search engine's results for the word to be identified, it determines the field keywords of a field to which the word may belong, calculates the score with which the word belongs to that field from the predetermined information about those keywords together with the search result, compares the score with the field conformity threshold of the field, and determines from the comparison result whether the word belongs to the field. By exploiting the characteristics of the search engine itself, the scheme obtains a corpus that is highly relevant to the word to be identified, which greatly improves the speed and accuracy of field word identification.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below evidently show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the method provided by an embodiment of the invention;
Fig. 2-1 is the upper half of a screenshot of one step of the method provided by an embodiment of the invention;
Fig. 2-2 is the lower half of a screenshot of one step of the method provided by an embodiment of the invention;
Fig. 3 is a flowchart of the method provided by another embodiment of the invention;
Fig. 4 is a schematic diagram of the model structure corresponding to the method provided by an embodiment of the invention;
Fig. 5 is a flowchart of the method corresponding to the model shown in Fig. 4;
Fig. 6 is a schematic diagram of one function used in the method provided by an embodiment of the invention;
Fig. 7 is a schematic diagram of another function used in the method provided by an embodiment of the invention;
Fig. 8 is a structural diagram of the device provided by an embodiment of the invention;
Fig. 9 is a structural diagram of one unit of the device provided by an embodiment of the invention;
Fig. 10 is a structural diagram of another unit of the device provided by an embodiment of the invention.
Detailed description of the embodiments
Fig. 1 shows a field word identification method provided by an embodiment of the invention, comprising:
S101: search for the word to be identified in a search engine, obtain the sub-results in the search result, and record the position at which each sub-result appears.
The method provided by the embodiments of the invention uses an existing search engine to identify the word to be identified.
When content to be queried is entered into a search engine as a search word, the search engine returns the information most relevant to that content, and this information includes corpus text. Therefore, when the word to be identified is entered into a search engine as the search word, the information returned is information that is highly relevant to the word to be identified.
In general, the field of a word to be identified cannot be determined from the word itself, but it can be judged from the information that is highly relevant to the word.
The method places no restriction on the specific search engine; it can be, for example, the google search engine, the Baidu search engine, or any other search engine.
For convenience, the method is described below using the google search engine as an example.
For example, we enter the word to be identified "sharp green pepper shredded meat" into the google search engine; the results returned by the search are shown in Fig. 2-1 and Fig. 2-2. Fig. 2-1 is the upper half of the search result and Fig. 2-2 is the lower half. As Fig. 2-1 and Fig. 2-2 show, searching for the word to be identified returns a great deal of content: Fig. 2-2 shows that the returned result contains at least 10 pages, and each page contains many sub-entries. Overall, however, a search result consists of two parts. The first is the body of the search result, which makes up most of each result page and contains many sub-entries, for example the first sub-entry in Fig. 2-1, "[picture and text] sharp green pepper shredded meat-green pepper-cuisines", with its corresponding text "raw material ... soy sauce one little spoon ... put into after stir-frying evenly ...". The second is the related-searches part at the bottom of each result page, which in Fig. 2-2 includes "sharp green pepper shredded meat way", "how sharp green pepper shredded meat is cooked", and so on. For convenience, the embodiments of the invention refer to each sub-entry in the body of the search result as a sub-result of the search result.
Whenever the search engine searches for a search word, all of the returned content is the search result for that word, whether it spans one page or 10 pages. By the principle of search engines, content more relevant to the search word is ranked higher. To improve processing speed, the following description uses the content of the first page of the search result as the basis of the embodiment; this does not preclude using any other content of the search result to implement the invention in practice.
Referring to Fig. 2-1 and Fig. 2-2, after searching for the word to be identified "sharp green pepper shredded meat", the body of the first page of the search result contains 10 sub-results, and the related searches on the first page contain a further 5 related sub-results. Once these sub-results are obtained, the content of each one and the position at which it appears are recorded: whether the sub-result appears in the body or in the related-searches part and, if it appears in the body, its rank among the 10 sub-results.
Taking the first entry above as an example, its content is "[picture and text] sharp green pepper shredded meat-green pepper-cuisines" with the corresponding text "raw material ... soy sauce one little spoon ... put into after stir-frying evenly ...", and its position is: body, first entry.
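The recording step described above can be sketched as a small Python data structure. The names `SubResult` and `record_sub_results` are hypothetical, and using position -1 for related-search entries is an assumption that mirrors the q = -1 convention of the scoring formula.

```python
from dataclasses import dataclass

@dataclass
class SubResult:
    content: str    # title plus snippet text of the entry
    region: str     # "body" or "related"
    position: int   # rank 1..m in the body; -1 for related searches

def record_sub_results(body_entries, related_entries):
    """Record content, region, and position for every sub-result of one page."""
    records = [SubResult(text, "body", rank)
               for rank, text in enumerate(body_entries, start=1)]
    records += [SubResult(text, "related", -1) for text in related_entries]
    return records
```

For the first page in the example, `body_entries` would hold the 10 body sub-results in ranked order and `related_entries` the 5 related-search entries.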
S102: determine, with reference to predetermined field keyword information, the field keywords that appear in the sub-results of the search result, the field keyword information comprising the field keywords and the weight of each field keyword in the field it belongs to.
After the search engine has searched for the word to be identified and the relevant information in the search result has been obtained, the field keywords that appear in the sub-results can be determined with reference to the predetermined field keyword information.
A field keyword is a characteristic word of a field. Taking the cooking field as an example, its field keywords include: name of the dish, way (that is, how a dish is prepared), recipe, food, western-style food, menu, complete collection of dish names, and so on. The field keyword information comprises each field keyword and its weight in the field. For example, according to statistics calculated in advance, the weight of the field keyword "way" in the cooking field is about 1.46, and the weight of the field keyword "food" in the cooking field is about 0.04.
In the method provided by the embodiments of the invention, determining whether a word to be identified belongs to field A generally requires that the field keyword information of field A be determined in advance, that is, the field keywords of field A and the weight of each keyword in field A.
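Step S102 can be sketched in Python as follows. This is only an illustration: the function name `keyword_occurrences`, the `(position, text)` input layout, and the use of plain substring counting are assumptions. Substring matching suits Chinese keywords, which are not delimited by spaces, but it would over-match in English text.

```python
def keyword_occurrences(sub_results, keyword_weights):
    """Find which field keywords occur in the sub-results (hypothetical helper).

    sub_results: list of (position, text) pairs, position -1 for related searches.
    keyword_weights: dict mapping each field keyword to its weight in the field.
    Returns {keyword: {position: occurrence count}} for keywords that occur.
    """
    found = {}
    for position, text in sub_results:
        for keyword in keyword_weights:
            count = text.count(keyword)   # non-overlapping substring matches
            if count:
                by_position = found.setdefault(keyword, {})
                by_position[position] = by_position.get(position, 0) + count
    return found
```

With the weights quoted above, `keyword_weights` could be `{"way": 1.46, "food": 0.04}`.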
S103: calculate, from the parameters of the field keywords that appear in the sub-results, the score with which the word to be identified belongs to the field corresponding to the field keywords.
In the embodiments of the invention, the parameters of the field keywords are the key to calculating the score with which the word to be identified belongs to the corresponding field. The parameters of a field keyword comprise the positions at which it appears in the sub-results and its number of occurrences. To distinguish how strongly different field keywords of the same field influence that field, in one embodiment the parameters also comprise the weight of the field keyword. In another embodiment, to reflect the influence of the keyword's own length on the field it belongs to, the parameters also comprise the length of the field keyword.
Which parameters of the field keyword are used can be decided according to actual needs; the invention places no restriction on this.
S104: when the score reaches the predetermined field conformity threshold, determine that the word to be identified belongs to the field corresponding to the field keywords.
In the embodiments of the invention, the score with which the word to be identified belongs to the field corresponding to the field keywords is the basis for judging whether the word belongs to that field. Specifically, the score with which the word belongs to field A is compared with the field conformity threshold of field A: if the score reaches the threshold, the word belongs to field A; otherwise it does not.
The method provided by the embodiments of the invention is based on a search engine: from the search engine's results for the word to be identified, it determines the field keywords of field A, to which the word may belong, calculates the score with which the word belongs to field A from the information about those keywords together with the search result, compares the score with the field conformity threshold of field A, and determines from the comparison result whether the word belongs to field A. By exploiting the characteristics of the search engine itself, the method obtains a corpus that is highly relevant to the word to be identified, which greatly improves the speed and accuracy of field word identification.
As described above, to use a search engine to search for a word to be identified and judge from the search result whether it belongs to field A, the field keyword information of field A and the field conformity threshold of field A must be determined in advance, so that whether the word belongs to field A can then be determined in combination with the search result. The method provided by the embodiments of the invention is described in detail below with reference to the embodiment shown in Fig. 3. In that embodiment, the search results of the google search engine are used to identify the word "sharp green pepper shredded meat" and judge whether it is a field word of the "cooking" field. The method comprises:
S301: enter "sharp green pepper shredded meat" into the google search engine and search, taking the content of the first page of the search result (see Fig. 2-1 and Fig. 2-2) as the basis for this identification.
S302: obtain the content of all sub-results on the first page of the search result, the region in which each appears, and the position at which it appears.
The region of appearance is either the body part of the search result or the related-searches part. To select information more accurately, the search result web page can be pre-processed, for example by removing web page tags and noise words. Noise words in this embodiment include "snapshots of web pages", "similar webpage", the URLs contained in the page, and so on.
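The pre-processing mentioned above could look like the following Python sketch. The function name `clean_sub_result`, the regular expressions, and the English noise strings (taken from this translation rather than from the actual result page) are assumptions for illustration only.

```python
import re

# Noise words as rendered in this translation; the real page strings may differ.
NOISE_WORDS = ("snapshots of web pages", "similar webpage")

def clean_sub_result(html):
    """Strip tags, URLs, and noise words from one sub-result (hypothetical helper)."""
    text = re.sub(r"<[^>]+>", " ", html)                  # remove web page tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URLs
    for noise in NOISE_WORDS:
        text = text.replace(noise, " ")
    return re.sub(r"\s+", " ", text).strip()              # normalize whitespace
```

A simple tag-stripping regular expression is enough for a sketch; a production system would more likely use an HTML parser.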
S303: determine in advance the field keywords of the cooking field and the weights of those field keywords.
S304: by comparison with the field keywords of the cooking field, determine which cooking field keywords appear in the sub-results of the first page of the current search result.
S305: calculate the score with which the word to be identified, "sharp green pepper shredded meat", belongs to the corresponding field, the cooking field, from the positions at which the cooking field keywords appear in the sub-results of the first page of the current search result, their numbers of occurrences, and their lengths.
In this embodiment, the parameters of a field keyword comprise the positions at which it appears in the sub-results, its number of occurrences, and its length; the score with which the word to be identified belongs to the field corresponding to the field keywords can be calculated according to formula 1 and formula 2.
$$\mathrm{Score}(P) = \frac{1}{m}\sum_{q=1}^{m}\sum_{i=1}^{k} \mathrm{Weight}(CW_i) \cdot w_{iq} \cdot f_{iq} \cdot \ln(l_i + \alpha) \qquad \text{(Formula 1)}$$
$$w_{iq} = \begin{cases} \ln\left(\dfrac{m-q}{m} + \beta\right), & q = y,\ y \in 1 \sim m \\ \dfrac{\lambda}{f_{iq}}, & q = -1 \end{cases} \qquad \text{(Formula 2)}$$
where: $P$ is the search result corresponding to the word to be identified; $\mathrm{Score}(P)$ is the score with which the word to be identified belongs to the field corresponding to the field keywords; $\mathrm{Weight}(CW_i)$ is the weight of the $i$-th field keyword; $m$ is the number of recorded sub-results; $k$ is the number of field keywords that appear in the sub-results; $w_{iq}$ is the weight of the $i$-th field keyword at position $q$ of the search result; $f_{iq}$ is the number of occurrences of the $i$-th field keyword at position $q$; $l_i$ is the word length of the $i$-th field keyword; $\alpha$ is the word-length adjustment constant; $\lambda$ is a weighting constant; $\beta$ is an adjustment constant.
When the field keyword is located in the related-search part of the search results, q = -1; when it is located in the body part of the search results, q = y, y ∈ 1~m, where y is the position, within the body of the search results, of the sub-result containing the keyword. For example, if the sub-result containing the field keyword is the 2nd in the body of the search results, then q = 2 and, by formula 2, $w_{iq} = \ln\left(\frac{m-2}{m} + \beta\right)$.
Here λ is a weighting constant and β is an adjustment constant; in this example, λ may be taken as the natural constant e, and β as 2.8.
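The position weight of formula 2 can be sketched in a few lines. The function name `position_weight` and its argument order are illustrative assumptions; the defaults β = 2.8 and λ = e are the values given for this example.

```python
import math

def position_weight(q, m, freq, beta=2.8, lam=math.e):
    """Position weight w_iq of formula 2.

    q == -1 marks the related-search part; 1 <= q <= m is the rank of the
    sub-result in the body of the search results; freq is f_iq.
    """
    if q == -1:
        return lam ** math.sqrt(freq)          # lambda ** sqrt(f_iq)
    if 1 <= q <= m:
        return math.log((m - q) / m + beta)    # ln((m - q)/m + beta)
    raise ValueError("q must be -1 or lie in 1..m")
```

For a body position the frequency does not enter the weight; it only matters in the related-search branch.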
As formula 1 shows, in this embodiment whether a word to be identified belongs to a field is judged by the score with which its search result page belongs to that field. This is because, by the principle of search engines, the search results are the content most closely related to the field word to be identified and can therefore characterize its attributes.
For example, for the culinary area we choose 2 field keywords, "way" and "recipe", whose weights are Weight(way) = 0.5 and Weight(recipe) = 0.2.
For a field word Z to be identified, entering Z in the search engine yields page P with m = 10 sub-results in total. "way" appears at positions 1 and 2 with word frequencies (numbers of occurrences) 3 and 2 respectively; "recipe" appears at positions 1 and 4 with word frequencies 5 and 6 respectively; and at the "related search" position "way" occurs 7 times and "recipe" occurs 8 times. The score of this page is then calculated as follows:
At the first position there are two field keywords, "way" and "recipe", and the score is:
0.5*ln((10-1)/10+2.8)*3*ln(2+0.1)
+0.2*ln((10-1)/10+2.8)*5*ln(2+0.1)
At the second position there is one field keyword, "way", and the score is:
0.5*ln((10-2)/10+2.8)*2*ln(2+0.1)
………
At the "related search" position there are two field keywords, and the score is:
0.5*e^(√7)*7*ln(2+0.1)
+0.2*e^(√8)*8*ln(2+0.1)
Adding all these scores together and dividing by 10 gives the score of this page.
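The worked example for word Z can be reproduced with a short script. This is a sketch under stated assumptions: the occurrence data are those given for Z (including "recipe" at position 4, which the elided lines cover), α = 0.1 and word length 2 are read off the factor ln(2+0.1), and the related-search terms are simply added to the body terms before dividing by m, as the example does. All identifiers are illustrative.

```python
import math

M, ALPHA, BETA, LAMBDA = 10, 0.1, 2.8, math.e

WEIGHTS = {"way": 0.5, "recipe": 0.2}
LENGTHS = {"way": 2, "recipe": 2}   # word length 2, as in ln(2 + 0.1)
# keyword -> {position: occurrences}; position -1 is the related-search part
OCCURRENCES = {
    "way": {1: 3, 2: 2, -1: 7},
    "recipe": {1: 5, 4: 6, -1: 8},
}

def position_weight(q, freq):
    """Formula 2 with the example's constants."""
    if q == -1:
        return LAMBDA ** math.sqrt(freq)
    return math.log((M - q) / M + BETA)

def page_score(occurrences):
    """Formula 1: sum every keyword term, then divide by m."""
    total = 0.0
    for keyword, by_position in occurrences.items():
        for q, freq in by_position.items():
            total += (WEIGHTS[keyword] * position_weight(q, freq)
                      * freq * math.log(LENGTHS[keyword] + ALPHA))
    return total / M

score_z = page_score(OCCURRENCES)
```

Because of the exponential weight, the two related-search terms contribute far more to `score_z` than all the body terms combined.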
S306, compare the score of the current search result page with the field conformity threshold of the culinary area; when the score reaches the field conformity threshold of the culinary area, determine that "sharp green pepper shredded meat" belongs to the culinary area.
Continuing the example above, suppose the current search result page scores 120 points and the field conformity threshold of the culinary area is 100 points; then Z, the field word to be identified, belongs to the culinary area.
In the method provided by this embodiment, to determine whether a field word to be identified belongs to field A, the field keyword information of field A and the field conformity threshold of field A must be determined in advance. In one embodiment of the invention, they are determined as follows: the search results returned by the search engine are analyzed block by block; the field keywords of field A and their weights are calculated statistically; the content of the search results is scored using the field keywords and their weights; the distribution of the scores of the search results is then examined to determine the field conformity threshold of field A. This establishes a complete field word identification model, and the field conformity threshold of field A is then used to identify field words to be identified.
Fig. 4 is a structural diagram of the field word identification module. Fig. 5 is a flow chart of the process of building the identification module corresponding to Fig. 4, comprising:
S501, first select N field words of the culinary area, search for these N field words in a search engine, and obtain N search results.
The amount of information in each search result is very large and generally spans multiple pages. In this embodiment, to improve processing speed, the first-page content of each of the N search results is taken as the search result of the corresponding field word.
As can be seen from Fig. 4, the input part of this model comprises a word list and a web crawler. The word list consists of field words already determined to belong to the culinary area. In this embodiment, 3806 dish names (i.e. N = 3806) were collected from the open entries of Sogou as field words of the culinary area, and these dish names were searched on Google as search words. The web crawler is a web page fetching tool; in this embodiment it fetches the first page of each of the 3806 search results as the search result of each search word, thus obtaining 3806 pages.
S502, record, for every sub-result in the search result corresponding to each field word, its content, the region in which it appears, and the position at which it appears.
In this example, this means recording the content, region, and position of the sub-results in each of the 3806 pages.
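One possible record layout for step S502, together with the keyword counting that later steps rely on, is sketched below. The `SubResult` class, its field names, and the use of plain substring counting are assumptions for illustration; the text does not specify a data structure or a matching method.

```python
from dataclasses import dataclass

@dataclass
class SubResult:
    content: str    # cleaned text of the sub-result
    region: str     # "body" or "related" (related-search part)
    position: int   # 1..m within the body; -1 for the related-search part

def count_keywords(sub_results, keywords):
    """Return {keyword: {position: occurrences}} for one search result page."""
    counts = {}
    for sub in sub_results:
        for keyword in keywords:
            n = sub.content.count(keyword)   # non-overlapping substring count
            if n:
                by_position = counts.setdefault(keyword, {})
                by_position[sub.position] = by_position.get(sub.position, 0) + n
    return counts
```

The nested mapping from keyword to position to frequency is exactly the input that the weight and score formulas consume.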
S503, choose M keywords of the culinary area from the N search results to form a keyword set.
In this example, this means choosing M field keywords of the culinary area from the 3806 pages. A field keyword here is a word that occurs together with the 3806 field words in the 3806 pages and reflects the characteristics of the culinary area. By statistics over the content of the 3806 pages, 13 field keywords of the culinary area were finally determined, including "way" and "complete collection of dish names", i.e. M = 13.
S504, calculate the weight of each field keyword according to the parameters of the field keyword.
In this embodiment, the parameters of a field keyword comprise the position at which it occurs in each sub-result and its number of occurrences. The weight is calculated as shown in formula 3 and formula 4:
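$$\mathrm{Weight}(CW_i) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{m} \sum_{q=1}^{m} w_{ijq} \cdot f_{ijq} \cdot \ln(l_i + \alpha) \qquad \text{(Formula 3)}$$
$$w_{ijq} = \begin{cases} \ln\left(\frac{m-q}{m} + \beta\right), & q = y,\ y \in 1 \sim m \\ \lambda^{\sqrt{f_{ijq}}}, & q = -1 \end{cases} \qquad \text{(Formula 4)}$$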
Where: $CW_i$ is the i-th field keyword; $\mathrm{Weight}(CW_i)$ is the weight of field keyword $CW_i$; n is the number of search results; m is the number of sub-results recorded in the body of each search result; q is the position in the search result of the sub-result containing the field keyword; $w_{ijq}$ is the weight of the i-th field keyword at position q in the j-th search result; $f_{ijq}$ is the number of occurrences of the i-th field keyword at position q in the j-th search result; when the field keyword is located in the related-search region of the search result, q = -1; otherwise q = y, y ∈ 1~m, where y is the position, within the body of the search result, of the sub-result containing the field keyword; $l_i$ is the word length of the i-th field keyword; α is an adjustment constant; λ is a weighting constant; β is an adjustment constant.
Formula 3 and formula 4 are one weight formula provided by this embodiment. In other embodiments, other weight formulas may be determined from the parameters of the field keyword; the invention does not limit the concrete form of the weight formula.
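Formulas 3 and 4 can be sketched as a single function over the per-page occurrence counts of one keyword. The function and parameter names are illustrative, and α = 0.1 is an assumption borrowed from the worked scoring example; as in that example, a related-search occurrence (q = -1) is summed together with the body positions.

```python
import math

def position_weight(q, m, freq, beta=2.8, lam=math.e):
    """Formula 4: position weight of a keyword in one search result."""
    if q == -1:
        return lam ** math.sqrt(freq)
    return math.log((m - q) / m + beta)

def keyword_weight(per_result_counts, length, m, alpha=0.1):
    """Formula 3: average, over n search results, of the per-result sums.

    per_result_counts holds one {position: occurrences} dict per search
    result; an empty dict means the keyword did not occur in that result.
    """
    n = len(per_result_counts)
    total = 0.0
    for by_position in per_result_counts:
        inner = sum(position_weight(q, m, f) * f * math.log(length + alpha)
                    for q, f in by_position.items())
        total += inner / m
    return total / n
```

A keyword that occurs early and often across many of the n pages therefore receives a larger weight than one that occurs rarely or only in low-ranked sub-results.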
When calculating the weight of a field keyword, this embodiment takes into account the influence of the position at which the keyword occurs. A field keyword may occur in two regions of a search result: the body part of the search result and the "related search" part.
Formula 4 gives the weight of the i-th field keyword at position q in the j-th search result (this is the weight of the field keyword as affected by position, and must be distinguished from the overall weight of the field keyword). q takes different values depending on where the field keyword occurs in the j-th search result. When the field keyword is in the related-search part of the j-th search result, q = -1; when it is in the body part, q is the position of the sub-result containing the keyword within the body of the j-th search result. For example, if that sub-result is the 2nd in the body of the j-th search result, then q = 2.
According to the ranking principle of search engines, the lower a result is ranked, the less relevant it is to the search term. Therefore, in this example, the earlier the position at which a field keyword occurs in the search results, the more important it is, and the larger its weight should be. This importance does not vary linearly; in this example the function ln((m-q)/m+β) is chosen to represent it, and its curve reflects the variation in importance well. For example, when a search result has 10 sub-results (m = 10), the weight of the first position (here meaning the position weight, not the weight of the whole keyword) is ln((10-1)/10+β) = ln(0.9+β), and the weight of the second position is ln((10-2)/10+β) = ln(0.8+β). In this example β = 2.8, so the argument of the logarithm falls in the interval 2.8~3.7 on the horizontal axis; as shown in Fig. 6, this section of the curve fits the variation of positional importance well.
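As an illustrative calculation with m = 10 and β = 2.8 (values rounded to three decimals), the position weights at the first, second, and last body positions are:

```latex
\begin{aligned}
q = 1:&\quad \ln\left(\tfrac{10-1}{10} + 2.8\right) = \ln 3.7 \approx 1.308 \\
q = 2:&\quad \ln\left(\tfrac{10-2}{10} + 2.8\right) = \ln 3.6 \approx 1.281 \\
q = 10:&\quad \ln\left(\tfrac{10-10}{10} + 2.8\right) = \ln 2.8 \approx 1.030
\end{aligned}
```

The weight thus falls slowly and monotonically, from about 1.308 to about 1.030, as q runs from 1 to 10.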
Because related searches are statistics of the input of a large number of users, a keyword that occurs at the "related search" position is more strongly correlated with the search word, and its importance is closely tied to its word frequency at this position. When a keyword appears at the "related search" position, we use an exponential function to model the relation between its importance and its word frequency there. The exponential function reflects how the importance of this position differs from that of the list positions described above, and taking the square root of the word frequency at this position as the exponent further strengthens that importance. We therefore choose $\lambda^{\sqrt{f_{ijq}}}$ to represent it; see Fig. 7. In this example, λ = e.
S505, score the search result page corresponding to each field word according to the parameters of each field keyword.
In this embodiment, the parameters of a field keyword comprise the position at which it occurs in each sub-result, its number of occurrences, its weight, and its length. The search result page corresponding to each field word can be scored according to formula 5 and formula 6:
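$$\mathrm{Score}(P_x) = \frac{1}{m} \sum_{q=1}^{m} \sum_{i=1}^{k} \mathrm{Weight}(CW_i) \cdot w_{xiq} \cdot f_{xiq} \cdot \ln(l_i + \alpha) \qquad \text{(Formula 5)}$$
$$w_{xiq} = \begin{cases} \ln\left(\frac{m-q}{m} + \beta\right), & q = y,\ y \in 1 \sim m \\ \lambda^{\sqrt{f_{xiq}}}, & q = -1 \end{cases} \qquad \text{(Formula 6)}$$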
Where: $P_x$ is the search result page corresponding to the x-th field word; $\mathrm{Score}(P_x)$ is the classification score of the search result page corresponding to the x-th field word; $\mathrm{Weight}(CW_i)$ is the weight of the i-th field keyword; m is the number of sub-results in the search result page; k is the number of field keywords occurring in the sub-results; $w_{xiq}$ is the weight of the i-th field keyword at position q in the search result page corresponding to the x-th field word; $f_{xiq}$ is the number of occurrences of the i-th field keyword at position q in the search result page corresponding to the x-th field word; when the field keyword is located in the related-search region of the search result, q = -1; otherwise q = y, y ∈ 1~m, where y is the position, within the body of the search result, of the sub-result containing the field keyword; $l_i$ is the word length of the i-th field keyword; α is an adjustment constant; λ is a weighting constant; β is an adjustment constant.
Formula 5 and formula 6 are one score formula provided by this embodiment. In other embodiments, other score formulas may be determined from the parameters of the field keyword; the invention does not limit the concrete form of the score formula.
In this example, 3806 field words and 13 field keywords were chosen for the culinary area. After the weights of the 13 field keywords have been calculated in step S504 using the weight formula, those weights are used to score the 3806 pages according to formula 5 and formula 6.
S506, determine the field conformity threshold of the field corresponding to the field words from the combined scores of the search result pages corresponding to the field words.
The field conformity threshold is determined from the statistics of the scores of the search result pages corresponding to the field words. In this example, the 3806 field words correspond to the culinary area. For instance, after the 3806 pages have been scored in step S505, statistics over their scores show that 99.8% of the 3806 pages score 100 or more, so 100 points is taken as the field conformity threshold of the culinary area.
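One way to read a threshold off the score statistics is sketched below. The text only says that the threshold is determined from the distribution of scores (in the example, 99.8% of pages score at least 100); the specific rule used here, taking the largest score that at least a given fraction of the known field words reach, is an assumption, as are the function and parameter names.

```python
import math

def conformity_threshold(scores, coverage=0.998):
    """Largest threshold t such that at least `coverage` of scores are >= t."""
    ranked = sorted(scores, reverse=True)
    k = math.ceil(coverage * len(ranked))   # how many pages must pass
    return ranked[k - 1]                    # the k-th highest score
```

In practice the resulting value could be rounded down to a convenient figure, which only lets slightly more of the known field words pass.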
As can be seen from Fig. 4, the output of the field word identification module comprises two parts: one is the field keywords of the module and their weights; the other is the field conformity threshold determined by the module.
Once the identification module built by the method of this embodiment is in place, it can identify any field word to be identified, that is, judge whether the word belongs to the field corresponding to the field keywords in the module.
From the method provided by this embodiment it follows that, to identify the field words of any field, such as field X, one first builds the identification module of that field by the method shown in Fig. 5. Once the module is built, it can be used to identify any field word to be identified and determine whether that word belongs to the field corresponding to the module.
Comparing the method of this embodiment with existing field word identification methods shows the following. Existing methods only identify field words within given texts, and depend heavily on the category assigned to the text itself. For example, if a text belongs to the "science and technology" category, all field words extracted from it are classified as "science and technology", even though some of them belong to other categories, because a science and technology text may mention field words of other fields. In contrast to this prior-art text selection, the embodiment uses a search engine to search for the field word to be identified and thereby obtains the corpus used to identify it. Compared with the prior art, the method draws its identification corpus from a wider range and in a more targeted way.
In addition, in existing field word identification methods the accuracy of a field word cannot be confirmed by large-scale statistics. For example, some field words are extracted from a few texts or even a single text, so their membership in the field cannot be guaranteed. In the method of this embodiment, field words are based on a search engine and screened from a large number of search results, so the field words determined by the method have higher accuracy.
Existing field word identification methods extract field words using statistical and grammatical knowledge, which increases the complexity of field word identification. Current methods employ statistical models and grammar rules; these increase the complexity of identification without much improving its accuracy, and their usability is poor. The method of this embodiment obtains a corpus related to the field word to be identified through a search engine, and identifies the word by combining that corpus with field keywords and keyword parameters determined in advance by statistics. This improves identification accuracy while also improving the applicability and practicality of field word identification.
Referring to Fig. 8, one embodiment of the invention also provides a field word identification device, comprising:
An acquiring unit 801, configured to search for the field word to be identified in a search engine, obtain the sub-results in the search results, and record the position at which each sub-result appears;
An analysis unit 802, configured to determine, in combination with predetermined field keyword information, the field keywords occurring in the sub-results of the search results, the field keyword information comprising the field keywords and the weights of the field keywords in the field to which they belong;
In this embodiment, the weight formula may refer to formula 3 and formula 4 and is not repeated here.
A computing unit 803, configured to calculate the score with which the field word to be identified belongs to the field corresponding to the field keywords, according to the weights of the field keywords occurring in the sub-results and the positions and numbers of occurrences of the field keywords in each sub-result;
The formula for the score with which the field word to be identified belongs to the field corresponding to the field keywords may refer to formula 1 and formula 2 and is not repeated here.
An evaluation unit 804, configured to determine, when the score reaches a predetermined field conformity threshold, that the field word to be identified belongs to the field corresponding to the field keywords.
Referring to Fig. 9, the analysis unit 802 comprises:
A search subunit 901, configured to select in advance N field words of a field, search for the N field words in a search engine, and obtain N search results;
A record subunit 902, configured to record the sub-results in each search result and the position at which each sub-result appears;
A first computation subunit 903, configured to choose M field keywords from the N search results and to calculate the weight with which each keyword belongs to the field corresponding to the field keywords, according to the position at which each field keyword occurs in each search result, its number of occurrences, and its length;
The formula for the weight of each field keyword in its corresponding field may refer to formula 3 and formula 4 and is not repeated here.
A first determining subunit 904, configured to determine, in combination with the field keywords, the field keywords occurring in the sub-results of the search results.
Referring to Fig. 10, the evaluation unit 804 comprises:
A second computation subunit 1001, configured to calculate the score with which each search result belongs to the field corresponding to the field keywords, according to the weight of each field keyword and the position and number of occurrences of each field keyword in each search result;
The formula for the score with which each search result belongs to the field corresponding to the field keywords may refer to formula 5 and formula 6 and is not repeated here.
A second determining subunit 1002, configured to determine the field conformity threshold of the field corresponding to the field keywords from the statistics of the scores with which the N search results belong to that field;
A third determining subunit 1003, configured to determine, when the score reaches the field conformity threshold of the field corresponding to the field keywords, that the field word to be identified belongs to that field.
The device provided by this embodiment is based on a search engine. From the search engine's results for the field word to be identified, it determines the field keywords of the field to which the word may belong; it calculates the score with which the word belongs to that field from the predetermined information on those field keywords combined with the search results; it compares the score with the field conformity threshold of the field; and it determines from the comparison whether the word belongs to the field. By exploiting the characteristics of the search engine itself, the device obtains a corpus highly correlated with the field word to be identified, which greatly improves the speed and accuracy of field word identification.
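The cooperation of the four units can be sketched as one function. Everything named here is hypothetical: `search` stands in for the acquiring unit and is assumed to return `(position, text)` pairs with position -1 for the related-search part; substring counting and `len(keyword)` as the word length are simplifications; and the constants are those of the worked example.

```python
import math

def identify(word, search, keyword_weights, threshold,
             m=10, alpha=0.1, beta=2.8, lam=math.e):
    """Return (belongs, score) for one field word to be identified."""
    sub_results = search(word)                        # acquiring unit
    score = 0.0
    for keyword, weight in keyword_weights.items():   # analysis unit
        for q, text in sub_results:
            f = text.count(keyword)
            if f == 0:
                continue
            if q == -1:                               # related-search part
                w = lam ** math.sqrt(f)
            else:                                     # body position q
                w = math.log((m - q) / m + beta)
            # computing unit: one term of formula 1
            score += weight * w * f * math.log(len(keyword) + alpha)
    score /= m
    return score >= threshold, score                  # evaluation unit
```

Passing the search step in as a callable keeps the scoring logic testable without network access, for instance with a stub that returns fixed snippets.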
To verify the accuracy of the method provided by this embodiment, in one embodiment of the invention an existing set of 8,323,460 Chinese entries (including monosyllabic words and some short phrases) was used for testing. The first page of search results for each entry was fetched from Google, and the 8,323,460 pages were tested using the predetermined weight information, threshold, and score formula; after post-processing, 15,372 dish-name words were obtained. 5% of the 15,372 entries were randomly sampled to measure accuracy, with the results shown in Table 1:
Table 1
Draw      Words sampled   Correct   Accuracy
First     768             750       0.9765625
Second    768             742       0.9661458
Third     768             749       0.9752604
Fourth    768             747       0.9726563
Mean      768             747       0.9726563
As Table 1 shows, in this verification test the method provided by this embodiment achieved an average accuracy of more than 97%.
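The figures in Table 1 can be checked with a few lines of arithmetic. The variable names are illustrative; the sample sizes and correct counts are those of the four draws in the table.

```python
# (words sampled, words judged correct) for the four random draws in Table 1
draws = [(768, 750), (768, 742), (768, 749), (768, 747)]

accuracies = [correct / sampled for sampled, correct in draws]
mean_correct = sum(correct for _, correct in draws) / len(draws)
mean_accuracy = sum(accuracies) / len(accuracies)
```

Since every draw has the same sample size, the mean of the four accuracies equals the mean correct count divided by 768.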
The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the invention.

Claims (13)

1. A field word identification method, characterized by comprising:
Searching for a field word to be identified in a search engine, obtaining the sub-results in the search results, and recording the position at which each sub-result appears;
Determining, in combination with predetermined field keyword information, the field keywords occurring in the sub-results of the search results, the field keyword information comprising the field keywords and the weights of the field keywords in the field to which they belong;
Calculating, according to the parameters of the field keywords, the score with which the field word to be identified belongs to the field corresponding to the field keywords, the parameters of a field keyword comprising the position at which the field keyword occurs in each sub-result and its number of occurrences;
Comparing the score with a predetermined field conformity threshold, and determining, according to the comparison result, that the field word to be identified belongs to the field corresponding to the field keywords.
2. The method according to claim 1, characterized in that the parameters of the field keyword further comprise: the length of the field keyword.
3. The method according to claim 2, characterized in that the parameters of the field keyword further comprise: the weight of the field keyword.
4. The method according to claim 3, characterized in that calculating, according to the parameters of each field keyword, the score with which the field word to be identified belongs to the field corresponding to the field keywords comprises:
Calculating the score with which the field word to be identified belongs to the field corresponding to the field keywords according to the following formulas:
$$\mathrm{Score}(P) = \frac{1}{m} \sum_{q=1}^{m} \sum_{i=1}^{k} \mathrm{Weight}(CW_i) \cdot w_{iq} \cdot f_{iq} \cdot \ln(l_i + \alpha),$$
$$w_{iq} = \begin{cases} \ln\left(\frac{m-q}{m} + \beta\right), & q = y,\ y \in 1 \sim m \\ \lambda^{\sqrt{f_{iq}}}, & q = -1 \end{cases},$$
Where: P is the search result corresponding to the field word to be identified; Score(P) is the score with which the field word to be identified belongs to the field corresponding to the field keywords; $\mathrm{Weight}(CW_i)$ is the weight of the i-th field keyword; m is the number of recorded sub-results; k is the number of field keywords occurring in the sub-results; $w_{iq}$ is the weight of the i-th field keyword at position q in the search results; $f_{iq}$ is the number of occurrences of the i-th field keyword at position q; $l_i$ is the word length of the i-th field keyword; α is the word-length adjustment constant; λ is a weighting constant; β is an adjustment constant; when the field keyword is located in the related-search region of the search results, q = -1; otherwise q = y, y ∈ 1~m, where y is the position, within the body of the search results, of the sub-result containing the field keyword.
5. The method according to any one of claims 1 to 3, characterized in that the process of determining the field keyword information comprises:
Selecting N field words of a field, searching for the N field words in a search engine, and obtaining N search results;
Recording the sub-results in each search result and the position at which each sub-result appears;
Choosing M field keywords from the N search results, and calculating, according to the parameters of each field keyword, the weight with which each field keyword belongs to the field corresponding to the field keywords.
6. The method according to claim 5, characterized in that, when the parameters of the field keyword comprise the position at which each field keyword occurs in each search result, its number of occurrences, and the length of the keyword, calculating, according to the parameters of the field keyword, the weight with which each field keyword belongs to the field corresponding to the field keywords comprises:
Calculating the weight with which each field keyword belongs to its corresponding field according to the following formulas:
$$\mathrm{Weight}(CW_i) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{m} \sum_{q=1}^{m} w_{ijq} \cdot f_{ijq} \cdot \ln(l_i + \alpha),$$
$$w_{ijq} = \begin{cases} \ln\left(\frac{m-q}{m} + \beta\right), & q = y,\ y \in 1 \sim m \\ \lambda^{\sqrt{f_{ijq}}}, & q = -1 \end{cases},$$
Where: $CW_i$ is the i-th field keyword; $\mathrm{Weight}(CW_i)$ is the weight of field keyword $CW_i$; n is the number of search results; m is the number of sub-results recorded in the body of each search result; q is the position in the search result of the sub-result containing the field keyword; $w_{ijq}$ is the weight of the i-th field keyword at position q in the j-th search result; $f_{ijq}$ is the number of occurrences of the i-th field keyword at position q in the j-th search result; when the field keyword is located in the related-search region of the search result, q = -1; otherwise q = y, y ∈ 1~m, where y is the position, within the body of the search result, of the sub-result containing the field keyword; $l_i$ is the word length of the i-th field keyword; α is an adjustment constant; λ is a weighting constant; β is an adjustment constant.
7. The method according to any one of claims 1 to 3, characterized in that the process of determining the field conformity threshold comprises:
Calculating, according to the parameters of each field keyword, the score with which each search result belongs to the field corresponding to the field keywords;
Determining the field conformity threshold of the field corresponding to the field keywords from the statistics of the scores with which the search results belong to that field.
8. The method according to claim 7, characterized in that, when the parameters of the field keyword comprise the position at which each field keyword occurs in each search result, its number of occurrences, the weight of the field keyword, and the length of the keyword, calculating, according to the parameters of the field keyword, the score with which each search result belongs to the field corresponding to the field keywords comprises:
Calculating the score with which each search result belongs to the field corresponding to the field keywords according to the following formulas:
$$\mathrm{Score}(P_x) = \frac{1}{m} \sum_{q=1}^{m} \sum_{i=1}^{k} \mathrm{Weight}(CW_i) \cdot w_{xiq} \cdot f_{xiq} \cdot \ln(l_i + \alpha),$$
$$w_{xiq} = \begin{cases} \ln\left(\frac{m-q}{m} + \beta\right), & q = y,\ y \in 1 \sim m \\ \lambda^{\sqrt{f_{xiq}}}, & q = -1 \end{cases},$$
Where: $P_x$ is the search result page corresponding to the x-th field word; $\mathrm{Score}(P_x)$ is the classification score of the search result page corresponding to the x-th field word; $\mathrm{Weight}(CW_i)$ is the weight of the i-th field keyword; m is the number of sub-results in the search result page; k is the number of field keywords occurring in the sub-results; $w_{xiq}$ is the weight of the i-th field keyword at position q in the search result page corresponding to the x-th field word; $f_{xiq}$ is the number of occurrences of the i-th field keyword at position q in the search result page corresponding to the x-th field word; when the field keyword is located in the related-search region of the search result, q = -1; otherwise q = y, y ∈ 1~m, where y is the position, within the body of the search result, of the sub-result containing the field keyword; $l_i$ is the word length of the i-th field keyword; α is an adjustment constant; λ is a weighting constant; β is an adjustment constant.
9. a field word recognition device is characterized in that, comprising:
Acquiring unit is used at search engine search field to be identified word, obtains the sub-result in the Search Results and record each height result the position to occur;
Analytic unit is used for determining the field keyword that the sub-result of described Search Results occurs in conjunction with predetermined field key word information that described field key word information comprises the weights in field keyword and this field keyword field under it;
Computing unit, the described field to be identified of the calculation of parameter word that is used for the field keyword that occurs according to described sub-result belongs to the score in the corresponding field of described field keyword, and the parameter of described field keyword comprises position and the occurrence number that described field keyword occurs in each height result;
Evaluation unit is used for comparing in described score and predetermined field degree of conformity threshold value, determines that according to comparative result described field to be identified word belongs to field corresponding to keyword, described field.
10. device according to claim 9 is characterized in that, the parameter of described field keyword also comprises: the length of described field keyword.
11. device according to claim 10 is characterized in that, the parameter of described field keyword also comprises: the weights of described field keyword.
12. The device according to any one of claims 9 to 11, characterized in that the analysis unit comprises:
A search subunit, configured to select N field words of a field in advance and search for the N field words in a search engine to obtain N search results;
A recording subunit, configured to record the sub-results in each search result and the position at which each sub-result occurs;
A first computing subunit, configured to choose M field keywords from the N search results and to calculate, according to the parameters of each field keyword, the weight with which each field keyword belongs to its corresponding field;
A first determining subunit, configured to determine, in combination with the field keywords, the field keywords that occur in the sub-results of the search results.
13. The device according to any one of claims 9 to 11, characterized in that the evaluation unit comprises:
A second computing subunit, configured to compute, from the parameters of each field keyword, the score with which each search result belongs to the field corresponding to the field keywords;
A second determining subunit, configured to determine the field conformity threshold of the field corresponding to the field keywords from the computed scores with which the search results belong to that field;
A third determining subunit, configured to determine that the field word to be identified belongs to the field corresponding to the field keywords when the score with which the field word to be identified belongs to that field reaches the field conformity threshold.
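
By way of illustration only, the cooperation of the computing unit and the evaluation unit of claims 9 to 13 may be sketched in Python as follows. The sketch is not the claimed formula: the names SubResult, position_weight, field_score, conformity_threshold and belongs_to_field are hypothetical, the position weights (0.5 for the related-search region, 1/position otherwise) are assumed values, the score is a simplified weighted sum of keyword weight, keyword length, position weight and occurrence count, and taking the lowest score of the known field words as the threshold is likewise an assumption.

```python
from dataclasses import dataclass

RELATED_SEARCH = -1  # position code the claims use for the related-search region


@dataclass
class SubResult:
    position: int  # 1-based rank in the result list, or RELATED_SEARCH
    text: str      # text of the sub-result (for example title plus snippet)


def position_weight(position: int) -> float:
    """Hypothetical position weight: earlier sub-results count more."""
    if position == RELATED_SEARCH:
        return 0.5  # assumed fixed weight for the related-search region
    return 1.0 / position


def field_score(sub_results, keyword_weights):
    """Score one search result page against one field.

    keyword_weights maps each field keyword to its weight in the field.
    The sum is a simplified stand-in for the formula of the claims.
    """
    score = 0.0
    for sub in sub_results:
        for keyword, weight in keyword_weights.items():
            count = sub.text.count(keyword)  # occurrences in this sub-result
            if count:
                score += (weight * len(keyword)
                          * position_weight(sub.position) * count)
    return score


def conformity_threshold(known_scores):
    """Assumed rule: the lowest score among the known field words."""
    return min(known_scores)


def belongs_to_field(sub_results, keyword_weights, threshold):
    """The word belongs to the field when its score reaches the threshold."""
    return field_score(sub_results, keyword_weights) >= threshold
```

In use, the scores of the N known field words would be passed to conformity_threshold once per field, and belongs_to_field would then be called with the sub-results obtained for each field word to be identified.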
CN 200910241287 2009-11-27 2009-11-27 Field word identification method and device Active CN102081601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200910241287 CN102081601B (en) 2009-11-27 2009-11-27 Field word identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200910241287 CN102081601B (en) 2009-11-27 2009-11-27 Field word identification method and device

Publications (2)

Publication Number Publication Date
CN102081601A CN102081601A (en) 2011-06-01
CN102081601B true CN102081601B (en) 2013-01-09

Family

ID=44087569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200910241287 Active CN102081601B (en) 2009-11-27 2009-11-27 Field word identification method and device

Country Status (1)

Country Link
CN (1) CN102081601B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049436B (en) * 2011-10-12 2015-11-25 北京百度网讯科技有限公司 Obtain method and device, the method and system of generation translation model, the method and system of mechanical translation of language material
CN103136256B (en) * 2011-11-30 2016-08-03 阿里巴巴集团控股有限公司 One realizes method for information retrieval and system in a network
CN102609500A (en) * 2012-02-01 2012-07-25 北京百度网讯科技有限公司 Question push method, question answering system using same and search engine
CN103258053B (en) * 2013-05-31 2018-01-26 深圳市宜搜科技发展有限公司 The extracting method and system of a kind of domain feature words
CN103559253A (en) * 2013-10-31 2014-02-05 北京奇虎科技有限公司 Related vertical resource search method and equipment
CN104504037B (en) * 2014-12-15 2018-07-06 深圳市宜搜科技发展有限公司 Entity word temperature calculates method and device
CN105528404A (en) * 2015-12-03 2016-04-27 北京锐安科技有限公司 Establishment method and apparatus of seed keyword dictionary, and extraction method and apparatus of keywords
CN105630975B (en) * 2015-12-24 2020-10-27 联想(北京)有限公司 Information processing method and electronic equipment
CN108052503B (en) * 2017-12-26 2021-04-27 北京奇艺世纪科技有限公司 Confidence coefficient calculation method and device
CN111737560B (en) * 2020-07-20 2021-01-08 平安国际智慧城市科技股份有限公司 Content search method, field prediction model training method, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005141428A (en) * 2003-11-05 2005-06-02 Nippon Telegr & Teleph Corp <Ntt> Word string extracting method and device, and recording medium with word string extracting program recorded
CN101246502A (en) * 2008-03-27 2008-08-20 腾讯科技(深圳)有限公司 Method and system for searching pictures in network
CN101290626A (en) * 2008-06-12 2008-10-22 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge

Also Published As

Publication number Publication date
CN102081601A (en) 2011-06-01

Similar Documents

Publication Publication Date Title
CN102081601B (en) Field word identification method and device
Waitelonis et al. Linked data enabled generalized vector space model to improve document retrieval
JP3882048B2 (en) Question answering system and question answering processing method
CN103020164B (en) Semantic search method based on multi-semantic analysis and personalized sequencing
CN105138558B (en) The real time individual information collecting method of content is accessed based on user
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN101630327A (en) Design method of theme network crawler system
CN105302793A (en) Method for automatically evaluating scientific and technical literature novelty by utilizing computer
CN105045875B (en) Personalized search and device
CN101609450A (en) Web page classification method based on training set
CN107807987A (en) A kind of string sort method, system and a kind of string sort equipment
CN107992542A (en) A kind of similar article based on topic model recommends method
CN106156204A (en) The extracting method of text label and device
CN101751455B (en) Method for automatically generating title by adopting artificial intelligence technology
CN104636465A (en) Webpage abstract generating methods and displaying methods and corresponding devices
CN106372061A (en) Short text similarity calculation method based on semantics
CN103077190A (en) Hot event ranking method based on order learning technology
CN103049470A (en) Opinion retrieval method based on emotional relevancy
CN105975596A (en) Query expansion method and system of search engine
CN102567308A (en) Information processing feature extracting method
CN104484380A (en) Personalized search method and personalized search device
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN101887415B (en) Automatic extraction method for text document theme word meaning
Man Feature extension for short text categorization using frequent term sets
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: BEIJING KINGSOFT OFFICE SOFTWARE CO., LTD.

Free format text: FORMER OWNER: BEIJING JINSHAN SOFTWARE CO., LTD.

Effective date: 20140312

Free format text: FORMER OWNER: BEIJING JINSHAN DIGITAL ENTERTAINMENT SCIENCE AND TECHNOLOGY CO., LTD.

Effective date: 20140312

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20140312

Address after: Kingsoft No. 33 building, 100085 Beijing city Haidian District Xiaoying Road

Patentee after: Beijing Kingsoft WPS Office Co., Ltd.

Address before: Kingsoft 33 Building No. 100085 Beijing Haidian District City 1 Xiaoying Road West

Patentee before: Beijing Jinshan Software Co., Ltd.

Patentee before: Beijing Jinshan Digital Entertainment Science and Technology Co., Ltd.

C56 Change in the name or address of the patentee
CP01 Change in the name or title of a patent holder

Address after: Kingsoft No. 33 building, 100085 Beijing city Haidian District Xiaoying Road

Patentee after: Beijing Kingsoft Office Software Inc

Address before: Kingsoft No. 33 building, 100085 Beijing city Haidian District Xiaoying Road

Patentee before: Beijing Kingsoft WPS Office Co., Ltd.