CN101599075B - Chinese abbreviation processing method and device therefor - Google Patents

Chinese abbreviation processing method and device therefor

Info

Publication number
CN101599075B
Authority
CN
China
Prior art keywords
abbreviation
similarity
candidate
phrase
screening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009100883776A
Other languages
Chinese (zh)
Other versions
CN101599075A (en
Inventor
谢丽星
孙茂松
佟子健
王灿辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Original Assignee
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Beijing Sogou Technology Development Co Ltd
Priority to CN2009100883776A
Publication of CN101599075A
Application granted
Publication of CN101599075B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a Chinese abbreviation processing method and a corresponding device, belonging to the field of text information processing. The method comprises: preprocessing all query terms in a user query log; aggregating the query terms in the preprocessed query log that point to the same directory of the same website into one group, to obtain a plurality of groups; and, for the query terms in each group: generating, according to a word alignment rule, a plurality of candidate pairs that match a source phrase with an abbreviation within the group; for each candidate pair, filtering out a place name in the source phrase if the source phrase contains the place name and the abbreviation has no morpheme corresponding to it; and screening the filtered result within the group according to preset rules, to obtain the set of source phrase and abbreviation pairs in the group. The device comprises a preprocessing module, a candidate pair generation module, a filtering module, and a screening module. The invention uses a user query log to mine Chinese abbreviations, which improves the timeliness and accuracy of the source phrase and abbreviation pairs.

Description

Chinese abbreviation processing method and device
Technical field
The present invention relates to the field of text information processing, and in particular to a Chinese abbreviation processing method and device.
Background art
An abbreviation is a word formed from a fixed expression in a language by compression, omission, or summarization. The economy principle of natural language gives rise to abbreviations: shortening an expression makes it more concise, as when "Peking University" (北京大学) is abbreviated to "Beida" (北大). Abbreviations are very common in natural language and account for a large proportion of new words.
Because abbreviations are used so heavily, they are a main source of out-of-vocabulary new words in natural language processing. They seriously hinder machine processing of Chinese in tasks such as word segmentation, part-of-speech tagging, word sense determination and disambiguation, named entity recognition, and entity coreference resolution. In addition, because the full form and the abbreviated form differ on the surface, abbreviations also affect applications such as information retrieval, keyword extraction, machine translation, and question answering. For example, a search that uses "Peking University" as the query may miss texts that contain only "Beida", and vice versa. Abbreviation processing is therefore important basic work in natural language processing.
Because abbreviations are formed in complex ways and new words emerge constantly, few Chinese abbreviation dictionaries exist at present. They are mainly compiled by experts from personal knowledge, can hardly be exhaustive, and are updated slowly. Chinese abbreviations are widely used; studies have shown that nearly 20% of the sentences in headlines use abbreviations. Because abbreviations are concise, they are also increasingly popular in daily life and on the Internet, so research on Chinese abbreviation recognition is particularly urgent and important.
After analyzing the prior art, the inventors found that it has at least the following shortcomings: the corpora that the prior art uses to recognize Chinese abbreviations mostly do not come from a real usage environment, are small in scale, and have poor timeliness; some approaches also require manual intervention; and the accuracy of the experimental results is low.
Summary of the invention
The embodiments of the invention provide a Chinese abbreviation processing method and device. The technical solution is as follows:
A Chinese abbreviation processing method, applied to retrieval in a search engine, comprising:
preprocessing all query terms in a user query log;
aggregating the query terms in the preprocessed query log that point to the same directory of the same website into one group, to obtain a plurality of groups;
for the query terms in each group, performing:
generating, according to a word alignment rule, a plurality of candidate pairs that match a source phrase with an abbreviation within the group;
for each candidate pair, if the source phrase contains a place name and the abbreviation contains no morpheme corresponding to that place name, filtering out the place name in the source phrase;
screening the filtered result within the group according to preset rules, to obtain the set of source phrase and abbreviation pairs in the group, which specifically comprises:
in the filtered result within the group, removing the candidate pairs that contain a personal name and keeping those that do not, to obtain the result of the first screening;
in the result of the first screening, keeping the candidate pairs in which the first and last words of the source phrase have corresponding morphemes in the abbreviation, to obtain the result of the second screening;
in the result of the second screening, screening the candidate pairs according to web page link similarity, co-occurrence similarity, and text similarity, to obtain the set of source phrase and abbreviation pairs in the group.
A Chinese abbreviation processing device, applied to retrieval in a search engine, comprising:
a preprocessing module, configured to preprocess all query terms in a user query log;
a related-term aggregation module, configured to aggregate the query terms in the preprocessed query log that point to the same directory of the same website into one group, to obtain a plurality of groups;
a candidate pair generation module, configured to perform, for the query terms in each group: generating, according to a word alignment rule, a plurality of candidate pairs that match a source phrase with an abbreviation within the group;
a filtering module, configured to, for each candidate pair, filter out a place name in the source phrase if the source phrase contains the place name and the abbreviation contains no morpheme corresponding to it;
a screening module, configured to screen the filtered result within the group according to preset rules, to obtain the set of source phrase and abbreviation pairs in the group, and specifically comprising:
a first submodule, configured to remove, from the filtered result within the group, the candidate pairs that contain a personal name and keep those that do not, to obtain the result of the first screening;
a second submodule, configured to keep, in the result of the first screening, the candidate pairs in which the first and last words of the source phrase have corresponding morphemes in the abbreviation, to obtain the result of the second screening;
a third submodule, configured to screen, in the result of the second screening, the candidate pairs according to web page link similarity, co-occurrence similarity, and text similarity, to obtain the set of source phrase and abbreviation pairs in the group.
The embodiments of the invention use a user query log to mine Chinese abbreviations and, through a series of filtering and screening steps, quickly obtain a good result set of abbreviation and source phrase pairs from a real corpus, which improves the timeliness and accuracy of the abbreviation and source phrase pairs.
Brief description of the drawings
Fig. 1 is a flowchart of the Chinese abbreviation processing method of an embodiment of the invention;
Fig. 2 is a flowchart of the method, in an embodiment of the invention, for screening the filtered result within the group according to preset rules;
Fig. 3 is a flowchart of the method, in an embodiment of the invention, for screening the candidate pairs in the result of the second screening according to web page link similarity, co-occurrence similarity, and text similarity;
Fig. 4 is a schematic diagram of the Chinese abbreviation processing device of an embodiment of the invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings.
Embodiment one
This embodiment of the invention provides a Chinese abbreviation processing method which, as shown in Fig. 1, comprises:
110: Preprocess all query terms in the user query log.
Noise query terms are removed from the user query log. Here, noise query terms are mainly query terms that contain foreign-language characters or garbled characters. The preprocessing also filters out digits, full-width letters, punctuation marks, spaces, and the like from the query terms.
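A minimal sketch of step 110 in Python, under stated assumptions: "Chinese" is taken to mean the basic CJK Unified Ideographs block, punctuation is detected by Unicode category, and a query is treated as noise if anything other than Chinese characters remains after stripping. The function names are hypothetical and are not taken from the patent.

```python
import re
import unicodedata
from typing import Optional

# Digits (half- and full-width), full-width Latin letters, and whitespace.
_STRIP = re.compile(r"[0-9０-９Ａ-Ｚａ-ｚ\s]")

def _is_chinese(ch: str) -> bool:
    # Assumption: "Chinese" means the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"

def preprocess(query: str) -> Optional[str]:
    """Return the cleaned query, or None if the query is treated as noise."""
    cleaned = _STRIP.sub("", query)
    # Drop punctuation: Unicode general categories starting with "P".
    cleaned = "".join(ch for ch in cleaned
                      if not unicodedata.category(ch).startswith("P"))
    # Anything left that is not a Chinese character (foreign letters,
    # garbled bytes) marks the whole query as noise.
    if not cleaned or not all(_is_chinese(ch) for ch in cleaned):
        return None
    return cleaned
```

With this sketch, a query such as "北京大学 2009！" is reduced to "北京大学", while a query that contains half-width Latin letters is discarded.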
120: Aggregate the query terms in the preprocessed query log that point to the same directory of the same website into one group, to obtain a plurality of groups; for the query terms in each group, perform steps 130, 140, and 150.
After the preprocessing of the previous step, most of the remaining query entries are normal Chinese query terms. A record in the query log generally contains a query term, a URL (Uniform Resource Locator, also called a web page address) corresponding to that query term, and the number of user clicks on that URL. The query terms in the preprocessed query log that point to the same directory of the same website are aggregated into one group, which yields a plurality of groups. A query term that points to the same directory of the same website is one whose corresponding URL in the query log points to that directory; such query terms are closely related. For example, the query term "Peking University" corresponds to the URL www.pku.edu.cn, and "Beida" also corresponds to www.pku.edu.cn, so "Peking University" and "Beida" are highly correlated.
Identical URLs point to the same directory of the same website. For different URLs, this embodiment determines whether they point to the same directory of the same website as follows:
A URL generally begins with http://. The part of the address after http:// and before the third "/" is kept. For example, for http://sports.sohu.com/nba.shtml only sports.sohu.com is kept. If the URL contains only three "/" characters and nothing follows the third one, the content after http:// is kept; for example, for http://www.sohu.com/ only www.sohu.com is kept. If the kept results of different URLs are identical, the URLs are considered to point to the same directory of the same website.
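The URL rule and the grouping of step 120 can be sketched as follows. This is an illustrative sketch only: the record layout `(query, url, clicks)` and the function names are assumptions, not part of the patent.

```python
from collections import defaultdict

def site_key(url: str) -> str:
    """Keep the part after 'http://' and before the third '/'."""
    rest = url.split("://", 1)[-1]   # drop the scheme if present
    return rest.split("/", 1)[0]     # cut at the next "/" (the third one overall)

def group_queries(records):
    """Group query terms by site key.

    records: iterable of (query, url, clicks) tuples (hypothetical layout).
    Returns a dict mapping each site key to the set of queries that point to it.
    """
    groups = defaultdict(set)
    for query, url, _clicks in records:
        groups[site_key(url)].add(query)
    return dict(groups)
```

For the examples in the text, `site_key("http://sports.sohu.com/nba.shtml")` gives `sports.sohu.com` and `site_key("http://www.sohu.com/")` gives `www.sohu.com`.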
Each of the groups obtained by aggregating the query terms is taken in turn as the current group, and steps 130, 140, and 150 are performed on it.
130: Generate, according to the word alignment rule, a plurality of candidate pairs that match a source phrase with an abbreviation within the group.
The word alignment rule is: (1) the term with fewer characters is taken to be the abbreviation and the term with more characters is taken to be the source phrase; (2) every character of the abbreviation appears, in order, in the source phrase. Thus, if every character of query term A appears in order in query term B and A has fewer characters than B, the two are selected as a candidate pair of this embodiment, in which A is the abbreviation and B is the source phrase.
In this embodiment, all candidate pairs that match a source phrase with an abbreviation in the current group are generated according to the word alignment rule. For example, suppose that one of the groups obtained in step 120 contains the five query terms "Beida", "Beida student", "Peking University", "Peking University student", and "Peking University enrollment". The candidate pairs extracted by the word alignment rule, which is applied to the Chinese characters of each term, are then: (Beida, Beida student), (Beida, Peking University), (Beida, Peking University student), (Beida student, Peking University student), (Peking University, Peking University student), (Beida, Peking University enrollment), (Peking University, Peking University enrollment), and (Beida student, Peking University enrollment), where each pair of parentheses encloses one candidate pair.
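The word alignment rule amounts to an ordered-subsequence test on characters. A minimal sketch, with hypothetical function names, and with Chinese strings in the usage note that are assumed renderings of the example terms:

```python
from itertools import permutations

def is_ordered_subsequence(short: str, long: str) -> bool:
    """True if every character of `short` appears in `long`, in order."""
    remaining = iter(long)
    # `ch in remaining` advances the iterator, which enforces the ordering.
    return all(ch in remaining for ch in short)

def candidate_pairs(queries):
    """Return (abbreviation, source_phrase) pairs within one group."""
    return [
        (a, b)
        for a, b in permutations(queries, 2)
        if len(a) < len(b) and is_ordered_subsequence(a, b)
    ]
```

For the group `["北大", "北京大学", "北京大学生"]`, this yields the pairs (北大, 北京大学), (北大, 北京大学生), and (北京大学, 北京大学生).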
140: For each candidate pair, if the source phrase contains a place name and the abbreviation contains no morpheme corresponding to that place name, filter out the place name in the source phrase.
After the candidate pairs are extracted, place names in them are filtered selectively, because place names can interfere with abbreviation recognition. If the source phrase contains a place name and the abbreviation contains no morpheme corresponding to it, the place name is filtered out of the source phrase. For example, in the candidate pair ("City No. 1 Middle", "Shenyang City No. 1 Middle School"), "Shenyang" in the source phrase has no corresponding morpheme in the abbreviation, and "City" contributes nothing to recognizing the abbreviation, so the pair is reduced to ("No. 1 Middle", "No. 1 Middle School"). Likewise, the pair ("Shenyang No. 1 Middle", "Liaoning Province Shenyang City No. 1 Middle School") should also be reduced to ("No. 1 Middle", "No. 1 Middle School"). For the candidate pair ("Beida", "Peking University"), by contrast, the place name "Beijing" (the "Peking" of the source phrase) is kept even though the abbreviation does not contain "Beijing" in full, because the abbreviation contains "Bei" (北), which corresponds to it; this pair is left unchanged.
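A rough sketch of the place-name rule as it is literally stated in step 140. Several things here are assumptions rather than the patent's method: the small gazetteer `PLACE_NAMES`, the Chinese strings used for the examples, and the approximation of "corresponding morpheme" by a shared character. The sketch only removes the place name from the source phrase; it does not reproduce the further reduction in the description's example, where the administrative marker "City" is also dropped.

```python
# Hypothetical gazetteer; a real system would use a full place-name list.
PLACE_NAMES = ("沈阳", "北京", "辽宁")

def filter_place_names(abbr: str, source: str) -> str:
    """Remove place names from `source` that have no counterpart in `abbr`."""
    for place in PLACE_NAMES:
        # Assumption: a morpheme "corresponds" if the abbreviation shares
        # at least one character with the place name.
        has_morpheme = any(ch in abbr for ch in place)
        if place in source and not has_morpheme:
            source = source.replace(place, "")
    return source
```

With these assumptions, the source phrase 沈阳市第一中学 paired with 市一中 loses 沈阳, while 北京大学 paired with 北大 is left unchanged because 北 appears in the abbreviation.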
150: Screen the filtered result within the group according to preset rules, to obtain the set of source phrase and abbreviation pairs in the group.
Referring to Fig. 2, step 150 specifically comprises the following steps:
210: In the filtered result within the group, remove the candidate pairs that contain a personal name and keep those that do not, to obtain the result of the first screening.
After the filtering of step 140, the candidate pairs of the current group that contain a personal name are removed and those that do not are kept, which gives the result of the first screening. Because a phrase that contains a personal name is usually not an abbreviation, such pairs are removed directly in this step. For example, in the candidate pair ("Wang Wei", "Wang Wei's elder brother"), "Wang Wei" is a personal name, so the pair is removed from the candidate pairs of the current group.
In addition, a single abbreviation among the candidate pairs often corresponds to more than three source phrases; the abbreviation "Beida" in the candidate pairs of step 130 is an example. This embodiment therefore uses the click counts of the query terms in the user query log to reduce such one-to-many correspondences to at most "one to three", that is, one abbreviation corresponds to at most three source phrases. Specifically, among the source phrases that correspond to the same abbreviation, the three with the most user clicks are chosen.
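The "one to three" reduction can be sketched as below. The function name and the `clicks` mapping (source phrase to click count) are hypothetical; personal-name detection is not shown because the text does not say how names are recognized.

```python
from collections import defaultdict

def keep_top_sources(pairs, clicks, limit=3):
    """Keep at most `limit` source phrases per abbreviation, by click count.

    pairs:  list of (abbreviation, source_phrase) tuples
    clicks: dict mapping a source phrase to its user click count
    """
    by_abbr = defaultdict(list)
    for abbr, source in pairs:
        by_abbr[abbr].append(source)
    kept = []
    for abbr, sources in by_abbr.items():
        # Most-clicked source phrases first; unknown phrases count as 0.
        ranked = sorted(sources, key=lambda s: clicks.get(s, 0), reverse=True)
        kept.extend((abbr, s) for s in ranked[:limit])
    return kept
```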
220: In the result of the first screening, keep the candidate pairs in which the first and last words of the source phrase have corresponding morphemes in the abbreviation, to obtain the result of the second screening.
The abbreviations in the candidate pairs are formed in many different ways and match their source phrases to different degrees. In candidate pairs where the first and last words of the source phrase have corresponding morphemes in the abbreviation, the abbreviation matches the source phrase more closely. Such candidate pairs fall into three types: (1) morpheme composition type: each word of the source phrase corresponds to a morpheme of the abbreviation, e.g. "Peking University" and "Beida"; (2) mixed type: the abbreviation is formed by a mixed method and, compared with the source phrase, does not drop any word, e.g. "setting-up exercises to radio music" and "setting-up exercises to music"; (3) deletion type: some words in the middle are deleted, each remaining word contributes one character, and the first and last words are not dropped, e.g. "the People's Republic of China" and "China".
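A minimal sketch of the head-and-tail check of step 220. It assumes that the source phrase has already been segmented into words by some external segmenter, and it approximates "has a corresponding morpheme" by "shares at least one character"; both are assumptions for illustration, as are the Chinese strings in the usage note.

```python
def head_tail_covered(abbr: str, source_words) -> bool:
    """True if the first and last words of the source phrase each share
    at least one character with the abbreviation."""
    head, tail = source_words[0], source_words[-1]
    return (any(ch in abbr for ch in head)
            and any(ch in abbr for ch in tail))
```

Under these assumptions, 北大 against the segmented phrase `["北京", "大学"]` passes, while 北大 against `["北京", "大学", "招生"]` fails because the last word contributes nothing to the abbreviation.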
230: In the result of the second screening, screen the candidate pairs according to web page link similarity, co-occurrence similarity, and text similarity, to obtain the set of source phrase and abbreviation pairs in the group.
Regarding web page link similarity: because an abbreviation and its source phrase have the same meaning, retrieving each of them with a search engine is expected to return roughly the same content, which shows up as similar URL links. Regarding co-occurrence similarity: an abbreviation and its source phrase usually co-occur, for example when the abbreviation is used in a title and the source phrase in the body text, so the two appear close to each other and have a high co-occurrence frequency. When both are submitted to a search engine, they may appear in the same result summary. The stronger the semantic relation between the abbreviation and the source phrase, the more often they are likely to co-occur. Regarding text similarity: because the two are semantically similar, the text content retrieved for each of them with a search engine is likely to be similar; the results may not come from the same URL, but they may be the same article or cover the same topic. The higher the text similarity, the better the abbreviation and the source phrase match.
Referring to Fig. 3, each candidate pair in the result of the second screening is taken in turn as the current candidate pair, and the following steps are performed:
310: Calculate the web page link similarity, the co-occurrence similarity, and the text similarity of the current candidate pair.
The web page link similarity is calculated as follows:
(1) The abbreviation and the source phrase of the current candidate pair are each submitted to a search engine as a query, and from all the results returned, a first preset number of search results for the abbreviation and a first preset number of search results for the source phrase are taken. In this embodiment, the top 20 results for the abbreviation and the top 20 results for the source phrase are taken. If the search engine returns fewer than 20 results, the number actually returned is used as the first preset number.
(2) A similarity count is initialized (to 0 in this embodiment), and each of the search results taken for the abbreviation is treated in turn as the current result, for which the following is performed:
The web page link of the current result is compared in turn with the web page links of the search results taken for the source phrase; when an identical link is found, the comparison stops and the similarity count is incremented by 1. In this embodiment, only the first level of each web page link is compared, that is, the part of the address before the third "/" as described in step 120.
After every search result taken for the abbreviation has been compared with the search results taken for the source phrase, the web page link similarity is calculated from the similarity count and the first preset number by the following formula:
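[Formula image GSB00000639575400061: the web page link similarity $P_1$, calculated from the similarity count countA and the first preset number]
where $P_1$ is the web page link similarity and countA is the similarity count.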
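The counting procedure above can be sketched as follows. Because the formula image is not reproduced in this text, the final ratio (similarity count divided by the number of abbreviation results taken) is an assumption chosen so that the value lies between 0 and 1; the function names and the example URLs in the test are likewise hypothetical.

```python
def site_key(url: str) -> str:
    """First level of a link: the part after 'http://' and before the third '/'."""
    return url.split("://", 1)[-1].split("/", 1)[0]

def link_similarity(abbr_urls, source_urls, limit=20):
    """Share of the abbreviation's top results whose first-level link also
    appears among the source phrase's top results."""
    abbr_keys = [site_key(u) for u in abbr_urls[:limit]]
    source_keys = {site_key(u) for u in source_urls[:limit]}
    if not abbr_keys:
        return 0.0
    # countA: each abbreviation result counts at most once.
    count_a = sum(1 for key in abbr_keys if key in source_keys)
    # Assumption: P1 = countA / (number of results actually taken).
    return count_a / len(abbr_keys)
```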
The co-occurrence similarity is calculated as follows:
(1) The abbreviation and the source phrase of the current candidate pair are submitted together to a search engine as one query, and a second preset number of search results are taken from the results returned. In this embodiment, the top 20 search results are taken. If the search engine returns fewer than 20 results, the number actually returned is used as the second preset number.
(2) A co-occurrence count is initialized (to 0 in this embodiment), and the summary of each search result taken is treated in turn as the current summary: if both the abbreviation and the source phrase appear in the current summary, the co-occurrence count is incremented by 1.
The co-occurrence similarity is calculated from the co-occurrence count and the second preset number by the following formula:
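[Formula image GSB00000639575400062: the co-occurrence similarity $P_2$, calculated from the co-occurrence count countB and the second preset number]
where $P_2$ is the co-occurrence similarity and countB is the co-occurrence count.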
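A sketch of the co-occurrence count. As with the link similarity, the formula image is not reproduced here, so dividing the count by the number of summaries taken is an assumption; the function name and the example summaries are hypothetical.

```python
def cooccurrence_similarity(abbr, source, summaries, limit=20):
    """Share of the top result summaries that contain both the abbreviation
    and the source phrase."""
    taken = summaries[:limit]
    if not taken:
        return 0.0
    # countB: summaries in which both strings occur.
    count_b = sum(1 for text in taken if abbr in text and source in text)
    # Assumption: P2 = countB / (number of summaries actually taken).
    return count_b / len(taken)
```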
This embodiment calculates the text similarity with an existing text classification method, as follows:
(1) A corpus (such as a comprehensive encyclopedia) is used for training. The corpus is segmented by bigram segmentation, which cuts the text into overlapping two-character terms; for example, the sentence 我爱清华大学 ("I love Tsinghua University") yields the five terms 我爱, 爱清, 清华, 华大, and 大学. From the resulting terms, 60,000 are selected as features, so the feature vector space has 60,000 dimensions.
(2) For each candidate pair in the result of the second screening, the abbreviation and the source phrase are each submitted to a search engine as keywords, and the top 20 summaries returned for each are written to a file as one text.
(3) For the two retrieved texts in this file, one for the abbreviation and one for the source phrase, each term obtained by bigram segmentation that belongs to the 60,000-term feature set is assigned a weight by formula one below. The weight $a_{ij}$ of each term is the value of the corresponding dimension of the feature vector, so the abbreviation text yields one feature vector and the source phrase text yields another.
Term weight formula:
$$a_{ij} = \frac{\log(TF_{ij} + 1.0)\,\log(N / DF_i)}{\sqrt{\sum_k \left[\log(TF_{kj} + 1.0)\,\log(N / DF_k)\right]^2}}$$ (formula one),
where $i \in [1, 60000]$, $j$ is 1 or 2, $a_{ij}$ is the weight of term $i$ in text $j$, $TF_{ij}$ is the frequency of term $i$ in text $j$, $N$ is the number of documents in the corpus, and $DF_i$ is the document frequency of term $i$ in the corpus, that is, the number of documents in which it occurs. Text 1 is the abbreviation text and text 2 is the source phrase text.
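Bigram segmentation and the formula-one weights can be sketched as follows, assuming that the denominator of formula one is the square root of the sum of squares (length normalization). The vector is stored sparsely, which is equivalent because a term with frequency 0 has weight log(1) = 0. The function names and the toy document frequencies in the test are hypothetical.

```python
import math
from collections import Counter

def bigrams(text):
    """Overlapping two-character terms of a string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def weight_vector(text, doc_freq, n_docs):
    """Sparse formula-one weights {term: a_ij} for one retrieved text.

    doc_freq maps each feature term to its document frequency in the corpus;
    n_docs is the number of documents in the corpus.
    """
    tf = Counter(t for t in bigrams(text) if t in doc_freq)
    raw = {t: math.log(c + 1.0) * math.log(n_docs / doc_freq[t])
           for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm > 0 else {}
```

Because of the normalization, every non-empty vector produced this way has unit length.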
The text similarity of the two is then calculated with the standard cosine similarity formula, and the result is written to a file.
The text similarity of the abbreviation and the source phrase is
$$\mathrm{sim}(\vec{v}_1, \vec{v}_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\lVert \vec{v}_1 \rVert \, \lVert \vec{v}_2 \rVert}$$ (formula two)
where $\vec{v}_1$ is the vector of the abbreviation text obtained with the search engine, $\vec{v}_2$ is the vector of the source phrase text obtained with the search engine, $\vec{v}_1 \cdot \vec{v}_2$ is the product of the two vectors, and $\lVert \vec{v}_1 \rVert \, \lVert \vec{v}_2 \rVert$ is the product of their norms.
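Cosine similarity over sparse term-weight vectors can be sketched as below; representing each vector as a dict from term to weight is an assumption made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_u * norm_v)
```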
320: Obtain the total similarity from the web page link similarity, the co-occurrence similarity, and the text similarity.
The three similarities may be calculated in any order. After all three have been calculated, each is assigned a weight (a percentage) according to the actual situation or experience, and the total similarity is calculated from the three similarities using these weights.
330: If the total similarity is greater than a preset threshold, the current candidate pair is kept; otherwise it is removed.
The preset threshold can be determined by experiment and is a number between 0 and 1.
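Steps 320 and 330 can be sketched as a weighted sum followed by a threshold test. The weights (0.4, 0.3, 0.3) and the threshold 0.5 are placeholder values chosen for illustration; the text only says that they are set from experience or experiment.

```python
def total_similarity(p_link, p_cooc, p_text, weights=(0.4, 0.3, 0.3)):
    """Weighted combination of the three similarities (placeholder weights)."""
    w_link, w_cooc, w_text = weights
    return w_link * p_link + w_cooc * p_cooc + w_text * p_text

def keep_pair(p_link, p_cooc, p_text, threshold=0.5):
    """Step 330: keep the pair only if the total exceeds the threshold."""
    return total_similarity(p_link, p_cooc, p_text) > threshold
```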
Through the above steps, this embodiment can quickly obtain, by computer, a good result set of abbreviation and source phrase pairs. This embodiment also uses the click counts of the abbreviation and the source phrase in the user query log, and calculates a recommendation value as the weighted harmonic mean of the two (formula three):
[Formula image GSB00000639575400081: the recommendation value as the weighted harmonic mean of the click counts of the abbreviation and the source phrase]
(formula three),
The larger the recommendation value, the more the pair is used and the more popular it is, so the value can be used for popularity-based recommendation.
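A sketch of a weighted harmonic mean of two click counts. The formula image is not reproduced in this text, so the weights are unknown; the equal weights used here are placeholders, and the function name is hypothetical.

```python
def recommendation(abbr_clicks, source_clicks, w_abbr=0.5, w_source=0.5):
    """Weighted harmonic mean of two click counts (placeholder weights)."""
    if abbr_clicks <= 0 or source_clicks <= 0:
        return 0.0  # the harmonic mean is undefined for zero counts
    return (w_abbr + w_source) / (w_abbr / abbr_clicks + w_source / source_clicks)
```

A harmonic mean is pulled toward the smaller of the two counts, so a pair scores highly only when both the abbreviation and the source phrase are clicked often.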
This embodiment of the invention uses a user query log to mine Chinese abbreviations and, through a series of filtering and screening steps, quickly obtains a good result set of abbreviation and source phrase pairs from a real corpus, which improves the timeliness and accuracy of the abbreviation and source phrase pairs.
Embodiment two
This embodiment of the invention provides a Chinese abbreviation processing device which, as shown in Fig. 4, comprises:
A preprocessing module 401, configured to preprocess all query terms in the user query log.
Noise query terms are removed from the user query log. Here, noise query terms are mainly query terms that contain foreign-language characters or garbled characters. The preprocessing also filters out digits, full-width letters, punctuation marks, spaces, and the like from the query terms.
A related-term aggregation module 402, configured to aggregate the query terms in the preprocessed query log that point to the same directory of the same website into one group, to obtain a plurality of groups.
After the preprocessing by the preprocessing module 401, most of the remaining query entries are normal Chinese query terms. A record in the query log generally contains a query term, a URL (Uniform Resource Locator, also called a web page address) corresponding to that query term, and the number of user clicks on that URL. A query term that points to the same directory of the same website is one whose corresponding URL in the query log points to that directory; such query terms are closely related. For example, the query term "Peking University" corresponds to the URL www.pku.edu.cn, and "Beida" also corresponds to www.pku.edu.cn, so "Peking University" and "Beida" are highly correlated.
Identical URLs point to the same directory of the same website. For different URLs, this embodiment determines whether they point to the same directory of the same website as follows:
A URL generally begins with http://. The part of the address after http:// and before the third "/" is kept. For example, for http://sports.sohu.com/nba.shtml only sports.sohu.com is kept. If the URL contains only three "/" characters and nothing follows the third one, the content after http:// is kept; for example, for http://www.sohu.com/ only www.sohu.com is kept. If the kept results of different URLs are identical, the URLs are considered to point to the same directory of the same website.
Candidate pair generation module 403 is configured to perform the following on the query terms of each group obtained by related-term aggregation module 402: generate, according to the word alignment rule, a plurality of candidate pairs in which a source phrase and an abbreviation within the group are matched.
The word alignment rule is: (1) the term with fewer characters is regarded as the abbreviation, and the term with more characters as the source phrase; (2) every character of the abbreviation appears, in order, in the source phrase. Thus, if every character of query term A appears in order in query term B, and query term A has fewer characters than query term B, the two are selected as a candidate pair of matched source phrase and abbreviation in this embodiment, where query term A is the abbreviation and query term B is the source phrase.
In this embodiment, all candidate pairs of matched source phrases and abbreviations in the current group are generated according to the word alignment rule. For example, after related-term aggregation module 402 has aggregated the query terms in the preprocessed query log that point to the same directory of the same website into groups, suppose one of the resulting groups contains the five query terms "Beijing University", "Beijing University student", "Peking University", "Peking University student", and "Beijing University's enrollment of universities". The candidate pairs extracted according to the word alignment rule are then: (Beijing University, Beijing University student); (Beijing University, Peking University); (Beijing University, Peking University student); (Beijing University student, Peking University student); (Peking University, Peking University student); (Beijing University, Beijing University's enrollment of universities); (Peking University, Beijing University's enrollment of universities); (Beijing University student, Beijing University's enrollment of universities). Each pair of parentheses encloses one candidate pair, with the abbreviation first and the source phrase second.
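The word alignment rule amounts to an ordered-subsequence test on characters. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the Chinese strings in the usage note are assumed originals of the translated example terms, not text confirmed by this document.

```python
from itertools import permutations


def is_ordered_subsequence(short, long):
    """True if every character of `short` appears in `long` in the same order."""
    remaining = iter(long)
    # `ch in remaining` advances the iterator, so matches must occur in order.
    return all(ch in remaining for ch in short)


def candidate_pairs(group):
    """Return (abbreviation, source phrase) candidates for one group of query terms."""
    pairs = []
    for abbr, source in permutations(group, 2):
        # Rule (1): the abbreviation is the shorter term.
        # Rule (2): its characters appear in order in the source phrase.
        if len(abbr) < len(source) and is_ordered_subsequence(abbr, source):
            pairs.append((abbr, source))
    return pairs
```

For the assumed group `["北大", "北大生", "北京大学", "北京大学生"]`, this yields five pairs, matching the first five pairs listed in the example above.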
Filtering module 404 is configured to, for each candidate pair, filter out the place name in the source phrase if the source phrase contains a place name and the abbreviation contains no morpheme corresponding to that place name.
After the candidate pairs are extracted, place names in the candidate pairs are selectively filtered, because place names can interfere with abbreviation identification. If the source phrase of a pair contains a place name and the abbreviation contains no morpheme corresponding to that place name, the place name is filtered out of the source phrase. For example, in the candidate pair (City No. 1 Middle, Shenyang City No. 1 Middle School), "Shenyang" in the source phrase "Shenyang City No. 1 Middle School" has no corresponding morpheme in the abbreviation "City No. 1 Middle", and "City" contributes nothing to identifying the abbreviation, so this pair is processed into the candidate pair (No. 1 Middle, No. 1 Middle School). Likewise, (Shenyang No. 1 Middle, Liaoning Province Shenyang City No. 1 Middle School) should also be processed into the candidate pair (No. 1 Middle, No. 1 Middle School). By contrast, for the place name "Beijing" (北京) in the candidate pair (Beijing University, Peking University), the abbreviation "Beijing University" (北大) contains the corresponding morpheme "北" ("north") but not "Beijing" itself, so the place name should be kept; that is, this pair is left unprocessed.
Screening module 405 is configured to screen the filtered result within the group according to preset rules, obtaining the set of source phrase and abbreviation pairs within the group.
Screening module 405 specifically comprises:
A first submodule, configured to remove, from the filtered result within the group, the candidate pairs that contain a personal name and to retain the candidate pairs that do not, obtaining the result of the first screening.
For example, in the candidate pair (Wang Wei, Wang Wei's elder brother), "Wang Wei" is a personal name, so this pair is removed from the candidate pairs of the current group.
In addition, one abbreviation among the candidate pairs often corresponds to more than three source phrases; for example, the abbreviation "Beijing University" among the candidate pairs described in step 130 corresponds to more than three source phrases. Therefore, this embodiment also uses the click-count information of query terms in the user query log to reduce the case of one abbreviation corresponding to many source phrases to "one to three", that is, one abbreviation corresponds to at most three source phrases. Specifically, among the source phrases corresponding to the same abbreviation, the three with the highest user click counts are selected.
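The "one to three" reduction can be sketched as a per-abbreviation ranking by click count. This is a minimal sketch under assumptions: the function name `keep_top_sources` and the `clicks` mapping (source phrase to click count taken from the query log) are hypothetical, and a source phrase missing from the mapping is treated as having zero clicks.

```python
from collections import defaultdict


def keep_top_sources(pairs, clicks, limit=3):
    """Keep, for each abbreviation, the `limit` most-clicked source phrases."""
    by_abbr = defaultdict(list)
    for abbr, source in pairs:
        by_abbr[abbr].append(source)

    kept = []
    for abbr, sources in by_abbr.items():
        # Rank source phrases by click count, highest first (unknown phrases count as 0).
        ranked = sorted(sources, key=lambda s: clicks.get(s, 0), reverse=True)
        kept.extend((abbr, s) for s in ranked[:limit])
    return kept
```

How ties between equally clicked source phrases are broken is not stated in the text; the sketch simply keeps their original order.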
A second submodule, configured to retain, from the result of the first screening, the candidate pairs in which the head and tail words of the source phrase have corresponding morphemes in the abbreviation, obtaining the result of the second screening.
Candidate pairs in which the head and tail words of the source phrase have corresponding morphemes in the abbreviation cover three cases: (1) morpheme composition: each word in the source phrase corresponds to one character in the abbreviation, for example "Peking University" abbreviated as "Beijing University"; (2) mixing: the abbreviation is formed by a mixed method and, compared with the source phrase, omits no word, for example "setting-up exercises to radio music" abbreviated as "setting-up exercises to music"; (3) deletion: some words in the middle are deleted, each remaining word contributes one character, and the head and tail words are not omitted, for example "the People's Republic of China" abbreviated as "China".
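One possible reading of the head-and-tail check is sketched below. It rests on two assumptions that the text does not spell out: the source phrase has already been segmented into words by some external segmenter, and a word "has a corresponding morpheme" when at least one of its characters occurs in the abbreviation. The function name and the Chinese test strings are illustrative only.

```python
def head_tail_covered(abbr, source_words):
    """True if both the first and the last word of the source phrase
    share at least one character with the abbreviation."""
    if not source_words:
        return False
    abbr_chars = set(abbr)
    head, tail = source_words[0], source_words[-1]
    return bool(abbr_chars & set(head)) and bool(abbr_chars & set(tail))
```

Under this reading, a pair such as ("中国", ["中华", "人民", "共和国"]) passes as a deletion case, because the head and tail words each contribute one character.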
A third submodule, configured to screen the candidate pairs in the result of the second screening according to web-link similarity, co-occurrence similarity and text similarity, obtaining the set of source phrase and abbreviation pairs within the group. The third submodule specifically comprises:
A first unit, configured to take each candidate pair in the result of the second screening in turn as the current candidate pair and to calculate the web-link similarity, co-occurrence similarity and text similarity of the current candidate pair.
The web-link similarity is calculated as follows:
(1) The abbreviation and the source phrase of the current candidate pair are each submitted to a search engine as a query term, and from all results returned, a first preset number of search results for the abbreviation and a first preset number of search results for the source phrase are taken. In this embodiment, the top 20 search results for the abbreviation and the top 20 search results for the source phrase are taken. If the search engine returns fewer than 20 results, the number actually returned is used as the first preset number.
(2) A similarity count is initialized (to 0 in this embodiment), and each of the retrieved search results for the abbreviation is taken in turn as the current entry, for which the following is performed:
The web link of the current entry is compared in turn with the web links of the retrieved search results for the source phrase; as soon as an identical link is found, the comparison stops and the similarity count is incremented by 1. In this embodiment, only the first level of the web links is compared, namely the part of the address before the third "/" as described in step 120.
After every retrieved search result for the abbreviation has been compared with the retrieved search results for the source phrase, the web-link similarity is calculated from the similarity count and the first preset number. The web-link similarity formula is:
$$P_1 = \frac{\mathrm{countA}}{\text{first preset number}}$$
where $P_1$ is the web-link similarity and countA is the similarity count.
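The counting procedure can be sketched as below. This is an illustration under assumptions rather than the patented code: the function names are hypothetical, the result lists stand in for URLs returned by a search engine, and the similarity is taken to be the similarity count divided by the number of abbreviation results actually used.

```python
def _site_key(url):
    """Part of the URL after "http://" and before the third "/"."""
    if url.startswith("http://"):
        url = url[len("http://"):]
    return url.split("/", 1)[0]


def link_similarity(abbr_urls, source_urls, preset=20):
    """Share of the abbreviation's top results whose site also appears
    among the source phrase's top results."""
    abbr_top = abbr_urls[:preset]
    source_sites = {_site_key(u) for u in source_urls[:preset]}
    if not abbr_top:
        return 0.0  # no results to compare; assumption for the empty case
    # Each abbreviation result adds at most 1, as in "stop at the first match".
    count_a = sum(1 for u in abbr_top if _site_key(u) in source_sites)
    return count_a / len(abbr_top)
```

Using a set of site keys gives the same count as comparing entry by entry and stopping at the first match, since each abbreviation result contributes at most one increment.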
The co-occurrence similarity is calculated as follows:
(1) The abbreviation and the source phrase of the current candidate pair are submitted together to a search engine as a single query, and a second preset number of search results are taken from the results returned. In this embodiment, the top 20 search results are taken. If the search engine returns fewer than 20 results, the number actually returned is used as the second preset number.
(2) A co-occurrence count is initialized (to 0 in this embodiment), and the summary of each retrieved search result is taken in turn as the current summary, for which the following is performed: if both the abbreviation and the source phrase appear in the current summary, the co-occurrence count is incremented by 1.
The co-occurrence similarity is calculated from the co-occurrence count and the second preset number. The co-occurrence similarity formula is:
$$P_2 = \frac{\mathrm{countB}}{\text{second preset number}}$$
where $P_2$ is the co-occurrence similarity and countB is the co-occurrence count.
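A minimal sketch of the co-occurrence count follows. The function name is hypothetical, the `snippets` list stands in for result summaries returned by a search engine, and the similarity is assumed to be the co-occurrence count divided by the number of summaries actually used.

```python
def cooccurrence_similarity(abbr, source, snippets, preset=20):
    """Share of the top summaries that contain both the abbreviation
    and the source phrase."""
    top = snippets[:preset]
    if not top:
        return 0.0  # no results; assumption for the empty case
    # Plain substring tests; if the abbreviation is itself a substring of the
    # source phrase, the first test is implied by the second.
    count_b = sum(1 for s in top if abbr in s and source in s)
    return count_b / len(top)
```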
This embodiment calculates the text similarity with an existing text classification method, as follows:
(1) A corpus (such as "Macropaedia") is used for training. Through binary segmentation (cutting the text into overlapping two-character words; for example, the sentence "I like Tsing-Hua University" (我爱清华大学) yields 5 words: 我爱, 爱清, 清华, 华大, 大学), 60,000 words are selected as the feature representation, so the feature vector has 60,000 dimensions.
(2) For each candidate pair in the result of the second screening, the abbreviation and the source phrase are each submitted to a search engine as a keyword, and the top 20 summaries of each are written to a file as one text.
(3) For the above files, that is, the two retrieved texts of the abbreviation and the source phrase, each word obtained from a text by binary segmentation that belongs to the 60,000-dimensional feature set is assigned a weight (calculated by formula one below). The weight $a_{ij}$ of each word is the value of the corresponding dimension of the feature vector. The abbreviation text thus yields a corresponding vector $\vec{a}_1$, and the source phrase text likewise yields a corresponding vector $\vec{a}_2$.
Word weight formula:
$$a_{ij} = \frac{\log(TF_{ij} + 1.0) \cdot \log(N / DF_i)}{\sqrt{\sum_k \left[ \log(TF_{kj} + 1.0) \cdot \log(N / DF_k) \right]^2}} \quad \text{(formula one)}$$
where $i \in [1, 60000]$, $j$ is 1 or 2, $a_{ij}$ is the weight of word $i$ in text $j$, $TF_{ij}$ is the frequency of word $i$ in text $j$, and $N$ is the number of documents in the corpus. $DF_i$ is the document frequency of word $i$ in the corpus, that is, the number of documents in which it occurs. Text 1 is the abbreviation text, and text 2 is the source phrase text.
The text similarity of the two is then calculated with the existing cosine similarity formula, and the result is finally written to a file.
The text similarity of the abbreviation and the source phrase is:
$$\mathrm{sim}(\vec{a}_1, \vec{a}_2) = \frac{\vec{a}_1 \cdot \vec{a}_2}{|\vec{a}_1| \times |\vec{a}_2|} \quad \text{(formula two)}$$
where $\vec{a}_1$ is the vector of the abbreviation text obtained for the abbreviation with the search engine, and $\vec{a}_2$ is the vector of the source phrase text obtained for the source phrase with the search engine. $\vec{a}_1 \cdot \vec{a}_2$ denotes the product of vector $\vec{a}_1$ and vector $\vec{a}_2$, and $|\vec{a}_1| \times |\vec{a}_2|$ denotes the modulus of vector $\vec{a}_1$ multiplied by the modulus of vector $\vec{a}_2$.
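The text-similarity steps (binary segmentation, normalized word weights, cosine similarity) can be sketched as follows. This is a hedged illustration: the function names are hypothetical, `doc_freq` stands in for the trained feature set with its document frequencies, and the weight normalization assumes a square root over the sum of squares in the denominator. Vectors are stored sparsely, which is equivalent because a word with zero frequency has weight $\log(1.0) = 0$.

```python
import math
from collections import Counter


def bigrams(text):
    """Binary segmentation: overlapping two-character words."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def weight_vector(text, doc_freq, n_docs):
    """Sparse vector of normalized word weights for one text.
    doc_freq maps each feature word to its document frequency (hypothetical input)."""
    tf = Counter(g for g in bigrams(text) if g in doc_freq)
    raw = {g: math.log(c + 1.0) * math.log(n_docs / doc_freq[g]) for g, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {g: w / norm for g, w in raw.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

In use, `weight_vector` would be applied once to the concatenated summaries of the abbreviation and once to those of the source phrase, and `cosine` applied to the two results.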
A second unit, configured to obtain a total similarity from the web-link similarity, the co-occurrence similarity and the text similarity.
After the web-link similarity, co-occurrence similarity and text similarity have all been calculated, each similarity can be assigned a weight (a percentage) according to actual conditions or experience; the total similarity, which takes the web-link similarity, co-occurrence similarity and text similarity into account together, is then calculated according to the assigned weights.
A third unit, configured to retain the current candidate pair if the total similarity is greater than a preset threshold, and to remove it otherwise.
The preset threshold here can be determined by experiment and is a number between 0 and 1.
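The combination and thresholding step can be sketched as a weighted sum. The weights `(0.3, 0.3, 0.4)` and the threshold `0.5` below are placeholders chosen only for illustration; the text leaves both to experience or experiment, and the function names are hypothetical.

```python
def total_similarity(p1, p2, p3, weights=(0.3, 0.3, 0.4)):
    """Weighted sum of web-link (p1), co-occurrence (p2) and text (p3) similarity.
    The default weights are illustrative placeholders, not values from the text."""
    w1, w2, w3 = weights
    return w1 * p1 + w2 * p2 + w3 * p3


def keep_pair(p1, p2, p3, threshold=0.5, weights=(0.3, 0.3, 0.4)):
    """Retain the candidate pair only if the total similarity exceeds the threshold."""
    return total_similarity(p1, p2, p3, weights) > threshold
```

Choosing weights that sum to 1 keeps the total similarity in the range 0 to 1, which matches a threshold chosen from that range.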
The embodiments of the invention use the user query log to mine Chinese abbreviations and, through the filtering module and the screening module, quickly obtain a good result set of abbreviation and source phrase pairs from a real corpus, improving the timeliness and accuracy of the abbreviation and source phrase pairs.
The embodiments of the invention can be implemented in software, and the corresponding software program can be stored in a readable storage medium, for example a computer's hard disk, cache, or optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent replacement, improvement or the like made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (8)

1. A Chinese abbreviation processing method, applied to retrieval in a search engine, characterized by comprising:
preprocessing all query terms in a user query log;
aggregating the query terms in the preprocessed query log that point to the same directory of the same website into one group, obtaining a plurality of groups;
for the query terms in each group, performing:
generating, according to a word alignment rule, a plurality of candidate pairs in which a source phrase and an abbreviation within the group are matched;
for each candidate pair, if the source phrase therein contains a place name and the abbreviation therein contains no morpheme corresponding to said place name, filtering out the place name in said source phrase;
screening the filtered result within the group according to preset rules to obtain a set of source phrase and abbreviation pairs within the group, specifically comprising:
in the filtered result within the group, removing the candidate pairs that contain a personal name and retaining the candidate pairs that do not, obtaining a result of the first screening;
in the result of the first screening, retaining the candidate pairs in which the head and tail words of the source phrase have corresponding morphemes in the abbreviation, obtaining a result of the second screening;
in the result of the second screening, screening the candidate pairs according to web-link similarity, co-occurrence similarity and text similarity, obtaining the set of source phrase and abbreviation pairs within the group.
2. The Chinese abbreviation processing method as claimed in claim 1, characterized in that said preprocessing of all query terms in the user query log is specifically:
removing the query terms in the user query log that contain foreign-language characters or garbled characters, and filtering out digits, punctuation marks or spaces in the query terms.
3. The Chinese abbreviation processing method as claimed in claim 1, characterized in that said screening, in the result of the second screening, of the candidate pairs according to web-link similarity, co-occurrence similarity and text similarity to obtain the set of source phrase and abbreviation pairs within the group specifically comprises:
in the result of the second screening, taking each candidate pair in turn as the current candidate pair and performing:
calculating the web-link similarity, co-occurrence similarity and text similarity of the current candidate pair;
obtaining a total similarity from said web-link similarity, co-occurrence similarity and text similarity;
if said total similarity is greater than a preset threshold, retaining the current candidate pair, and otherwise removing it.
4. The Chinese abbreviation processing method as claimed in claim 3, characterized in that the method of calculating said web-link similarity specifically comprises:
submitting the abbreviation and the source phrase of the current candidate pair each to a search engine as a query term, and taking, from all results returned by the search engine, a first preset number of search results for said abbreviation and a first preset number of search results for said source phrase;
initializing a similarity count, and taking each of the retrieved search results for said abbreviation in turn as the current entry, performing:
comparing the web link of the current entry in turn with the web links of the retrieved search results for said source phrase and, when an identical link is found, stopping the comparison and incrementing said similarity count by 1;
after every retrieved search result for said abbreviation has been compared with the retrieved search results for said source phrase, calculating the web-link similarity from said similarity count and the first preset number, said web-link similarity formula being:
$$P_1 = \frac{\mathrm{countA}}{\text{first preset number}}$$
where $P_1$ is the web-link similarity and countA is said similarity count.
5. The Chinese abbreviation processing method as claimed in claim 3, characterized in that the method of calculating said co-occurrence similarity specifically comprises:
submitting the abbreviation and the source phrase of the current candidate pair together to a search engine as a query, and taking a second preset number of search results from the results returned by the search engine;
initializing a co-occurrence count, and taking the summary of each retrieved search result in turn as the current summary, performing: if both said abbreviation and said source phrase appear in the current summary, incrementing said co-occurrence count by 1;
calculating the co-occurrence similarity from said co-occurrence count and said second preset number, said co-occurrence similarity formula being:
$$P_2 = \frac{\mathrm{countB}}{\text{second preset number}}$$
where $P_2$ is the co-occurrence similarity and countB is said co-occurrence count.
6. A Chinese abbreviation processing device, applied to retrieval in a search engine, characterized by comprising:
a preprocessing module, configured to preprocess all query terms in a user query log;
a related-term aggregation module, configured to aggregate the query terms in the preprocessed query log that point to the same directory of the same website into one group, obtaining a plurality of groups;
a candidate pair generation module, configured to perform, for the query terms in each group: generating, according to a word alignment rule, a plurality of candidate pairs in which a source phrase and an abbreviation within the group are matched;
a filtering module, configured to, for each candidate pair, filter out the place name in the source phrase if the source phrase therein contains a place name and the abbreviation therein contains no morpheme corresponding to said place name;
a screening module, configured to screen the filtered result within the group according to preset rules to obtain a set of source phrase and abbreviation pairs within the group, specifically comprising:
a first submodule, configured to remove, from the filtered result within the group, the candidate pairs that contain a personal name and to retain the candidate pairs that do not, obtaining a result of the first screening;
a second submodule, configured to retain, from the result of the first screening, the candidate pairs in which the head and tail words of the source phrase have corresponding morphemes in the abbreviation, obtaining a result of the second screening;
a third submodule, configured to screen the candidate pairs in the result of the second screening according to web-link similarity, co-occurrence similarity and text similarity, obtaining the set of source phrase and abbreviation pairs within the group.
7. The Chinese abbreviation processing device as claimed in claim 6, characterized in that said preprocessing module is specifically configured to remove the query terms in the user query log that contain foreign-language characters or garbled characters, and to filter out digits, punctuation marks or spaces in the query terms.
8. The Chinese abbreviation processing device as claimed in claim 6, characterized in that said third submodule comprises:
a first unit, configured to take each candidate pair in the result of the second screening in turn as the current candidate pair and to calculate the web-link similarity, co-occurrence similarity and text similarity of the current candidate pair;
a second unit, configured to obtain a total similarity from said web-link similarity, co-occurrence similarity and text similarity;
a third unit, configured to retain the current candidate pair if said total similarity is greater than a preset threshold, and to remove it otherwise.
CN2009100883776A 2009-07-02 2009-07-02 Chinese abbreviation processing method and device therefor Expired - Fee Related CN101599075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100883776A CN101599075B (en) 2009-07-02 2009-07-02 Chinese abbreviation processing method and device therefor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100883776A CN101599075B (en) 2009-07-02 2009-07-02 Chinese abbreviation processing method and device therefor

Publications (2)

Publication Number Publication Date
CN101599075A CN101599075A (en) 2009-12-09
CN101599075B true CN101599075B (en) 2012-02-29

Family

ID=41420523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100883776A Expired - Fee Related CN101599075B (en) 2009-07-02 2009-07-02 Chinese abbreviation processing method and device therefor

Country Status (1)

Country Link
CN (1) CN101599075B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5898153B2 (en) * 2013-09-05 2016-04-06 京セラドキュメントソリューションズ株式会社 Abbreviation management program, abbreviation management device, full spell display program, and full spell display device
CN103605641A (en) * 2013-11-11 2014-02-26 清华大学 Method and device for automatically discovering Chinese abbreviations
CN103646017B (en) * 2013-12-11 2017-01-04 南京大学 Acronym generating system for naming and working method thereof
CN104281565B (en) * 2014-09-30 2017-09-05 百度在线网络技术(北京)有限公司 Semantic dictionary construction method and device
CN107491537A (en) * 2017-08-23 2017-12-19 北京百度网讯科技有限公司 POI data excavation, information retrieval method, device, equipment and medium

Also Published As

Publication number Publication date
CN101599075A (en) 2009-12-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120229

Termination date: 20120702