CN102236637B - Method and system for determining collocation degree of collocations with central word - Google Patents

Method and system for determining collocation degree of collocations with central word

Info

Publication number
CN102236637B
Authority
CN
China
Prior art keywords
word
collocation
centre
language material
pairs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201010158112
Other languages
Chinese (zh)
Other versions
CN102236637A (en
Inventor
张宇峰
陈学文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Original Assignee
Beijing Kingsoft Software Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Software Co Ltd, Beijing Jinshan Digital Entertainment Technology Co Ltd filed Critical Beijing Kingsoft Software Co Ltd
Priority to CN 201010158112 priority Critical patent/CN102236637B/en
Publication of CN102236637A publication Critical patent/CN102236637A/en
Application granted granted Critical
Publication of CN102236637B publication Critical patent/CN102236637B/en

Abstract

The embodiments of the invention disclose a method and a system for determining the collocation degree between collocation words and a central word. In the scheme provided by the embodiments, the number of occurrences of each collocation word in a corpus and the number of times it occurs in collocation with the central word are first counted, and a secondary calculation of the collocation degree is then performed on these counts. For example, the number of times a collocation word occurs in collocation with the central word is divided by the number of occurrences of that collocation word in the corpus. In this way, collocation words with a low collocation degree with the central word are effectively removed from the collocation word set, and collocation words with a high collocation degree are highlighted. The collocation words are then re-ordered by the calculated collocation degree, and those with a high collocation degree are the ones combined most closely with the central word.

Description

Method and system for determining the collocation degree between collocation words and a central word
Technical field
The present invention relates to the field of network technology, and in particular to a method and system for determining the collocation degree between collocation words and a central word.
Background art
With the continuing spread of network technology, the network has become part of every aspect of people's daily work and life, and the amount of available information keeps expanding. When acquiring information, the understanding of a given word is influenced to a large extent by the words that frequently collocate with it. In the embodiments of the present application, for simplicity, the given word is called the central word, and a word that occurs in collocation with the central word is called a collocation word.
Through study of the prior art, the inventors found that, as the amount of information grows and becomes more varied and complex, existing techniques cannot accurately determine, from the many words that occur in collocation with a central word, the collocation words that collocate with it most closely.
Summary of the invention
In view of this, the purpose of the embodiments of the invention is to provide a method and system for determining the collocation degree between collocation words and a central word, so as to accurately determine, from the many words that occur in collocation with the central word, the collocation words that collocate with it most closely.
To achieve the above purpose, the embodiments of the invention provide the following technical solution:
A method for determining the collocation degree between collocation words and a central word, comprising:
obtaining a corpus, determining a central word, and determining the collocation word set corresponding to the central word;
counting the number of occurrences in the corpus of each collocation word in the collocation word set;
counting the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus;
calculating the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the number of occurrences of the collocation word in the corpus;
determining the collocation degree of each collocation word with the central word on the basis of the calculated collocation degree.
Calculating the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the total number of occurrences of the collocation word in the corpus comprises:
dividing the number of times the collocation word occurs in collocation with the central word in the corpus by the number of occurrences of the collocation word in the corpus to obtain the collocation degree of each collocation word with the central word.
After obtaining the corpus and determining the central word, and before calculating the collocation degree of each collocation word with the central word, the method further comprises:
counting the total number of words in the corpus;
in which case calculating the collocation degree of each collocation word with the central word is specifically:
calculating the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus, the number of occurrences of the collocation word in the corpus, and the total number of words in the corpus.
Calculating the collocation degree from these three quantities is specifically:
calculating the collocation degree of each collocation word with the central word according to the following formula:
$f(i) = \sqrt{\mathrm{term\_i}} \times \log(\mathrm{allcnt} \div \mathrm{document\_i})$;
where f(i) is the collocation degree of a given central word with collocation word i; term_i is the number of times collocation word i occurs in collocation with the central word in the corpus; document_i is the number of times collocation word i occurs in the whole corpus; and allcnt is the total number of words in the corpus.
The above method further comprises:
providing the collocation words of the central word to the user according to the collocation degree.
A system for determining the collocation degree between collocation words and a central word, comprising:
a preprocessing unit, configured to obtain a corpus, determine a central word, and determine the collocation word set corresponding to the central word;
a first counting unit, configured to count the number of occurrences in the corpus of each collocation word in the collocation word set;
a second counting unit, configured to count the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus;
a computing unit, configured to calculate the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the total number of occurrences of the collocation word in the corpus;
a second determining unit, configured to determine the collocation degree of each collocation word with the central word on the basis of the calculated collocation degree.
The computing unit comprises:
a first obtaining subunit, configured to obtain from the first counting unit the number of occurrences in the corpus of each collocation word in the collocation word set;
a second obtaining subunit, configured to obtain from the second counting unit the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus;
a first computing subunit, configured to divide the number of times the collocation word occurs in collocation with the central word in the corpus by the number of occurrences of the collocation word in the corpus to obtain the collocation degree of each collocation word with the central word.
The above system further comprises:
a third counting unit, configured to count the total number of words in the corpus.
The computing unit further comprises:
a third obtaining subunit, configured to obtain the total number of words in the corpus from the third counting unit;
a second computing subunit, configured to calculate the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus, the total number of occurrences of the collocation word in the corpus, and the total number of words in the corpus.
The second computing subunit calculates the collocation degree of each collocation word with the central word according to the following formula:
$f(i) = \sqrt{\mathrm{term\_i}} \times \log(\mathrm{allcnt} \div \mathrm{document\_i})$;
where f(i) is the collocation degree of a given central word with collocation word i; term_i is the number of times collocation word i occurs in collocation with the central word in the corpus; document_i is the number of times collocation word i occurs in the whole corpus; and allcnt is the total number of words in the corpus.
The above system further comprises:
an assisting unit, configured to provide the collocation words of the central word to the user according to the collocation degree.
As can be seen, the scheme provided by the embodiments of the invention first counts the number of occurrences of each collocation word of the central word in the corpus and the number of times it occurs in collocation with the central word, and then performs a secondary calculation on these counts, for example by dividing the number of times the collocation word occurs in collocation with the central word by the number of occurrences of the collocation word in the corpus. This effectively removes from the collocation word set the collocation words with a low collocation degree with the central word and highlights those with a high collocation degree. The collocation words are then re-ordered from the highest calculated collocation degree to the lowest, and the collocation words at the front of the ordering are the ones combined most closely with the central word.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the method provided by one embodiment of the invention;
Fig. 2 is a flow chart of the method provided by another embodiment of the invention;
Fig. 3 is a structural diagram of the system provided by one embodiment of the invention;
Fig. 4 is a structural diagram of one unit in the system provided by one embodiment of the invention;
Fig. 5 is a structural diagram of the system provided by another embodiment of the invention;
Fig. 6 is a structural diagram of one unit in the system provided by another embodiment of the invention.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions of the invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1, a method for determining the collocation degree between collocation words and a central word provided by an embodiment of the invention comprises:
S101: obtain a corpus, perform word segmentation on the corpus, determine a central word, and determine the collocation word set corresponding to the central word.
The corpus is the basis of the scheme provided by the embodiments of the invention. Users can collect a corpus according to their own needs for a specific field, and many public corpora are also available on the network. The language of the corpus can be Chinese, English, or another language; the application places no restriction on this. For convenience of description, the embodiments use a Chinese corpus as the example.
The object studied by the invention is the word. After the corpus is obtained, word segmentation can first be performed to turn the corpus into a large set of words. For convenience in later steps, some embodiments can also preprocess the corpus further, for example by removing onomatopoeic words.
After segmentation, the central word to be studied must be determined. The problem the embodiments set out to solve is to determine, on the basis of the corpus, the collocation words that collocate most closely with the central word, so that various applications can then make use of them. The central word is chosen according to actual needs: whichever word's collocation words are to be determined serves as the central word.
In general, the larger the corpus, the more collocation words the central word has. In the embodiments of the invention, all collocation words of the central word can be placed in the collocation word set.
S102: count the number of occurrences in the corpus of each collocation word in the collocation word set.
S103: count the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus.
In the embodiments of the invention, a collocation word is a concept relative to the central word. Each collocation word is in fact an independent word: besides occurring in collocation with the central word, it may also occur on its own in the corpus or in collocation with other words. In the embodiments, the total number of words in the corpus, the number of occurrences of each collocation word, and the number of times each collocation word occurs in collocation with the central word are counted separately.
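The three counts described for steps S101 to S103 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `count_statistics`, the toy corpus, and in particular the rule that a word "collocates" with the central word when both appear in the same segmented sentence are assumptions made here, since the text does not fix how a collocation occurrence is detected.

```python
from collections import Counter

def count_statistics(sentences, central_word):
    """Return (total_words, occurrences, co_occurrences) for a segmented corpus.

    sentences: list of token lists (the corpus after word segmentation).
    Assumption: a word collocates with the central word when both appear
    in the same sentence; the source text does not define this precisely.
    """
    total_words = 0
    occurrences = Counter()      # occurrences of each word in the whole corpus
    co_occurrences = Counter()   # occurrences alongside the central word
    for tokens in sentences:
        total_words += len(tokens)
        occurrences.update(tokens)
        if central_word in tokens:
            co_occurrences.update(t for t in tokens if t != central_word)
    return total_words, occurrences, co_occurrences

# Hypothetical toy corpus, already segmented into words.
corpus = [
    ["hacker", "attack", "server"],
    ["hacker", "steal", "data"],
    ["backup", "data", "server"],
]
total, occ, co = count_statistics(corpus, "hacker")
```

Here `occ["data"]` counts every occurrence of the word, while `co["data"]` counts only the sentences it shares with the central word, which are the two quantities the later steps combine.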
S104: calculate the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the number of occurrences of the collocation word in the corpus.
In the embodiments of the invention, the collocation degree of each collocation word with the central word can be calculated by dividing the number of times the collocation word occurs in collocation with the central word in the corpus by the total number of occurrences of the collocation word in the corpus.
S105: determine the collocation degree of each collocation word with the central word on the basis of the calculated collocation degree.
In practice, the collocation words can be sorted by their calculated collocation degree with the central word, so that a collocation word with a higher calculated value is ranked as collocating more closely with the central word than one with a lower value.
When determining the collocation degree between a collocation word and a central word, the prior art generally relies only on the number of times the collocation word occurs in collocation with the central word. Because this single factor is too simple, the collocation degree determined in this way often does not match the actual situation and is very inaccurate.
In the embodiments of the invention, the collocation degree of each collocation word with the central word is calculated from the number of times the collocation word occurs in collocation with the central word in the corpus and the number of occurrences of the collocation word in the corpus. In other words, besides the number of times the collocation word occurs in collocation with the central word, the number of occurrences of the collocation word itself in the corpus is also taken into account. The collocation degree determined in this way reflects more faithfully how much collocating with the central word matters to the collocation word itself. For example, suppose word A occurs 100 times in the corpus and 30 of those occurrences are in collocation with central word M, while word B occurs 50 times in the corpus and 25 of those occurrences are in collocation with central word M. Under the prior-art method, which compares only the collocation counts (30 against 25), the collocation degree of word A with central word M is judged higher than that of word B. In fact, of the 50 occurrences of word B, 25 are in collocation with central word M, so the collocation occurrences account for 50% of the occurrences of word B in the corpus; that is, the collocation degree of word B with central word M is 50%. The collocation occurrences of word A with central word M account for only 30% of the occurrences of word A in the corpus; that is, the collocation degree of word A with central word M is 30%. Under the method provided by the embodiments of the invention, the collocation degree of word B with central word M is therefore higher than that of word A with central word M.
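The arithmetic in the word A and word B example can be checked with a short Python sketch. The counts are taken from the example; the function name `ratio_degree` and the two orderings are illustrative only.

```python
def ratio_degree(co_count, total_count):
    """Collocation degree as co-occurrence count divided by total occurrences."""
    return co_count / total_count

# word -> (occurrences with central word M, occurrences in the corpus)
counts = {"A": (30, 100), "B": (25, 50)}

# Ordering by raw collocation count alone, as in the prior-art approach.
by_raw_count = sorted(counts, key=lambda w: counts[w][0], reverse=True)

# Ordering by the ratio described in the embodiment.
by_ratio = sorted(counts, key=lambda w: ratio_degree(*counts[w]), reverse=True)
```

Ordering by raw count puts A first, while ordering by the ratio puts B first, which is the reversal the example describes.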
In practice, the collocation degree between a collocation word and the central word can be refined further with additional information. For example, in another embodiment of the invention, the following step can be added after step S101 and before step S105:
count the total number of words in the corpus.
In general, once the corpus is determined, the total number of words in it is fixed; for the corpus as a whole, the total number of words is effectively a constant.
Once the total number of words in the corpus is known, it can be taken into account when calculating the collocation degree between a collocation word and the central word. For example, certain functions can be used to smooth the number of times the collocation word occurs in collocation with the central word and the number of occurrences of the collocation word in the corpus.
Formula 1 is one formula, provided by an embodiment of the invention, for calculating the collocation degree between a collocation word and the central word.
$f(i) = \sqrt{\mathrm{term\_i}} \times \log(\mathrm{allcnt} \div \mathrm{document\_i})$ (Formula 1)
Here f(i) is the collocation degree of a given central word with collocation word i, and term_i is the number of times collocation word i occurs in collocation with the central word in the corpus. In the embodiments of the invention, term denotes the collocation word set formed by all collocation words of one central word. For example, if the collocation words of the central word "hacker" include "", "tool", and so on, then the number of times "" occurs in collocation with "hacker" in the whole corpus is its term_i. document_i is the number of occurrences of collocation word i in the whole corpus; in the embodiments of the invention, the whole corpus is denoted document. For example, the document_i of "tool" is the number of occurrences of the word "tool" in the whole corpus. allcnt is the number of all words in the whole corpus; in the embodiments of the invention, repeated words are counted each time they occur.
Formula 1 is an extension of the basic ratio term_i/document_i, that is, the number of times collocation word i occurs in collocation with the central word divided by the number of occurrences of this collocation word in the whole corpus. The higher the value of term_i, the more often the central word and the collocation word occur in collocation in the corpus; the higher the value of document_i, the more often the collocation word occurs in the whole corpus. If the collocation word occurs very often in the whole corpus but the value of term_i/document_i is very small, then occurrences in collocation with the central word are a small share of all occurrences of the collocation word. The collocation word then has relatively little bearing on this central word, and the calculated collocation degree between them is low.
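Formula 1 can be sketched in Python as below. This is a hedged illustration rather than the definitive implementation: it assumes the square-root smoothing of term_i that this embodiment describes (an exponent of 1/2) and a base-10 logarithm, which the text does not state explicitly; the function name `collocation_degree` and the input checks are additions made here.

```python
import math

def collocation_degree(term_i, document_i, allcnt):
    """f(i) = sqrt(term_i) * log(allcnt / document_i).

    term_i:     occurrences of collocation word i together with the central word
    document_i: occurrences of collocation word i in the whole corpus
    allcnt:     total number of words in the corpus
    Assumption: base-10 logarithm; the source does not name the base.
    """
    if term_i < 0 or document_i <= 0 or allcnt < document_i:
        raise ValueError("expected term_i >= 0 and 0 < document_i <= allcnt")
    return math.sqrt(term_i) * math.log10(allcnt / document_i)

# Hypothetical counts: 25 co-occurrences, 1,000 occurrences, 1,000,000 words.
score = collocation_degree(25, 1_000, 1_000_000)
```

With the other inputs held fixed, the score rises as term_i grows and falls as document_i grows, matching the behaviour described for the ratio term_i/document_i.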
In general, any single word is sparse within the whole corpus. Taking this sparsity into account, the influence of the values of term_i and document_i on the collocation degree is smoothed. In one embodiment, mathematical functions can be used to smooth term_i and document_i. For example, making use of the shapes of the function curves, term_i and document_i are processed by $\sqrt{\mathrm{term\_i}}$ and $\log(\mathrm{allcnt} \div \mathrm{document\_i})$. The value of allcnt is used to correct the formula so as to remove the influence of differing corpus lengths on f(i).
In the embodiments of the invention, the logarithm of the occurrence count document_i is taken because the value of document_i is generally large and the corpus is sparse. Taking the logarithm smooths the influence of document_i on the score and narrows the gap between different orders of magnitude. For example, if the document_i of one collocation word A of a central word is 100000 and the document_i of another collocation word B of the same central word is 1000000, the two differ by a factor of ten; after taking the logarithm, the gap between them on the logarithmic curve is 1. Taking the logarithm thus smooths the document_i curve. In the method provided by the embodiments, term_i matters more to the collocation degree than document_i, so term_i is smoothed with the exponent 1/2, that is, $\mathrm{term\_i}^{1/2}$. Dividing allcnt by document_i in the embodiments moves the log(document_i) curve above the X axis, so that the logarithmic term takes a value greater than zero.
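The smoothing effect described above can be checked numerically. The sketch below uses the two document_i values from the example and assumes a base-10 logarithm (consistent with the stated gap of 1); the total word count `allcnt` is a hypothetical figure chosen only for illustration.

```python
import math

# Occurrence counts from the example: they differ by a factor of ten.
doc_a, doc_b = 100_000, 1_000_000
raw_ratio = doc_b / doc_a
log_gap = math.log10(doc_b) - math.log10(doc_a)

# Hypothetical total word count, larger than either occurrence count.
allcnt = 50_000_000

# Dividing allcnt by document_i before taking the logarithm keeps the
# logarithmic term positive whenever document_i is smaller than allcnt.
factor_a = math.log10(allcnt / doc_a)
factor_b = math.log10(allcnt / doc_b)
```

The tenfold difference in raw counts becomes a difference of 1 between `factor_a` and `factor_b`, and both factors stay above zero; the rarer word receives the larger factor.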
As the above analysis shows, the formula provided by the embodiments of the invention for calculating the collocation degree is not the only possible one. The invention places no restriction on the specific form of the formula, as long as the above purpose can be achieved.
Referring to Fig. 2, the method provided by the embodiments of the invention is described in detail below with a concrete example.
S201: obtain 1,800,000 Chinese sentences as the corpus, perform word segmentation on the corpus, and determine that the current central word is "hacker".
In this embodiment, "hacker" is used as the example central word to describe the method. In practice, no restriction is placed on the content or number of central words; they can be determined according to actual needs. Whether there is one central word or several, the method for determining the collocation degree between collocation words and a central word is the same. This embodiment uses a single central word as the example.
S202: count the total number of words M in the corpus.
S203: determine the collocation words of the central word in the corpus to form the collocation word set.
S204: count, for each collocation word in the collocation word set, its number of occurrences d_i in the corpus and the number of times t_i it occurs in collocation with the central word "hacker".
S205: calculate the collocation degree f(i) of each collocation word in the collocation word set with the central word "hacker" according to Formula 1, written in the notation of this example as Formula 2.
$f(i) = \sqrt{t_i} \times \log(M \div d_i)$ (Formula 2)
For example, in this embodiment the collocation word set of the central word "hacker" contains the following 10 collocation words: "", attack, tool, software, intrusion, invasion, data, attempt, steal, culture. They are listed here in order of the number of times each occurs in collocation with the central word "hacker", from most to fewest. For convenience, the 10 collocation words are numbered 1 to 10. For example, the collocation word "" is numbered 1, the collocation word "invasion" is numbered 5, and so on. Correspondingly, t_5 is the number of times the collocation word "invasion" occurs in collocation with the central word "hacker", d_5 is the number of occurrences of the collocation word "invasion" in the corpus, and f(5) is the collocation degree of the collocation word "invasion" with the central word "hacker". The collocation degree of each of the 10 collocation words with the central word "hacker" is calculated according to Formula 2.
S206: sort the collocation words by their collocation degree, select the top 5, and provide them to the user.
After calculation by Formula 2, the 10 collocation words re-ordered from the highest collocation degree to the lowest are: attack, steal, intrusion, tool, invasion, attempt, data, software, culture. The top 5 collocation words by collocation degree, namely attack, steal, intrusion, tool, and invasion, can be provided to the user with a prompt that these are the words with a high collocation degree with "hacker". These words can help the user grasp information related to the central word more accurately.
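Steps S204 to S206 can be sketched as a small ranking routine. The source gives no actual counts for the "hacker" example, so every number below is hypothetical and chosen only to show how the ordering by raw collocation count can differ from the ordering by collocation degree; the function name `rank_collocations` and the base-10 logarithm are likewise assumptions.

```python
import math

def rank_collocations(stats, total_words, top_n=5):
    """Rank collocation words by sqrt(t_i) * log10(M / d_i), highest first.

    stats: word -> (t_i, d_i), i.e. (occurrences with the central word,
           occurrences in the whole corpus).
    """
    scores = {
        word: math.sqrt(t_i) * math.log10(total_words / d_i)
        for word, (t_i, d_i) in stats.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical counts, NOT taken from the patent.
stats = {
    "attack": (400, 20_000),
    "software": (300, 500_000),
    "steal": (100, 3_000),
    "culture": (50, 400_000),
}
M = 10_000_000  # hypothetical total word count

by_count = sorted(stats, key=lambda w: stats[w][0], reverse=True)
top_two = rank_collocations(stats, M, top_n=2)
```

With these made-up numbers, "software" is second by raw collocation count, but "steal" overtakes it once the overall frequency of each word is taken into account.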
The method provided by the embodiments of the invention first counts the number of occurrences of each collocation word of the central word in the corpus and the number of times it occurs in collocation with the central word, and then performs a secondary calculation on these counts, for example by dividing the number of times the collocation word occurs in collocation with the central word by the number of occurrences of the collocation word in the corpus. This effectively removes from the collocation word set the collocation words with a low collocation degree with the central word and highlights those with a high collocation degree. The collocation words are re-ordered by the calculated collocation degree, and those with a high collocation degree with the central word are provided to the user, which improves the accuracy of determining the collocation words that collocate most closely with the central word.
In practice, once the collocation words with a high collocation degree with the central word have been determined, they can be used widely. For example, for search suggestions, if the user enters "hacker" in the search box, the scheme provided by the invention can be used to determine the collocation words with a high collocation degree with "hacker", giving suggestions such as "hacker attack", "hacker steal", and "hacker intrusion". These can be shown to the user, for example in a drop-down menu, to make further searching easier.
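The search-suggestion use described above can be sketched as a simple lookup. This is a minimal illustration only: the function name `suggest`, the dictionary layout, and the way a suggestion string is formed by joining the query and the collocation word with a space are all assumptions, not details given in the source.

```python
def suggest(query, ranked_collocations, limit=3):
    """Return up to `limit` suggestions for `query`.

    ranked_collocations: central word -> collocation words already sorted
    by collocation degree, highest first (hypothetical precomputed data).
    """
    words = ranked_collocations.get(query, [])
    return [f"{query} {word}" for word in words[:limit]]

# Ranked list for "hacker", following the top of the ordering in the example.
ranked = {"hacker": ["attack", "steal", "intrusion", "tool"]}
suggestions = suggest("hacker", ranked)
```

A query with no precomputed collocation words simply yields an empty list, so the drop-down would show nothing for it.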
The method that the embodiment of the invention provides can also be applied in the mechanical translation field, for example can provide translation result accurately for mechanical translation according to centre word and the collocation word thereof that the method that the embodiment of the invention provides is determined, polysemant, specific word analysis, the analysis of synonym antonym handled in the collocation word that can also utilize centre word and determine, centre word is analyzed etc.The word of for example arranging in pairs or groups is the bilingual corpora of alignment, the Chinese of correspondence and English when namely arranging in pairs or groups word, then can utilize the attribute of centre word and collocation word analysis and statistical collocation word, confirm the meaning of a word of collocation word, for example contain " make " in a sentence to be translated, " make " is a polysemant, and directly translation can not determine should be translated into what content this moment, and just can determine should how to translate this moment " make " according to the collocation word of make this moment.For example, " make " back is pronoun " it ", can know according to the collocation word statistics of " make ", when when pronoun is combined, make generally can be translated as " make, allow ", thereby can improve the accuracy of translation in conjunction with the collocation word statistics of centre word.
In addition, the central word and collocation words determined by the method provided by the embodiment of the invention can also be applied in the field of knowledge discovery, where the attributes of the central word can be analysed. For example, if the central word and the collocation words are all given part-of-speech tags, counting the collocation words reveals the part-of-speech patterns in which the central word collocates, such as V (verb) + V or V + N, thereby providing a basis for studying how parts of speech combine with one another.
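The part-of-speech statistics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pos_patterns`, the tag labels and the sample collocation words are hypothetical and do not come from the patent.

```python
from collections import Counter

def pos_patterns(central_pos, tagged_collocations):
    """Count part-of-speech patterns of the form '<central POS>+<collocation POS>'.

    central_pos: part-of-speech tag of the central word, e.g. "V".
    tagged_collocations: iterable of (collocation_word, pos_tag) pairs.
    """
    return Counter(f"{central_pos}+{pos}" for _, pos in tagged_collocations)

# Hypothetical tagged collocation words for a verb central word.
sample = [("it", "PRON"), ("progress", "N"), ("money", "N"), ("sure", "ADJ")]
patterns = pos_patterns("V", sample)
```

Sorting `patterns.most_common()` would then list the dominant combinations first, which is one simple way to study how parts of speech combine.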
An embodiment of the invention also provides a system for determining the collocation degree between collocation words and a central word. Referring to Fig. 3, the system comprises:
A preprocessing unit 301, configured to obtain a corpus, perform word segmentation on the corpus, determine the central word, and determine the collocation word set of the central word;
A first statistic unit 302, configured to count the number of times each collocation word in the collocation word set occurs in the corpus;
A second statistic unit 303, configured to count the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus;
A computing unit 304, configured to calculate the collocation degree between each collocation word and the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the total number of times the collocation word occurs in the corpus;
A determining unit 305, configured to determine the degree of collocation between each collocation word and the central word on the basis of the calculated collocation degree.
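The counting performed by the two statistic units can be illustrated with a minimal Python sketch. It assumes the corpus has already been segmented into token lists, and it assumes that a word collocates with the central word when it appears within `window` positions of it in the same sentence. This section does not fix that criterion, so the window rule, the function name `count_statistics` and the toy corpus are illustrative assumptions only.

```python
from collections import Counter

def count_statistics(sentences, central_word, window=1):
    """Return (occurrences, cooccurrences) for a pre-segmented corpus.

    occurrences[w]   : how often word w occurs in the corpus.
    cooccurrences[w] : how often w appears within `window` positions
                       of central_word (assumed collocation criterion).
    """
    occurrences = Counter()
    cooccurrences = Counter()
    for tokens in sentences:
        occurrences.update(tokens)
        for pos, token in enumerate(tokens):
            if token != central_word:
                continue
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for j in range(lo, hi):
                if j != pos:
                    cooccurrences[tokens[j]] += 1
    return occurrences, cooccurrences

# Toy segmented corpus (hypothetical data).
corpus = [
    ["hacker", "attack", "detected"],
    ["hacker", "attack", "blocked"],
    ["the", "attack", "failed"],
    ["hacker", "intrusion", "reported"],
]
occurrences, cooccurrences = count_statistics(corpus, "hacker")
```

Here `occurrences` plays the role of the output of the first statistic unit and `cooccurrences` that of the second; the keys of `cooccurrences` form the collocation word set.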
Specifically, referring to Fig. 4, the computing unit 304 comprises:
A first obtaining subunit 401, configured to obtain from the first statistic unit the number of times each collocation word in the collocation word set occurs in the corpus;
A second obtaining subunit 402, configured to obtain from the second statistic unit the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus;
A first computation subunit 403, configured to calculate the collocation degree between each collocation word and the central word by taking the quotient of the number of times the collocation word occurs in collocation with the central word in the corpus and the total number of times the collocation word occurs in the corpus.
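The quotient taken by the first computation subunit can be sketched as follows. The function name `collocation_ratio` and the sample counts are hypothetical; only the division itself is taken from the description.

```python
def collocation_ratio(cooccurrence_count, occurrence_count):
    """Collocation degree as co-occurrences divided by total occurrences."""
    if occurrence_count <= 0:
        raise ValueError("a collocation word must occur at least once in the corpus")
    return cooccurrence_count / occurrence_count

# Hypothetical counts: word -> (times seen with the central word, total occurrences).
counts = {"attack": (2, 3), "intrusion": (1, 1)}
ratios = {word: collocation_ratio(co, total) for word, (co, total) in counts.items()}
```

With these counts, "intrusion" scores 1.0 because every one of its occurrences is next to the central word, while "attack" scores lower because it also appears elsewhere.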
In the prior art, the collocation degree between a collocation word and a central word is generally determined only from the number of times the collocation word occurs in collocation with the central word. Because so few factors are taken into account, the collocation degree determined in this way often does not match reality and is very inaccurate, so the information about the central word that the user obtains through the collocation words is also inaccurate.
In the embodiment of the invention, the collocation degree between each collocation word and the central word is calculated from the number of times the collocation word occurs in collocation with the central word in the corpus and the total number of times the collocation word occurs in the corpus. That is, when the collocation degree is determined, the number of occurrences of the collocation word itself in the corpus is considered in addition to the number of times it occurs in collocation with the central word. The collocation degree determined by the scheme provided by the embodiment of the invention therefore reflects more truthfully how much collocating with the central word matters to the collocation word itself.
Referring to Fig. 5, a system for determining the collocation degree between collocation words and a central word provided by another embodiment of the invention comprises, in addition to the preprocessing unit 301, first statistic unit 302, second statistic unit 303 and determining unit 305 that are identical to those of the system shown in Fig. 3, a third statistic unit 501 and a computing unit 502.
The third statistic unit 501 is configured to count the total number of words in the corpus;
The computing unit 502 is configured to calculate the collocation degree between each collocation word and the central word from the number of times the collocation word occurs in collocation with the central word in the corpus, the total number of times the collocation word occurs in the corpus, and the total number of words in the corpus.
Specifically, referring to Fig. 6, the computing unit 502 in Fig. 5 comprises, in addition to the first obtaining subunit 401 and the second obtaining subunit 402 that are identical to those of the computing unit 304 shown in Fig. 4:
A third obtaining subunit 601, configured to obtain from the third statistic unit the total number of words in the corpus;
A second computation subunit 602, configured to calculate the collocation degree between each collocation word and the central word from the number of times the collocation word occurs in collocation with the central word in the corpus, the total number of times the collocation word occurs in the corpus, and the total number of words in the corpus.
Specifically, the second computation subunit 602 can calculate the collocation degree between each collocation word and the central word according to the following formula:
$$f(i) = \mathrm{term\_i} \times \log\left(\mathrm{allcnt} \div \mathrm{document\_i}\right)$$
where f(i) denotes the collocation degree between a given central word and collocation word i; term_i denotes the number of times collocation word i occurs in collocation with the central word in the corpus; document_i denotes the number of times collocation word i occurs in the whole corpus; and allcnt denotes the total number of words in the corpus.
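A minimal Python sketch of this formula follows. The text does not state the base of the logarithm, so the natural logarithm is assumed here; the function name `collocation_degree` and the sample values are likewise illustrative assumptions.

```python
import math

def collocation_degree(term_i, document_i, allcnt):
    """f(i) = term_i * log(allcnt / document_i).

    term_i     : times collocation word i occurs with the central word.
    document_i : times collocation word i occurs in the whole corpus.
    allcnt     : total number of words in the corpus.
    The natural logarithm is an assumption; the source does not give the base.
    """
    if document_i <= 0 or allcnt <= 0:
        raise ValueError("document_i and allcnt must be positive")
    return term_i * math.log(allcnt / document_i)

# Hypothetical values: 2 co-occurrences, 3 total occurrences, 12 words in the corpus.
score = collocation_degree(term_i=2, document_i=3, allcnt=12)
```

The logarithmic factor shrinks as a collocation word becomes more frequent in the corpus as a whole, so very common words are weighted down even when they co-occur often with the central word; a word whose occurrences equal the total word count would score zero.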
Fig. 4 and Fig. 6 are schematic structural diagrams of the two kinds of computing unit. The computing unit shown in Fig. 4 calculates the collocation degree between each collocation word and the central word mainly by taking the quotient of the number of times the collocation word occurs in collocation with the central word in the corpus and the total number of times the collocation word occurs in the corpus. The computing unit shown in Fig. 6 additionally incorporates the total number of words in the corpus, on top of those two counts, to determine the collocation degree between the central word and the collocation word.
Optionally, the system provided by the embodiment of the invention, whether the system shown in Fig. 3 or the system shown in Fig. 5, may also comprise a providing unit. Taking the system shown in Fig. 3 as an example, the system also comprises a providing unit 306, configured to provide the collocation words of the central word to the user according to the collocation degree. For example, the collocation words are sorted by their collocation degree, and the top-ranked collocation words are selected and offered to the user.
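The sorting and selection step can be sketched as below. The function name `top_collocations`, the cut-off parameter `k` and the scores are hypothetical; the description only requires that the collocation words be ordered by collocation degree and the top-ranked ones be offered.

```python
def top_collocations(degrees, k=3):
    """Return the k collocation words with the highest collocation degree."""
    ranked = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]

# Hypothetical collocation degrees for one central word.
degrees = {"attack": 0.67, "intrusion": 1.0, "the": 0.01, "steals": 0.8}
suggestions = top_collocations(degrees, k=2)
```

In a search-suggestion setting, `suggestions` would be the entries shown in the drop-down menu for the central word.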
The scheme provided by the embodiment of the invention first counts the number of times each collocation word of the central word occurs in the corpus and the number of times it occurs in collocation with the central word, and then performs a secondary calculation on these statistics, for example dividing the number of times the collocation word occurs in collocation with the central word by the number of times the collocation word occurs in the corpus. This effectively removes from the collocation word set the collocation words with a low collocation degree with the central word and makes the collocation words with a high collocation degree stand out. The collocation words are re-sorted according to the calculated collocation degree, and those with a high collocation degree with the central word are offered to the user, improving the accuracy and effectiveness with which the user grasps information related to the central word.
For convenience of description, the device is described in terms of separate units divided by function. Of course, when the invention is implemented, the functions of the units can be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, those skilled in the art can clearly understand that the invention can be implemented by software together with the necessary general-purpose hardware platform. Based on this understanding, the technical scheme of the invention in essence, or the part of it that contributes over the prior art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to carry out the method described in the embodiments of the invention or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments can be understood by cross-reference, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the method embodiments.
The invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.
The invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The invention can also be practised in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
Although the invention has been described by way of embodiments, those of ordinary skill in the art know that the invention has many variations and changes that do not depart from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of the invention.

Claims (8)

1. A method for determining the collocation degree between collocation words and a central word, characterized by comprising:
Obtaining a corpus, performing word segmentation on the corpus, determining the central word, and determining the collocation words of the central word in the corpus to form a collocation word set;
Counting the number of times each collocation word in the collocation word set occurs in the corpus;
Counting the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus;
Calculating the collocation degree between each collocation word and the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the number of times the collocation word occurs in the corpus;
Determining the degree of collocation between each collocation word and the central word on the basis of said collocation degree;
Providing the collocation words of the central word to the user according to the collocation degree;
Wherein providing the collocation words of the central word to the user according to the collocation degree specifically comprises: re-sorting the collocation words in descending order of the calculated collocation degree, and offering the top-ranked collocation words to the user as the collocation words most closely combined with the central word;
In the field of machine translation, said central word and the collocation words most closely combined with the central word are used to provide accurate translation results for machine translation, or said central word and the collocation words most closely combined with the central word are used for handling polysemous words, specific word analysis, synonym and antonym analysis, and central word analysis;
In the field of knowledge discovery, said collocation words most closely combined with the central word are used for analysing the attributes of said central word.
2. The method according to claim 1, characterized in that calculating the collocation degree between each collocation word and the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the total number of times the collocation word occurs in the corpus comprises:
Dividing the number of times the collocation word occurs in collocation with the central word in the corpus by the number of times the collocation word occurs in the corpus to obtain the percentage of the collocation word's occurrences in the corpus that are in collocation with the central word, and using said percentage as the collocation degree between the collocation word and the central word.
3. The method according to claim 1, characterized in that, after obtaining the corpus and determining the central word, and before calculating the collocation degree between each collocation word and the central word, the method further comprises:
Counting the total number of words in the corpus;
Wherein calculating the collocation degree between each collocation word and the central word is specifically:
Calculating the collocation degree between each collocation word and the central word from the number of times the collocation word occurs in collocation with the central word in the corpus, the number of times the collocation word occurs in the corpus, and the total number of words in the corpus.
4. The method according to claim 3, characterized in that calculating the collocation degree between each collocation word and the central word from the number of times the collocation word occurs in collocation with the central word in the corpus, the total number of times the collocation word occurs in the corpus, and the total number of words in the corpus is specifically:
Calculating the collocation degree between each collocation word and the central word according to the following formula:
$$f(i) = \mathrm{term\_i} \times \log\left(\mathrm{allcnt} \div \mathrm{document\_i}\right)$$
where f(i) denotes the collocation degree between a given central word and collocation word i; term_i denotes the number of times collocation word i occurs in collocation with the central word in the corpus; document_i denotes the number of times collocation word i occurs in the whole corpus; and allcnt denotes the total number of words in the corpus.
5. A system for determining the collocation degree between collocation words and a central word, characterized by comprising:
A preprocessing unit, configured to obtain a corpus, determine the central word, and determine the collocation word set corresponding to the central word;
A first statistic unit, configured to count the number of times each collocation word in the collocation word set occurs in the corpus;
A second statistic unit, configured to count the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus;
A computing unit, configured to calculate the collocation degree between each collocation word and the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the total number of times the collocation word occurs in the corpus;
A second determining unit, configured to determine the degree of collocation between each collocation word and the central word on the basis of said collocation degree;
A providing unit, configured to provide the collocation words of the central word to the user according to the collocation degree;
Wherein providing the collocation words of the central word to the user according to the collocation degree specifically comprises: re-sorting the collocation words in descending order of the calculated collocation degree, and offering the top-ranked collocation words to the user as the collocation words most closely combined with the central word;
In the field of machine translation, said central word and the collocation words most closely combined with the central word are used to provide accurate translation results for machine translation, or said central word and the collocation words most closely combined with the central word are used for handling polysemous words, specific word analysis, synonym and antonym analysis, and central word analysis;
In the field of knowledge discovery, said collocation words most closely combined with the central word are used for analysing the attributes of said central word.
6. The system according to claim 5, characterized in that said computing unit comprises:
A first obtaining subunit, configured to obtain from said first statistic unit the number of times each collocation word in the collocation word set occurs in the corpus;
A second obtaining subunit, configured to obtain from said second statistic unit the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus;
A first computation subunit, configured to calculate the collocation degree between each collocation word and the central word by dividing the number of times the collocation word occurs in collocation with the central word in the corpus by the number of times the collocation word occurs in the corpus.
7. The system according to claim 5, characterized by further comprising:
A third statistic unit, configured to count the total number of words in the corpus;
Wherein said computing unit further comprises:
A third obtaining subunit, configured to obtain from said third statistic unit the total number of words in the corpus;
A second computation subunit, configured to calculate the collocation degree between each collocation word and the central word from the number of times the collocation word occurs in collocation with the central word in the corpus, the total number of times the collocation word occurs in the corpus, and the total number of words in the corpus.
8. The system according to claim 7, characterized in that said second computation subunit calculates the collocation degree between each collocation word and the central word according to the following formula:
$$f(i) = \mathrm{term\_i} \times \log\left(\mathrm{allcnt} \div \mathrm{document\_i}\right)$$
where f(i) denotes the collocation degree between a given central word and collocation word i; term_i denotes the number of times collocation word i occurs in collocation with the central word in the corpus; document_i denotes the number of times collocation word i occurs in the whole corpus; and allcnt denotes the total number of words in the corpus.
CN 201010158112 2010-04-22 2010-04-22 Method and system for determining collocation degree of collocations with central word Active CN102236637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010158112 CN102236637B (en) 2010-04-22 2010-04-22 Method and system for determining collocation degree of collocations with central word

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010158112 CN102236637B (en) 2010-04-22 2010-04-22 Method and system for determining collocation degree of collocations with central word

Publications (2)

Publication Number Publication Date
CN102236637A CN102236637A (en) 2011-11-09
CN102236637B true CN102236637B (en) 2013-08-07

Family

ID=44887296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010158112 Active CN102236637B (en) 2010-04-22 2010-04-22 Method and system for determining collocation degree of collocations with central word

Country Status (1)

Country Link
CN (1) CN102236637B (en)

Also Published As

Publication number Publication date
CN102236637A (en) 2011-11-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Free format text: FORMER OWNER: BEIJING JINSHAN DIGITAL ENTERTAINMENT SCIENCE AND TECHNOLOGY CO., LTD.

Effective date: 20140312

Owner name: BEIJING KINGSOFT OFFICE SOFTWARE CO., LTD.

Free format text: FORMER OWNER: BEIJING JINSHAN SOFTWARE CO., LTD.

Effective date: 20140312

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20140312

Address after: Kingsoft No. 33 building, 100085 Beijing city Haidian District Xiaoying Road

Patentee after: Beijing Kingsoft WPS Office Co., Ltd.

Address before: Kingsoft 33 Building No. 100085 Beijing Haidian District City 1 Xiaoying Road West

Patentee before: Beijing Jinshan Software Co., Ltd.

Patentee before: Beijing Jinshan Digital Entertainment Science and Technology Co., Ltd.

C56 Change in the name or address of the patentee
CP01 Change in the name or title of a patent holder

Address after: Kingsoft No. 33 building, 100085 Beijing city Haidian District Xiaoying Road

Patentee after: Beijing Kingsoft office software Limited by Share Ltd

Address before: Kingsoft No. 33 building, 100085 Beijing city Haidian District Xiaoying Road

Patentee before: Beijing Kingsoft WPS Office Co., Ltd.