CN102236637A - Method and system for determining collocation degree of collocations with central word - Google Patents

Method and system for determining collocation degree of collocations with central word

Info

Publication number
CN102236637A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010101581121A
Other languages
Chinese (zh)
Other versions
CN102236637B (en)
Inventor
张宇峰
陈学文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Original Assignee
Beijing Kingsoft Software Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Software Co Ltd, Beijing Jinshan Digital Entertainment Technology Co Ltd filed Critical Beijing Kingsoft Software Co Ltd
Priority to CN 201010158112 priority Critical patent/CN102236637B/en
Publication of CN102236637A publication Critical patent/CN102236637A/en
Application granted granted Critical
Publication of CN102236637B publication Critical patent/CN102236637B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a method and a system for determining the collocation degree between collocation words and a central word. In the scheme provided by the embodiment of the invention, a secondary calculation of the collocation degree is performed on top of two counts: the number of times each collocation word of the central word occurs in a corpus, and the number of times it occurs in collocation with the central word. For example, the number of times a collocation word occurs in collocation with the central word is divided by the number of times the collocation word occurs in the corpus. In this way, collocation words with a low collocation degree with the central word are effectively removed from the collocation word set, and collocation words with a high collocation degree stand out. The collocation words are then re-ranked by the calculated collocation degree, and those ranked highest are the ones most closely combined with the central word.

Description

Method and system for determining the collocation degree between a collocation word and a central word
Technical field
The present invention relates to the field of network technology, and in particular to a method and system for determining the collocation degree between a collocation word and a central word.
Background
With the continuing spread of network technology, the network has become part of every aspect of people's daily work and life, and the amount of available information keeps growing. When acquiring information, the understanding of a word is influenced to a large extent by the words that frequently collocate with it. In the embodiments of this application, for brevity, the word under study is called the central word, and a word that occurs in collocation with the central word is called a collocation word.
Through study of the prior art, the inventors found that, as the amount of information grows and becomes more complex, existing technology cannot accurately determine, from the many words that occur in collocation with a central word, the collocation words that collocate with it most closely.
Summary of the invention
In view of this, the purpose of the embodiments of the invention is to provide a method and system for determining the collocation degree between a collocation word and a central word, so as to accurately determine, from the many words that occur in collocation with the central word, the collocation words that collocate with it most closely.
To achieve the above purpose, the embodiments of the invention provide the following technical scheme:
A method for determining the collocation degree between a collocation word and a central word, comprising:
obtaining a corpus, determining a central word, and determining the collocation word set corresponding to the central word;
counting the number of occurrences in the corpus of each collocation word in the collocation word set;
counting the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus;
calculating the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the number of occurrences of the collocation word in the corpus;
determining, according to the collocation degree, how closely each collocation word collocates with the central word.
Calculating the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the total number of occurrences of the collocation word in the corpus comprises:
dividing the number of times the collocation word occurs in collocation with the central word in the corpus by the number of occurrences of the collocation word in the corpus, to obtain the collocation degree of each collocation word with the central word.
After obtaining the corpus and determining the central word, and before calculating the collocation degree of each collocation word with the central word, the method further comprises:
counting the total number of words in the corpus;
and calculating the collocation degree of each collocation word with the central word is specifically:
calculating the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus, the number of occurrences of the collocation word in the corpus, and the total number of words in the corpus.
Calculating the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus, the total number of occurrences of the collocation word in the corpus, and the total number of words in the corpus is specifically:
calculating the collocation degree of each collocation word with the central word according to the following formula:
$$f(i) = \sqrt{\mathit{term\_i}} \times \log(\mathit{allcnt} \div \mathit{document\_i})$$
where f(i) is the collocation degree between the central word and collocation word i; term_i is the number of times collocation word i occurs in collocation with the central word in the corpus; document_i is the number of times collocation word i occurs in the whole corpus; and allcnt is the total number of words in the corpus.
The above method further comprises:
providing collocation words of the central word to the user according to the collocation degree.
A system for determining the collocation degree between a collocation word and a central word, comprising:
a preprocessing unit, configured to obtain a corpus, determine a central word, and determine the collocation word set corresponding to the central word;
a first statistics unit, configured to count the number of occurrences in the corpus of each collocation word in the collocation word set;
a second statistics unit, configured to count the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus;
a computing unit, configured to calculate the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the total number of occurrences of the collocation word in the corpus;
a second determining unit, configured to determine, according to the collocation degree, how closely each collocation word collocates with the central word.
The computing unit comprises:
a first obtaining subunit, configured to obtain from the first statistics unit the number of occurrences in the corpus of each collocation word in the collocation word set;
a second obtaining subunit, configured to obtain from the second statistics unit the number of times each collocation word in the collocation word set occurs in collocation with the central word in the corpus;
a first computing subunit, configured to divide the number of times the collocation word occurs in collocation with the central word in the corpus by the number of occurrences of the collocation word in the corpus, to obtain the collocation degree of each collocation word with the central word.
The above system further comprises:
a third statistics unit, configured to count the total number of words in the corpus;
and the computing unit further comprises:
a third obtaining subunit, configured to obtain the total number of words in the corpus from the third statistics unit;
a second computing subunit, configured to calculate the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus, the total number of occurrences of the collocation word in the corpus, and the total number of words in the corpus.
The second computing subunit calculates the collocation degree of each collocation word with the central word according to the following formula:
$$f(i) = \sqrt{\mathit{term\_i}} \times \log(\mathit{allcnt} \div \mathit{document\_i})$$
where f(i) is the collocation degree between the central word and collocation word i; term_i is the number of times collocation word i occurs in collocation with the central word in the corpus; document_i is the number of times collocation word i occurs in the whole corpus; and allcnt is the total number of words in the corpus.
The above system further comprises:
an assisting unit, configured to provide collocation words of the central word to the user according to the collocation degree.
As can be seen, the scheme provided by the embodiments of the invention first counts how often each collocation word of the central word occurs in the corpus and how often it occurs in collocation with the central word, and then performs a secondary calculation on these counts, for example dividing the number of times the collocation word occurs in collocation with the central word by the number of occurrences of the collocation word in the corpus. This effectively removes from the collocation word set the collocation words with a low collocation degree with the central word and makes those with a high collocation degree stand out. The collocation words are re-ranked from highest to lowest calculated collocation degree, and the collocation words at the top of the ranking are the ones most closely combined with the central word.
Description of drawings
To illustrate the technical schemes of the embodiments of the invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the method provided by one embodiment of the invention;
Fig. 2 is a flow chart of the method provided by another embodiment of the invention;
Fig. 3 is a structural diagram of the system provided by one embodiment of the invention;
Fig. 4 is a structural diagram of one unit in the system provided by one embodiment of the invention;
Fig. 5 is a structural diagram of the system provided by another embodiment of the invention;
Fig. 6 is a structural diagram of one unit in the system provided by another embodiment of the invention.
Embodiments
To help those skilled in the art better understand the technical scheme of the invention, the technical scheme in the embodiments of the invention is described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention, without creative effort, fall within the scope of protection of the invention.
Referring to Fig. 1, a method for determining the collocation degree between a collocation word and a central word, provided by an embodiment of the invention, comprises:
S101: Obtain a corpus, perform word segmentation on the corpus, determine a central word, and determine the collocation word set corresponding to the central word.
The corpus is the basis of the scheme provided by the embodiment of the invention. It can be collected by the user for a specific field according to the user's own needs, and many public corpora are also available on the network. The language of the corpus may be Chinese, English, or another language; this application places no limit on it. For ease of description, the embodiment of the invention uses a Chinese corpus as the example.
The object studied by the invention is the word, and the embodiment is described using Chinese as the example. After the corpus is obtained, it can first be segmented into words, turning it into a large set of words. In some embodiments, for the convenience of later use, the corpus can also be preprocessed in other ways, for example by removing onomatopoeic words.
After segmentation, the central word to be studied must be determined. The problem the embodiment addresses is to determine, from the corpus, the collocation words that collocate most closely with the central word, so that various applications can be built on those collocation words. The central word can be chosen according to actual needs: whichever word's collocation words are wanted is taken as the central word.
In general, when the corpus is fairly rich, the central word also has many collocation words. In the embodiment of the invention, all collocation words of the central word can be placed in the collocation word set.
The occurrence number of each collocation speech in language material in S102, the statistical collocation set of words;
The number of times that each collocation speech in S103, the statistical collocation set of words and centre word are arranged in pairs or groups in language material and occurred;
In the embodiment of the invention, the collocation speech is the notion relative with centre word.In fact each collocation speech all is an independently speech, and the collocation speech except occurring with the centre word collocation, also may independently occur in language material, and perhaps the speech collocation with other occurs.In the embodiment of the invention, quantity and collocation speech that total speech number, collocation speech in the language material occur are added up respectively with the number of times of centre word collocation appearance.
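The counting in S102 and S103 can be sketched as follows. This is only a minimal illustration under stated assumptions: the text does not define when two words count as occurring in collocation, so the sketch assumes a hypothetical adjacency window (`window=1`); the function name `count_statistics` and the toy token list are likewise assumptions and stand in for a segmented corpus.

```python
from collections import Counter


def count_statistics(tokens, central_word, window=1):
    """Return (total word count, occurrence counts, co-occurrence counts).

    Assumption: a word occurs "in collocation with" the central word when the
    central word appears within `window` positions of it.
    """
    occurrences = Counter(tokens)   # occurrences of every word in the corpus
    cooccurrences = Counter()       # occurrences in collocation with the central word
    for pos, word in enumerate(tokens):
        if word == central_word:
            continue
        left = tokens[max(0, pos - window):pos]
        right = tokens[pos + 1:pos + 1 + window]
        # Count this occurrence once if the central word is inside the window.
        if central_word in left or central_word in right:
            cooccurrences[word] += 1
    return len(tokens), occurrences, cooccurrences


# Toy segmented corpus (hypothetical data, for illustration only).
tokens = "hacker attack hacker tool data tool attack hacker".split()
allcnt, document, term = count_statistics(tokens, "hacker")
print(allcnt, document["attack"], term["attack"], term["tool"])
```

Here `document[w]` plays the role of the occurrence count of a collocation word and `term[w]` the role of its co-occurrence count with the central word.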
S104: Calculate the collocation degree of each collocation word with the central word from the number of times the collocation word occurs in collocation with the central word in the corpus and the number of occurrences of the collocation word in the corpus.
In the embodiment of the invention, the collocation degree of each collocation word with the central word can be calculated by dividing the number of times the collocation word occurs in collocation with the central word in the corpus by the total number of occurrences of the collocation word in the corpus.
S105: Determine, according to the collocation degree, how closely each collocation word collocates with the central word.
In practice, the collocation words can be ranked by their collocation degree with the central word; a collocation word with a higher collocation degree collocates more closely with the central word than one with a lower collocation degree.
When determining the collocation degree between a collocation word and a central word, the prior art generally relies only on the number of times the two occur in collocation. Because so few factors are considered, the collocation degree determined in this way often does not match the actual situation and is very inaccurate.
In the embodiment of the invention, the collocation degree of each collocation word with the central word is calculated from the number of times the collocation word occurs in collocation with the central word in the corpus and the number of occurrences of the collocation word in the corpus. That is, besides the number of co-occurrences, the collocation word's own number of occurrences in the corpus is also considered, so the collocation degree determined by this method reflects more truthfully how much co-occurrence with the central word matters to the collocation word itself. For example, suppose word A occurs 100 times in the corpus, 30 of them in collocation with central word M, and word B occurs 50 times in the corpus, 25 of them in collocation with M. Under the prior-art method, which looks only at the co-occurrence count, A (30 times) is ranked above B (25 times). In fact, 25 of the 50 occurrences of B are in collocation with M, so co-occurrences account for 50% of the occurrences of B in the corpus and the collocation degree of B with M is 50%; co-occurrences of A with M account for only 30% of the occurrences of A in the corpus, so the collocation degree of A with M is 30%. According to the method provided by the embodiment of the invention, the collocation degree of B with M is therefore higher than that of A with M.
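The A/B example above can be checked with a short sketch. The counts are the ones given in the text; the function name `ratio_degree` is an assumption introduced only for illustration.

```python
def ratio_degree(cooccurrence_count, occurrence_count):
    """Collocation degree as co-occurrences divided by total occurrences."""
    return cooccurrence_count / occurrence_count


# (co-occurrences with central word M, occurrences in the corpus), from the text
counts = {"A": (30, 100), "B": (25, 50)}

# Ranking by co-occurrence count alone puts A first.
by_raw_count = sorted(counts, key=lambda w: counts[w][0], reverse=True)
# Ranking by the ratio puts B first.
by_ratio = sorted(counts, key=lambda w: ratio_degree(*counts[w]), reverse=True)

print(by_raw_count)  # ['A', 'B']
print(by_ratio)      # ['B', 'A']
```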
In practice, the collocation degree between a collocation word and the central word can be further optimized with more information. For example, in another embodiment of the invention, the following step can be added after step S101 and before step S105:
Count the total number of words in the corpus.
In general, once the corpus is fixed, the total number of words in it is fixed as well; for the corpus as a whole, the total number of words is effectively a constant.
Once the total number of words in the corpus is known, it can also be taken into account when calculating the collocation degree between a collocation word and the central word. For example, a function can be used to smooth the number of times the collocation word occurs in collocation with the central word and the number of occurrences of the collocation word in the corpus.
Formula 1 is one formula, provided by an embodiment of the invention, for calculating the collocation degree between a collocation word and the central word.
$$f(i) = \sqrt{\mathit{term\_i}} \times \log(\mathit{allcnt} \div \mathit{document\_i})$$ (Formula 1)
where f(i) is the collocation degree between the central word and collocation word i, and term_i is the number of times collocation word i occurs in collocation with the central word in the corpus. In the embodiment of the invention, term denotes the set of all collocation words corresponding to a central word. For example, if the collocation words of the central word "hacker" include "", "tool", and so on, then the number of times "" occurs in collocation with "hacker" in the whole corpus is its term_i. document_i is the number of occurrences of collocation word i in the whole corpus; in the embodiment of the invention, the whole corpus is denoted document. For example, document_i of "tool" is the number of occurrences of the word "tool" in the whole corpus. allcnt is the number of all words in the whole corpus; in the embodiment of the invention, repeated words are counted each time they occur.
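Formula 1 can be written as a small function. This is a sketch under assumptions: the text does not name the base of the logarithm, so base 10 is assumed here, and the function name `collocation_degree` and the input check are illustrative additions.

```python
import math


def collocation_degree(term_i, document_i, allcnt):
    """f(i) = sqrt(term_i) * log(allcnt / document_i), as in Formula 1.

    Assumption: base-10 logarithm (the base is not stated in the text).
    """
    if term_i < 0 or document_i <= 0 or allcnt < document_i:
        raise ValueError("expected 0 <= term_i and 0 < document_i <= allcnt")
    return math.sqrt(term_i) * math.log10(allcnt / document_i)


# Hypothetical counts: 25 co-occurrences, 50 occurrences, 5000 words in total.
print(collocation_degree(25, 50, 5000))  # sqrt(25) * log10(100) = 5 * 2
```

A different logarithm base would rescale every score by the same constant and so would not change the ranking of the collocation words.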
Formula 1 is an extension of term_i/document_i, that is, of the number of times collocation word i occurs in collocation with the central word divided by the number of occurrences of that collocation word in the whole corpus. The higher term_i is, the more often the central word and the collocation word occur in collocation in the corpus; the higher document_i is, the more often the collocation word occurs in the whole corpus. If a collocation word occurs many times in the whole corpus but term_i/document_i is very small, then co-occurrences with the central word are a small share of the occurrences of the collocation word; the collocation word then has little bearing on the central word, and the calculated collocation degree between the two is low.
In general, any given word is sparse in the whole corpus. Taking the sparsity of the corpus into account, the influence of term_i and document_i on the collocation degree is smoothed. In one embodiment, mathematical formulas can be used to smooth term_i and document_i. For example, drawing on the shape of the function curves, term_i and document_i are processed through
$\sqrt{\mathit{term\_i}}$
and log(allcnt ÷ document_i). Here, to remove the influence of differing corpus lengths on f(i), the value of allcnt is used to correct the formula.
In the embodiment of the invention, the logarithm is applied to the occurrence count document_i of a collocation word in the corpus because document_i is generally large and the corpus is sparse; taking the logarithm smooths the influence of document_i on the score and narrows the gap between different orders of magnitude. For example, if document_i of one collocation word A of a central word is 100,000 and document_i of another collocation word B of the same central word is 1,000,000, the two differ by a factor of ten; after taking the logarithm (base 10), the gap between them is only 1, so the logarithm smooths the document_i curve. In the method provided by the embodiment of the invention, term_i matters more to the collocation degree than document_i, so term_i is smoothed with the power 1/2, that is, term_i^(1/2). allcnt is divided by document_i in order to shift the logarithmic curve above the X axis, so that log(allcnt ÷ document_i) is a value greater than zero.
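The smoothing effect described above can be reproduced numerically. The two occurrence counts are the ones from the text; the corpus size `allcnt` is a hypothetical value, base 10 is assumed for the logarithm, and `log_term` is an illustrative name.

```python
import math


def log_term(document_i, allcnt):
    """log(allcnt / document_i); base 10 is an assumption."""
    return math.log10(allcnt / document_i)


allcnt = 10_000_000            # hypothetical total word count
d_a, d_b = 100_000, 1_000_000  # occurrence counts of A and B from the text

ratio = d_b / d_a                                     # raw counts differ tenfold
gap = log_term(d_a, allcnt) - log_term(d_b, allcnt)   # difference after the logarithm

print(ratio, gap)
print(log_term(d_b, allcnt) > 0)  # positive whenever document_i < allcnt
```

The tenfold difference in raw counts shrinks to a difference of about 1 after the logarithm, and dividing allcnt by document_i keeps the logarithmic term non-negative for any word of the corpus.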
As the above analysis shows, the formula provided by the embodiment of the invention for calculating the collocation degree is not the only possible one; the invention places no limit on the specific form of the formula, as long as the above purpose is achieved.
Referring to Fig. 2, the method provided by the embodiment of the invention is described in detail below with a concrete example.
S201: Obtain 1,800,000 Chinese sentences as the corpus, perform word segmentation on the corpus, and determine the current central word to be "hacker".
The embodiment of the invention uses "hacker" as the example central word to describe the method. In practice, neither the content nor the number of central words is limited; they can be determined according to actual needs. Whether there is one central word or several, the method for determining the collocation degree between a collocation word and a central word is the same; this embodiment is described with one central word.
S202: Count the total number of words M in the corpus.
S203: Determine the collocation words of the central word in the corpus to form the collocation word set.
S204: Count the number of occurrences d_i in the corpus of each collocation word in the collocation word set, and the number of times t_i that each collocation word occurs in collocation with the central word "hacker".
S205: Calculate the collocation degree f(i) of each collocation word in the collocation word set with the central word "hacker" according to Formula 1, written in the notation of this example as Formula 2:
$$f(i) = \sqrt{t_i} \times \log(M \div d_i)$$ (Formula 2)
For example, in the embodiment of the invention, the collocation word set of the central word "hacker" contains 10 collocation words: "", attack, tool, software, intrusion, invasion, data, attempt, steal, and culture. Their current order is by the number of times each occurs in collocation with the central word "hacker", from most to fewest. For convenience, these 10 collocation words are numbered 1 to 10. For example, the collocation word "" is numbered 1, the collocation word "invasion" is numbered 5, and so on. Correspondingly, t_5 is the number of times the collocation word "invasion" occurs in collocation with the central word "hacker", d_5 is the number of occurrences of the collocation word "invasion" in the corpus, and f(5) is the collocation degree of the collocation word "invasion" with the central word "hacker". The collocation degree of each of the 10 collocation words with the central word "hacker" is calculated according to Formula 2.
S206: Rank the collocation words by the collocation degree of these 10 collocation words, select the top 5, and provide them to the user.
After calculation with Formula 2, the collocation words re-ranked by calculated collocation degree from highest to lowest are: attack, steal, intrusion, tool, invasion, attempt, data, software, culture. The top 5 by collocation degree, namely attack, steal, intrusion, tool, and invasion, can be provided to the user with a prompt that these words have a high collocation degree with "hacker"; these words help the user grasp information related to the central word more accurately.
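Steps S205 and S206 can be sketched as a scoring pass followed by a sort. The text gives no counts for the "hacker" example, so the numbers below are made-up values chosen only to show how the ranking can change; `collocation_degree`, `top_collocations`, and the base-10 logarithm are assumptions as well.

```python
import math


def collocation_degree(t_i, d_i, m):
    """Formula 2: sqrt(t_i) * log(M / d_i); base 10 is an assumption."""
    return math.sqrt(t_i) * math.log10(m / d_i)


def top_collocations(stats, m, n=5):
    """Rank collocation words by collocation degree and keep the top n."""
    scored = {w: collocation_degree(t, d, m) for w, (t, d) in stats.items()}
    return sorted(scored, key=scored.get, reverse=True)[:n]


M = 1_000_000  # hypothetical total word count
# word: (t_i, d_i); hypothetical counts, not taken from the source
stats = {
    "attack": (100, 10_000),
    "data": (100, 100_000),
    "software": (64, 100_000),
    "steal": (36, 1_000),
}

by_count = sorted(stats, key=lambda w: stats[w][0], reverse=True)
by_degree = top_collocations(stats, M, n=4)
print(by_count)   # ordered by co-occurrence count only
print(by_degree)  # ordered by collocation degree
```

With these invented counts, "steal" co-occurs least often but is rare in the corpus overall, so it moves from last place by raw count to second place by collocation degree.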
The method provided by the embodiment of the invention first counts how often each collocation word of the central word occurs in the corpus and how often it occurs in collocation with the central word, and then performs a secondary calculation on these counts, for example dividing the number of times the collocation word occurs in collocation with the central word by the number of occurrences of the collocation word in the corpus. This effectively removes from the collocation word set the collocation words with a low collocation degree with the central word and makes those with a high collocation degree stand out. The collocation words are re-ranked by the calculated collocation degree, and the collocation words with a high collocation degree with the central word are provided to the user, which improves the accuracy of determining the collocation words that collocate most closely with the central word.
In practice, once the collocation words with a high collocation degree with the central word have been determined, they can be used widely. Search suggestion is one example: if the user types "hacker" in the search box, the scheme provided by the invention is used to determine the collocation words with a high collocation degree with "hacker", giving suggestions such as "hacker attack", "hacker steal", and "hacker intrusion". These can be shown to the user in a drop-down menu or similar, making it convenient for the user to search further.
The method that the embodiment of the invention provided can also be applied in the mechanical translation field, for example can provide translation result accurately for mechanical translation according to centre word and the collocation speech thereof that the method that the embodiment of the invention provided is determined, polysemant, specific word analysis, the analysis of synonym antonym handled in the collocation speech that can also utilize centre word and determine, centre word is analyzed or the like.The speech of for example arranging in pairs or groups is the bilingual corpora of alignment, the Chinese of correspondence and English when promptly arranging in pairs or groups speech, then can utilize the attribute of centre word and collocation speech analysis and statistical collocation speech, confirm the meaning of a word of collocation speech, for example contain " make " in a sentence to be translated, " make " is a polysemant, and directly translation can not determine should be translated into what content this moment, and just can determine should how to translate this moment " make " according to the collocation speech of make this moment.For example, " make " back is pronoun " it ", can know according to the collocation speech statistics of " make ", when combining with pronoun, make generally can be translated as " make, allow ", thereby can improve the accuracy of translation in conjunction with the collocation speech statistics of centre word.
In addition, the central word and collocates determined by the method provided by the embodiment of the invention can be applied in the field of knowledge discovery to analyze the attributes of the central word. For example, if the central word and its collocates are all given part-of-speech tags, counting over the collocates reveals the part-of-speech patterns in which the central word collocates, such as V (verb) + V or V + N, which provides a basis for studying how parts of speech combine with one another.
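Counting part-of-speech patterns could look like the sketch below. It assumes the central word and its collocates have already been tagged; the function name pos_patterns and the sample tags are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def pos_patterns(pairs: List[Tuple[str, str]]) -> Counter:
    """Count patterns such as 'V+N' over (central word POS, collocate POS) pairs."""
    return Counter(f"{central}+{collocate}" for central, collocate in pairs)


# Hypothetical tags for one verb central word and four of its collocates.
tagged_pairs = [("V", "N"), ("V", "N"), ("V", "V"), ("V", "N")]
patterns = pos_patterns(tagged_pairs)
```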
An embodiment of the invention also provides a system for determining the collocation degree of collocates with a central word. Referring to Fig. 3, the system comprises:
A preprocessing unit 301, configured to obtain a corpus, perform word segmentation on the corpus, determine a central word, and determine the collocate set corresponding to the central word;
A first statistic unit 302, configured to count the number of occurrences in the corpus of each collocate in the collocate set;
A second statistic unit 303, configured to count the number of times each collocate in the collocate set occurs in collocation with the central word in the corpus;
A computing unit 304, configured to calculate the collocation degree of each collocate with the central word from the number of times the collocate occurs in collocation with the central word in the corpus and the total number of times the collocate occurs in the corpus;
A determining unit 305, configured to determine the collocation degree of each collocate with the central word on the basis of the calculated collocation degree.
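The counting performed by the first and second statistic units can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes the corpus has already been segmented into lists of tokens, and it assumes that a collocate occurs in collocation with the central word whenever both appear in the same sentence. The function name count_statistics and the sample corpus are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def count_statistics(
    sentences: Iterable[List[str]], central_word: str, collocates: Set[str]
) -> Tuple[Counter, Counter]:
    """Return (occurrences per collocate, co-occurrences with the central word)."""
    occurrences: Counter = Counter()
    cooccurrences: Counter = Counter()
    for tokens in sentences:
        # First statistic: every occurrence of a collocate in the corpus.
        for token in tokens:
            if token in collocates:
                occurrences[token] += 1
        # Second statistic (assumed definition): one co-occurrence per sentence
        # that contains both the collocate and the central word.
        if central_word in tokens:
            for collocate in collocates.intersection(tokens):
                cooccurrences[collocate] += 1
    return occurrences, cooccurrences


corpus = [
    ["hacker", "attack", "server"],
    ["virus", "attack", "computer"],
    ["hacker", "intrusion", "detected"],
]
occ, co = count_statistics(corpus, "hacker", {"attack", "intrusion"})
```

Here "attack" occurs twice in the corpus but only once alongside "hacker", which is the kind of difference the computing unit then exploits.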
Specifically, referring to Fig. 4, the computing unit 304 comprises:
A first obtaining subunit 401, configured to obtain from the first statistic unit the number of occurrences in the corpus of each collocate in the collocate set;
A second obtaining subunit 402, configured to obtain from the second statistic unit the number of times each collocate in the collocate set occurs in collocation with the central word in the corpus;
A first computation subunit 403, configured to calculate the collocation degree of each collocate with the central word by dividing the number of times the collocate occurs in collocation with the central word in the corpus by the total number of times the collocate occurs in the corpus.
In the prior art, the collocation degree of a collocate with a central word is generally determined only from the number of times the two occur in collocation. Because so little is taken into account, the collocation degree obtained in many cases does not match reality and is very inaccurate, so the information about the central word that the user obtains through its collocates is inaccurate as well.
In the embodiment of the invention, the collocation degree of each collocate with the central word is calculated from the number of times the collocate occurs in collocation with the central word in the corpus and the total number of times the collocate occurs in the corpus. That is, in addition to the number of collocated occurrences, the collocate's own number of occurrences in the corpus is taken into account. The collocation degree determined by the scheme provided by the embodiment of the invention therefore reflects more faithfully how much appearing in collocation with the central word matters to the collocate itself.
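The difference between the prior-art ranking and the quotient used here can be illustrated with made-up numbers. In this sketch, the function name quotient_degree and all counts are hypothetical; only the division itself comes from the text.

```python
from typing import Dict, Tuple


def quotient_degree(co_count: int, total_count: int) -> float:
    """Collocated occurrences divided by the collocate's total occurrences."""
    return co_count / total_count if total_count else 0.0


# Hypothetical counts: collocate -> (co-occurrences with the central word,
# total occurrences in the corpus).
counts: Dict[str, Tuple[int, int]] = {"the": (50, 10000), "intrusion": (30, 40)}

# Prior art: rank by raw co-occurrence count only.
by_raw_count = max(counts, key=lambda w: counts[w][0])
# Embodiment: rank by the quotient, which discounts words that are frequent everywhere.
by_quotient = max(counts, key=lambda w: quotient_degree(*counts[w]))
```

With these numbers the raw count prefers "the" (50 against 30), while the quotient prefers "intrusion" (0.75 against 0.005).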
Referring to Fig. 5, another embodiment of the invention provides a system for determining the collocation degree of collocates with a central word. In addition to the preprocessing unit 301, first statistic unit 302, second statistic unit 303 and determining unit 305, which are the same as in the system shown in Fig. 3, this system comprises a third statistic unit 501 and a computing unit 502.
The third statistic unit 501 is configured to count the total number of words in the corpus;
The computing unit 502 is configured to calculate the collocation degree of each collocate with the central word from the number of times the collocate occurs in collocation with the central word in the corpus, the total number of times the collocate occurs in the corpus, and the total number of words in the corpus.
Specifically, referring to Fig. 6, the computing unit 502 of Fig. 5 comprises, in addition to the first obtaining subunit 401 and the second obtaining subunit 402, which are the same as in the computing unit 304 shown in Fig. 4:
A third obtaining subunit 601, configured to obtain from the third statistic unit the total number of words in the corpus;
A second computation subunit 602, configured to calculate the collocation degree of each collocate with the central word from the number of times the collocate occurs in collocation with the central word in the corpus, the total number of times the collocate occurs in the corpus, and the total number of words in the corpus.
Specifically, the second computation subunit 602 may calculate the collocation degree of each collocate with the central word according to the following formula:
f(i) = term_i × log(allcnt ÷ document_i);
where f(i) denotes the collocation degree of a given central word with collocate i; term_i denotes the number of times collocate i occurs in collocation with the central word in the corpus; document_i denotes the number of times collocate i occurs in the whole corpus; and allcnt denotes the total number of words in the corpus.
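The formula can be sketched directly in code. The parameter names follow the symbols in the formula; the function name collocation_degree and the input checks are additions for illustration, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math


def collocation_degree(term_i: int, document_i: int, allcnt: int) -> float:
    """Compute f(i) = term_i * log(allcnt / document_i).

    Assumption: natural logarithm; the source does not specify the base.
    """
    if term_i < 0 or document_i <= 0 or allcnt <= 0:
        raise ValueError("term_i must be >= 0; document_i and allcnt must be > 0")
    return term_i * math.log(allcnt / document_i)


# Hypothetical counts: two collocates that co-occur with the central word
# equally often, one rare in the corpus and one common.
rare = collocation_degree(term_i=30, document_i=40, allcnt=1_000_000)
common = collocation_degree(term_i=30, document_i=4_000, allcnt=1_000_000)
```

For the same number of collocated occurrences, the collocate that is rarer in the corpus as a whole receives the higher collocation degree.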
Fig. 4 and Fig. 6 are schematic structural diagrams of the two kinds of computing unit respectively. The computing unit shown in Fig. 4 calculates the collocation degree of each collocate with the central word mainly by dividing the number of times the collocate occurs in collocation with the central word in the corpus by the total number of times the collocate occurs in the corpus. The computing unit shown in Fig. 6 additionally takes the total number of words in the corpus into account when determining the collocation degree of the central word and the collocate.
Optionally, the system provided by the embodiment of the invention, whether that of Fig. 3 or that of Fig. 5, may further comprise a providing unit. Taking the system shown in Fig. 3 as an example, the system further comprises a providing unit 306, configured to provide the collocates of the central word to the user according to the collocation degree. For example, the collocates are sorted by their collocation degree, and the top-ranked collocates are selected and provided to the user.
The scheme provided by the embodiment of the invention first counts, for each collocate of the central word, its number of occurrences in the corpus and the number of times it occurs in collocation with the central word, and then performs a second calculation on these statistics, for example dividing the number of collocated occurrences by the collocate's number of occurrences in the corpus. This effectively rejects from the collocate set the collocates with a low collocation degree with the central word and highlights those with a high collocation degree. The collocates are re-sorted according to the calculated collocation degree, and those with a high collocation degree with the central word are provided to the user, improving the accuracy and effectiveness with which the user grasps information related to the central word.
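Rejecting weak collocates and re-sorting the rest might be sketched as below. The cut-off parameter min_degree is an assumption introduced for illustration (the text does not say how low-degree collocates are rejected), and the function name and sample degrees are hypothetical.

```python
from typing import Dict, List, Tuple


def prune_and_rank(degrees: Dict[str, float], min_degree: float) -> List[Tuple[str, float]]:
    """Drop collocates below min_degree, then sort the rest, highest degree first."""
    kept = [(word, degree) for word, degree in degrees.items() if degree >= min_degree]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical collocation degrees computed in an earlier step.
degrees = {"attack": 0.40, "the": 0.005, "intrusion": 0.75}
ranked = prune_and_rank(degrees, min_degree=0.1)
```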
For convenience of description, the device above is described with its functions divided into various units. Of course, when implementing the invention, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, those skilled in the art can clearly understand that the invention can be realized by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the invention in essence, or the part of it that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, magnetic disk or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the invention or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for the relevant details, refer to the corresponding description of the method embodiments.
The invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.
The invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
Although the invention has been described by way of embodiments, those of ordinary skill in the art will appreciate that the invention admits many variations and changes that do not depart from its spirit, and it is intended that the appended claims cover such variations and changes.

Claims (10)

1. A method for determining the collocation degree of collocates with a central word, characterized by comprising:
Obtaining a corpus, determining a central word, and determining the collocate set corresponding to the central word;
Counting the number of occurrences in the corpus of each collocate in the collocate set;
Counting the number of times each collocate in the collocate set occurs in collocation with the central word in the corpus;
Calculating the collocation degree of each collocate with the central word from the number of times the collocate occurs in collocation with the central word in the corpus and the number of occurrences of the collocate in the corpus;
Determining the collocation degree of each collocate with the central word on the basis of the calculated collocation degree.
2. The method according to claim 1, characterized in that calculating the collocation degree of each collocate with the central word from the number of times the collocate occurs in collocation with the central word in the corpus and the total number of times the collocate occurs in the corpus comprises:
Dividing the number of times the collocate occurs in collocation with the central word in the corpus by the number of occurrences of the collocate in the corpus to obtain the collocation degree of each collocate with the central word.
3. The method according to claim 1, characterized in that, after obtaining the corpus and determining the central word, and before calculating the collocation degree of each collocate with the central word, the method further comprises:
Counting the total number of words in the corpus;
and calculating the collocation degree of each collocate with the central word is specifically:
Calculating the collocation degree of each collocate with the central word from the number of times the collocate occurs in collocation with the central word in the corpus, the number of occurrences of the collocate in the corpus, and the total number of words in the corpus.
4. The method according to claim 3, characterized in that calculating the collocation degree of each collocate with the central word from the number of times the collocate occurs in collocation with the central word in the corpus, the total number of times the collocate occurs in the corpus, and the total number of words in the corpus is specifically:
Calculating the collocation degree of each collocate with the central word according to the following formula:
f(i) = term_i × log(allcnt ÷ document_i);
where f(i) denotes the collocation degree of a given central word with collocate i; term_i denotes the number of times collocate i occurs in collocation with the central word in the corpus; document_i denotes the number of times collocate i occurs in the whole corpus; and allcnt denotes the total number of words in the corpus.
5. The method according to any one of claims 1-4, characterized by further comprising:
Providing the collocates of the central word to the user according to the collocation degree.
6. A system for determining the collocation degree of collocates with a central word, characterized by comprising:
A preprocessing unit, configured to obtain a corpus, determine a central word, and determine the collocate set corresponding to the central word;
A first statistic unit, configured to count the number of occurrences in the corpus of each collocate in the collocate set;
A second statistic unit, configured to count the number of times each collocate in the collocate set occurs in collocation with the central word in the corpus;
A computing unit, configured to calculate the collocation degree of each collocate with the central word from the number of times the collocate occurs in collocation with the central word in the corpus and the total number of times the collocate occurs in the corpus;
A second determining unit, configured to determine the collocation degree of each collocate with the central word on the basis of the calculated collocation degree.
7. The system according to claim 6, characterized in that the computing unit comprises:
A first obtaining subunit, configured to obtain from the first statistic unit the number of occurrences in the corpus of each collocate in the collocate set;
A second obtaining subunit, configured to obtain from the second statistic unit the number of times each collocate in the collocate set occurs in collocation with the central word in the corpus;
A first computation subunit, configured to calculate the collocation degree of each collocate with the central word by dividing the number of times the collocate occurs in collocation with the central word in the corpus by the number of times the collocate occurs in the corpus.
8. The system according to claim 6, characterized by further comprising:
A third statistic unit, configured to count the total number of words in the corpus;
The computing unit further comprises:
A third obtaining subunit, configured to obtain from the third statistic unit the total number of words in the corpus;
A second computation subunit, configured to calculate the collocation degree of each collocate with the central word from the number of times the collocate occurs in collocation with the central word in the corpus, the total number of times the collocate occurs in the corpus, and the total number of words in the corpus.
9. The system according to claim 8, characterized in that the second computation subunit calculates the collocation degree of each collocate with the central word according to the following formula:
f(i) = term_i × log(allcnt ÷ document_i);
where f(i) denotes the collocation degree of a given central word with collocate i; term_i denotes the number of times collocate i occurs in collocation with the central word in the corpus; document_i denotes the number of times collocate i occurs in the whole corpus; and allcnt denotes the total number of words in the corpus.
10. The system according to any one of claims 6-9, characterized by further comprising:
A providing unit, configured to provide the collocates of the central word to the user according to the collocation degree.
CN 201010158112 2010-04-22 2010-04-22 Method and system for determining collocation degree of collocations with central word Active CN102236637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010158112 CN102236637B (en) 2010-04-22 2010-04-22 Method and system for determining collocation degree of collocations with central word

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010158112 CN102236637B (en) 2010-04-22 2010-04-22 Method and system for determining collocation degree of collocations with central word

Publications (2)

Publication Number Publication Date
CN102236637A true CN102236637A (en) 2011-11-09
CN102236637B CN102236637B (en) 2013-08-07

Family

ID=44887296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010158112 Active CN102236637B (en) 2010-04-22 2010-04-22 Method and system for determining collocation degree of collocations with central word

Country Status (1)

Country Link
CN (1) CN102236637B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699672A (en) * 2013-12-30 2014-04-02 北京百度网讯科技有限公司 Method and device for retrieving example sentences
CN111310481A (en) * 2020-01-19 2020-06-19 百度在线网络技术(北京)有限公司 Speech translation method, device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6278987B1 (en) * 1999-07-30 2001-08-21 Unisys Corporation Data processing method for a semiotic decision making system used for responding to natural language queries and other purposes
US20020010573A1 (en) * 2000-03-10 2002-01-24 Matsushita Electric Industrial Co., Ltd. Method and apparatus for converting expression
CN101196898A (en) * 2007-08-21 2008-06-11 新百丽鞋业(深圳)有限公司 Method for applying phrase index technology into internet search engine
CN101499058A (en) * 2009-03-05 2009-08-05 北京理工大学 Chinese word segmenting method based on type theory
CN101593200A (en) * 2009-06-19 2009-12-02 淮海工学院 Chinese Web page classification method based on the keyword frequency analysis

Also Published As

Publication number Publication date
CN102236637B (en) 2013-08-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Free format text: FORMER OWNER: BEIJING JINSHAN DIGITAL ENTERTAINMENT SCIENCE AND TECHNOLOGY CO., LTD.

Effective date: 20140312

Owner name: BEIJING KINGSOFT OFFICE SOFTWARE CO., LTD.

Free format text: FORMER OWNER: BEIJING JINSHAN SOFTWARE CO., LTD.

Effective date: 20140312

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20140312

Address after: Kingsoft No. 33 building, 100085 Beijing city Haidian District Xiaoying Road

Patentee after: Beijing Kingsoft WPS Office Co., Ltd.

Address before: Kingsoft 33 Building No. 100085 Beijing Haidian District City 1 Xiaoying Road West

Patentee before: Beijing Jinshan Software Co., Ltd.

Patentee before: Beijing Jinshan Digital Entertainment Science and Technology Co., Ltd.

C56 Change in the name or address of the patentee
CP01 Change in the name or title of a patent holder

Address after: Kingsoft No. 33 building, 100085 Beijing city Haidian District Xiaoying Road

Patentee after: Beijing Kingsoft office software Limited by Share Ltd

Address before: Kingsoft No. 33 building, 100085 Beijing city Haidian District Xiaoying Road

Patentee before: Beijing Kingsoft WPS Office Co., Ltd.