CN1086821C - Method for chinese sentence segmentation and its system - Google Patents

Method for Chinese sentence segmentation and its system

Publication number
CN1086821C
Authority
CN
China
Prior art keywords
word
participle
path
shortest
chinese sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN98118413A
Other languages
Chinese (zh)
Other versions
CN1204811A (en
Inventor
张景嵩
张金玉
郑奕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inventec Corp
Original Assignee
Inventec Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inventec Corp
Priority to CN98118413A
Publication of CN1204811A
Application granted
Publication of CN1086821C
Anticipated expiration
Expired - Fee Related

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention relates to a method and a system for segmenting Chinese sentences. First, a Chinese sentence is input; the sentence is a character string composed of a plurality of characters. The character string is matched against the words in a word bank. Next, it is judged whether the matched segmentation path is unique; if so, the segmentation of the Chinese sentence is complete. Otherwise, the shortest of the candidate segmentation paths is selected, and it is judged whether this shortest segmentation path is unique; if so, the segmentation is complete. Otherwise, the word counts corresponding to the shortest segmentation paths are calculated, and the optimal path among the shortest segmentation paths is determined according to a word-frequency bank. Finally, the optimal shortest segmentation path is output.

Description

Method and system for Chinese sentence segmentation
The present invention relates to speech processing technology, and in particular to a method and system that optimize the segmentation of Chinese sentences so as to improve the accuracy of the words obtained after a Chinese sentence is segmented.
Chinese sentence segmentation is an important part of the preprocessing stage of a speech processing program: the character string (composed of a number of characters) that represents the Chinese sentence to be spoken is cut into suitable words. If the segmented string accurately expresses the original meaning, the intonation of the sentence can then be derived from it, and speech processing can produce output whose quality is close to that of a real human voice.
At present, methods for Chinese sentence segmentation include forward maximum matching, reverse maximum matching, bidirectional maximum matching, word-by-word traversal matching, and the segmentation-mark method. Forward maximum matching starts matching at the beginning of the character string, cuts off the longest word that can be matched each time, and repeats this step on the remaining string until the whole string has been segmented. Reverse maximum matching starts matching at the end of the character string, cuts off the longest word that can be matched each time, and repeats this step on the remaining string until the whole string has been segmented. Bidirectional maximum matching combines the two: forward and reverse maximum matching are carried out separately, and if their results differ, another method must be used to resolve the difference. Word-by-word traversal matching searches the whole character string for the longest word, cuts off the longest word that can be matched, and repeats this step on the remaining string until the whole string has been segmented. The segmentation-mark method first finds the characters in the string that can only be the beginning or the end of a word, uses them as segmentation marks to divide the string into shorter strings, and then applies another method to segment those.
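Forward and reverse maximum matching as described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function names, the `max_len` limit and the toy word bank are assumptions, and the sample sentence is a reconstruction of the table-tennis example used later in the description, chosen because the two directions disagree on it.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest match starting from the beginning of the string."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, lexicon, max_len=4):
    """Greedy longest match starting from the end of the string."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]


# Hypothetical toy word bank, chosen only to make the two directions disagree.
LEXICON = {"乒乓球", "乒乓球拍", "拍卖"}
print(forward_max_match("乒乓球拍卖完了", LEXICON))  # ['乒乓球拍', '卖', '完', '了']
print(reverse_max_match("乒乓球拍卖完了", LEXICON))  # ['乒乓球', '拍卖', '完', '了']
```

Bidirectional maximum matching would run both functions and hand the sentence to another method whenever the two results differ, as they do here.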
Because forward and reverse maximum matching match in a single direction, from the beginning and from the end of the string respectively, their time complexity is proportional to the number of characters in the sentence (denoted n, a natural number). However, matching in a single fixed direction cannot guarantee that the segmentation of the whole sentence is optimal. For example, for the sentence "he tells others", forward maximum matching yields "he ∥ says ∥ and removes ∥ ∥ once" (note: throughout this specification, "∥" is the separator between the words obtained after segmentation), and for another example sentence reverse maximum matching yields "he ∥ say ∥ go out the ∥ purpose ∥ of ∥ park ∥". Forward and reverse maximum matching are thus locally optimizing methods that cannot guarantee an optimal segmentation of the whole sentence.
Bidirectional maximum matching analyses the sentence with both forward and reverse maximum matching and can therefore flag some word sequences that may be wrong. For example, for the sentence "he tells others", bidirectional maximum matching yields the two results "he ∥ says ∥ and removes ∥ ∥ once" and "he ∥ say ∥ go out ∥ ∥ once", which must then be resolved by another method. However, it cannot produce a segmentation that neither forward nor reverse maximum matching can produce, so it likewise cannot guarantee that the whole sentence is optimally segmented, and its time complexity is twice that of forward or reverse maximum matching.
Word-by-word traversal matching not only fails to guarantee an optimal segmentation of the whole sentence; its time complexity is also proportional to the square of the number of characters in the sentence (n^2) or even higher, so it is seldom used in practice. The time complexity of the segmentation-mark method is proportional to the number of characters in the sentence (n). However, universally applicable segmentation marks are rare, so dividing the character string into shorter strings in this way has very limited effect.
Therefore, one object of the present invention is to provide a method and system for Chinese sentence segmentation in which the accuracy of the words obtained after segmentation is two orders of magnitude higher than that of unidirectional maximum matching methods such as forward or reverse maximum matching.
Another object of the present invention is to provide a method and system for Chinese sentence segmentation in which the accuracy of the words obtained after segmentation is higher than that of bidirectional maximum matching.
A further object of the present invention is to provide a method and system for Chinese sentence segmentation whose time complexity is proportional to the number of characters in the sentence.
To achieve the above objects, the present invention provides a method for Chinese sentence segmentation. First, a Chinese sentence is input; the sentence is a character string composed of a plurality of characters. Words are then matched in the character string according to a word bank. Next, it is judged whether the segmentation path obtained by matching is unique; if so, the segmentation is complete. If not, the shortest of the segmentation paths is selected, and it is judged whether this shortest segmentation path is unique; if so, the segmentation is complete. If not, the number of words in each shortest segmentation path is calculated, and the best of the shortest segmentation paths is determined according to a word-frequency bank. Finally, the best shortest segmentation path is output.
In addition, the present invention provides a Chinese sentence segmentation system. The system comprises a word bank, a word-frequency bank, an input device, a segmentation processor and an output device. The word bank and the word-frequency bank supply, respectively, the words and the word-frequency data needed for matching during segmentation. The input device provides the input of a Chinese sentence. The segmentation processor receives the sentence and, using the word data supplied by the word bank, performs in sequence the steps of word matching, shortest-path selection and word-count calculation. As soon as the segmentation path obtained in any of these steps is unique, a segmentation result is obtained, and this result is output through the output device.
To make the above and other objects, features and advantages of the present invention more apparent, a preferred embodiment is described in detail below with reference to the accompanying drawings:
Fig. 1 is a block diagram of the Chinese sentence segmentation system according to the present invention;
Fig. 2 is a flowchart of the Chinese sentence segmentation method according to the present invention;
Fig. 3 is a schematic diagram of segmenting the Chinese sentence "立即解决战斗" ("resolve the battle immediately") with the method of the invention; and
Fig. 4 is a flowchart of an embodiment of F(t, N, W).
The Chinese sentence segmentation method of the present invention adopts three main principles: first, segmentation is independent of semantics; second, the segmentation with the fewest words takes priority; third, the likelihood of a word is quantified.
When a person segments a sentence into words, he or she draws on accumulated semantic knowledge and considers the relation between each word and the meaning of the whole sentence. If a computer system were made to consider the meaning of each word and of the whole sentence at the same time, it would easily fall into an endless loop of circular causation. To avoid this, the invention adopts the principle that segmentation is independent of the semantics of the sentence being segmented: when a sentence is segmented into words, the relation between the words and the meaning of the whole sentence is not considered, and only the matching and ordering of words are considered. The invention also uses the fewest-words priority principle: among the possible segmentation results, the one with the fewest words is preferred; in other words, the shortest segmentation path is chosen. Finally, the principle of quantifying word likelihood means that word frequency and word length are used as the measures of likelihood; the word-frequency weighted value of each segmentation path is calculated from them, and the best shortest segmentation path is determined accordingly. A Chinese sentence can therefore be segmented accurately by an ordinary computer processor and its associated hardware, without human intervention.
Fig. 1 is a block diagram of the Chinese sentence segmentation system according to the present invention. The system comprises a segmentation processor 10, an input device 12, a word bank 14, a word-frequency bank 16 and an output device 18. Fig. 2 is a flowchart of the Chinese sentence segmentation method according to the present invention. The method of Fig. 2 is described in detail below with reference to Fig. 1.
Referring to Fig. 2, after the method starts, a Chinese sentence is input from input device 12 in step 20. The flow then proceeds to step 21, in which segmentation processor 10 matches words according to the word data held in word bank 14. Segmentation at this point follows the principle that segmentation is independent of semantics: when the sentence is cut into words, the relation between the words and the meaning of the whole sentence is not considered, only the matching and ordering of words. The matching may yield a single segmentation path or several possible segmentation paths.
Next, step 22 judges whether the segmentation path obtained by the matching of step 21 is unique. If it is, the flow advances to step 26, the segmentation result is output through output device 18, and the segmentation process ends. If it is not, the flow advances to step 23, where, under the fewest-words priority principle, the candidate segmentation paths are compared to find the one with the fewest words, and the shortest segmentation path is selected accordingly.
The flow then advances to step 24, which judges whether the shortest segmentation path selected in step 23 is unique. If it is, the flow goes directly to step 26, the segmentation result is output through output device 18, and the segmentation process ends. If it is not, segmentation processor 10 advances to step 25 and selects the best of the shortest segmentation paths according to word-frequency bank 16. Step 25 follows the word-likelihood quantification principle: word frequency and word length are used as the measures of likelihood, the word-frequency weighted value of each segmentation path is calculated from them, and the best shortest segmentation path is determined accordingly. The best shortest segmentation path selected in step 25 is then output through output device 18, and the segmentation process ends.
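The flow of steps 20 to 26 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the toy word bank and the exhaustive enumeration of candidate paths are illustrative only, the Chinese sentence is a reconstruction of the example used later in the description, and the frequencies are the ones quoted there.

```python
def candidate_paths(text, lexicon, max_word_len=4):
    """Step 21: every way to cut text into lexicon words or single characters."""
    if not text:
        return [[]]
    paths = []
    for k in range(1, min(max_word_len, len(text)) + 1):
        head = text[:k]
        if k == 1 or head in lexicon:
            tails = candidate_paths(text[k:], lexicon, max_word_len)
            paths += [[head] + rest for rest in tails]
    return paths


def segment(text, lexicon, freq, C=7):
    paths = candidate_paths(text, lexicon)               # step 21
    if len(paths) == 1:                                  # step 22
        return paths[0]
    fewest = min(len(p) for p in paths)                  # step 23
    shortest = [p for p in paths if len(p) == fewest]
    if len(shortest) == 1:                               # step 24
        return shortest[0]

    def g(path):                                         # step 25
        # Multi-character words are weighted by the constant C.
        return sum(freq.get(w, 0.0) * (C if len(w) > 1 else 1) for w in path)

    return max(shortest, key=g)                          # result output in step 26


# Hypothetical word bank; frequencies are those quoted in the description.
LEXICON = {"乒乓", "乒乓球", "乒乓球拍", "球拍", "拍卖"}
FREQ = {"乒乓球": 0.00080, "拍卖": 0.00019, "乒乓球拍": 0.00012,
        "卖": 0.01127, "完": 0.03425, "了": 1.81942}
print(segment("乒乓球拍卖完了", LEXICON, FREQ))  # ['乒乓球拍', '卖', '完', '了']
```

The exhaustive enumeration is exponential in the worst case and is used here only to keep the control flow visible; the description's own procedure for step 23 works field by field instead.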
The semantics-independent segmentation principle of step 21, the fewest-words priority principle of step 23 and the word-likelihood quantification principle of step 25 are described in detail below by way of example.
Principle of segmentation independent of semantics
To implement the word matching of step 21, the preferred embodiment of the invention defines a function Ma(t, L, V), in which t is the input parameter and L and V are output parameters. Here t is the number of the node (described in detail below) immediately before a given character of the sentence's character string; L is the length of the longest word that can be matched successfully starting with that character; and V is a variable-length match vector of length L, which can be written V = {V(1), V(2), ..., V(L)}. Specifically, for the L characters between node t and the L-th node after it, V(k) = 1 if a word of length k (k = 1, 2, ..., L) is matched successfully, and V(k) = max if no word of length k is matched. Take the string "乱七八糟" ("in a mess") as an example. Starting with the character "乱", the longest word that can be matched has length four. The character "乱" can be regarded as a single-character word, so V(1) = 1; "乱七" cannot be matched as a word, so V(2) = max; "乱七八" cannot be matched as a word either, so V(3) = max; "乱七八糟" can be matched as a word, so V(4) = 1. Therefore V = {1, max, max, 1}.
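A function in the spirit of Ma(t, L, V) could look like the following. This is a hedged sketch rather than the patent's implementation: the Python signature (returning L and V instead of using output parameters), the `MAX` constant standing in for "max", the `max_word_len` limit and the one-entry word bank are all assumptions.

```python
MAX = float("inf")  # assumed stand-in for the description's "max" marker


def ma(text, t, lexicon, max_word_len=4):
    """Return (L, V) for the character that follows node t (nodes are 1-based)."""
    start = t - 1
    limit = min(max_word_len, len(text) - start)
    # Lengths k for which the k characters after node t form a word;
    # a single character always counts as a word.
    matched = [k for k in range(1, limit + 1)
               if k == 1 or text[start:start + k] in lexicon]
    L = max(matched, default=0)          # L = 0 once t is past the last character
    V = [1 if k in matched else MAX for k in range(1, L + 1)]
    return L, V


print(ma("乱七八糟", 1, {"乱七八糟"}))  # (4, [1, inf, inf, 1])
```

Returning L = 0 when no character follows node t mirrors the end-of-field test used later in the flowchart walk-through.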
Referring to Fig. 3, the semantics-independent segmentation principle is explained with an example sentence. If the sentence to be segmented has n characters, the number of nodes is n+1. The sentence shown in Fig. 3 is "立即解决战斗" ("resolve the battle immediately"), which comprises the six characters "立", "即", "解", "决", "战" and "斗", so there are seven nodes, labelled 1, 2, 3, 4, 5, 6 and 7. As shown in Fig. 3, node 1 lies before "立", node 2 between "立" and "即", node 3 between "即" and "解", node 4 between "解" and "决", node 5 between "决" and "战", node 6 between "战" and "斗", and node 7 after "斗".
As shown in step 21 of Fig. 2, segmentation processor 10 matches words according to the word data held in word bank 14, that is, it segments directly under the semantics-independent principle. It is worth noting that, in the preferred embodiment, every character can be regarded as a single-character word. After step 21, the sentence may therefore be cut along segmentation path 8 in Fig. 3, node 1 → node 3 → node 5 → node 7, giving "立即 ∥ 解决 ∥ 战斗", or along segmentation path 9 in Fig. 3, node 1 → node 3 → node 4 → node 6 → node 7, giving "立即 ∥ 解 ∥ 决战 ∥ 斗". It can of course also be cut character by character along node 1 → node 2 → node 3 → node 4 → node 5 → node 6 → node 7, giving "立 ∥ 即 ∥ 解 ∥ 决 ∥ 战 ∥ 斗", but this result is usually not the shortest path and is therefore not considered.
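The node numbering and the candidate paths can be made concrete with a short sketch. The code below is an illustration only: the function name, the recursive enumeration and the four-word bank are assumptions, and the Chinese sentence is a reconstruction of the Fig. 3 example.

```python
def all_paths(text, lexicon, max_word_len=4):
    """Enumerate every segmentation path as a tuple of 1-based node numbers."""
    n = len(text)
    results = []

    def extend(node, path):
        if node == n + 1:                      # reached the node after the last character
            results.append(tuple(path))
            return
        for k in range(1, min(max_word_len, n + 1 - node) + 1):
            word = text[node - 1:node - 1 + k]
            if k == 1 or word in lexicon:      # single characters always match
                extend(node + k, path + [node + k])

    extend(1, [1])
    return results


# Hypothetical word bank for the example sentence of Fig. 3.
LEXICON = {"立即", "解决", "决战", "战斗"}
paths = all_paths("立即解决战斗", LEXICON)
print((1, 3, 5, 7) in paths)        # path 8 of Fig. 3
print((1, 3, 4, 6, 7) in paths)     # path 9 of Fig. 3
```

Both checks print `True`; the character-by-character path (1, 2, 3, 4, 5, 6, 7) is also among the results, which is why a selection rule is needed.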
For clarity, several terms are first defined. A "section point" is a node that every segmentation path obtained under the semantics-independent principle must pass through when travelling in a given direction from the start node to the end node. A "field" is the string between two adjacent section points. The "segment length" is the number of characters in a field. In Fig. 3, nodes 1, 3 and 7 are section points, "立即" and "解决战斗" are each a field, and the segment lengths of these two fields are two and four respectively.
The section points, fields and segment lengths of Fig. 3 can be expressed with Ma(t, L, V). For section point 1, Ma(1, L, V) gives L equal to two. The character "立" can be regarded as a single-character word, so V(1) = 1; "立即" can also be matched as a word, so V(2) = 1; hence V = {1, 1}. For section point 3, Ma(3, L, V) gives L equal to four. The character "解" can be regarded as a single-character word, so V(1) = 1; "解决" can be matched as a word, so V(2) = 1; "解决战" cannot be matched as a word, so V(3) = max; "解决战斗" can be matched as a word, so V(4) = 1. Therefore V = {1, 1, max, 1}. If t is the end point of a field, the following two conditions must be satisfied:
(1) after Ma(t-1, L, V) is executed, L = 1 and V = {1}; and
(2) for any node t1 with t1 < t, after Ma(t1, L, V) is executed, t1 + L ≤ t.
Condition (1) means that the character before the field end point can be a single-character word but cannot form a word with the following character. Condition (2) means that the character before the field end point can be the end of a word, but cannot form a word together with the preceding and following characters. In this way step 21 can be implemented under the semantics-independent principle, matching words according to word bank 14. Because only the matching and ordering of words are considered when the sentence is cut into words according to word bank 14, the matching may yield a single segmentation path or several possible segmentation paths. Fig. 3 shows two segmentation paths, 8 and 9.
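Section points and fields can be located with a single left-to-right pass. The sketch below is one possible reading of the two conditions, not the patent's code: it checks condition (2) directly by tracking the farthest node reached by any word that starts earlier (condition (1) then follows from the case t1 = t-1). The function names and the word bank are assumptions.

```python
def longest_match(text, t, lexicon, max_word_len=4):
    """Length L of the longest word starting right after node t (1-based)."""
    start = t - 1
    best = 0
    for k in range(1, min(max_word_len, len(text) - start) + 1):
        if k == 1 or text[start:start + k] in lexicon:
            best = k
    return best


def section_points(text, lexicon, max_word_len=4):
    """Nodes that no matched word spans, i.e. t1 + L <= t for every t1 < t."""
    points = [1]
    reach = 1   # farthest node reached by a word starting before the current node
    for t in range(2, len(text) + 2):
        reach = max(reach, (t - 1) + longest_match(text, t - 1, lexicon, max_word_len))
        if reach <= t:
            points.append(t)
    return points


def fields(text, lexicon):
    """Strings between adjacent section points."""
    pts = section_points(text, lexicon)
    return [text[a - 1:b - 1] for a, b in zip(pts, pts[1:])]


# Hypothetical word bank for the Fig. 3 sentence.
LEXICON = {"立即", "解决", "决战", "战斗", "解决战斗"}
print(section_points("立即解决战斗", LEXICON))  # [1, 3, 7]
print(fields("立即解决战斗", LEXICON))          # ['立即', '解决战斗']
```

Each node is visited once and each visit does a bounded amount of matching, which is consistent with the linear time complexity the description claims for this stage.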
Fewest-words priority principle
As shown in Fig. 2, step 22 judges whether the segmentation path obtained by the matching of step 21 is unique. If it is, the flow goes directly to step 26, the segmentation result is output through output device 18, and the segmentation process ends. If it is not, the flow advances to step 23, where, under the fewest-words priority principle, the path with the fewest words is preferred among the candidate segmentation paths; in other words, the shortest path is selected. As shown in Fig. 3, there are two segmentation paths, 8 and 9, so the flow proceeds to step 23 and the shortest segmentation path is selected under the fewest-words priority principle.
The fewest-words priority principle of the invention amounts to solving the shortest-path problem for the segmentation of a Chinese sentence of n characters; in short, it solves the shortest-path problems of K fields whose segment lengths are L1, L2, ..., LK. A function F(t, N, W) can therefore be defined, in which t is the input parameter and N and W are output parameters. Here t is a node number in the sentence's character string. N is a one-dimensional array of two elements, where N[1] is the length of the field that begins at this character and N[2] is the number of words in that field. W is a variable-length array whose dimension is the number of words in the shortest path and whose elements give, in order, the lengths of the corresponding words. The time complexity of this step is proportional to the number of characters in the sentence.
An embodiment of F(t, N, W) is now described with reference to Fig. 4. Besides the symbols introduced above, the flowchart uses several other variables, which are briefly explained below.
r is a counter indicating which element of V is being operated on; when V(r) = 1, r is the length of the successfully matched word;
I is the pointer to the node currently being operated on;
buffer is a segmentation-path buffer used to hold several intermediate segmentation paths;
m is the initial value of W, with m = (1, 1, 1, 1, ...).
Referring to Fig. 4, the fewest-words priority principle is explained with the example sentence "乒乓球拍卖完了" ("the table tennis bats are sold out"). After the judgment using Ma(t, L, V), this sentence can be divided into two fields: 乒乓球拍卖 ∥ 完了. Only the processing of the first field, "乒乓球拍卖", is described below; the other fields can be processed in the same way.
In step 41, the relevant variables I, N and buffer are set to 0, the field start point is set to node t, and W = m = (1, 1, 1, 1, 1).
In step 42, I is increased by 1 so that I+t points into the field being processed, "乒乓球拍卖", and Ma(I+t, L, V) is then executed. For the example field, the result is L = 4, V = (1, 1, 1, 1).
In step 43, it is judged whether the current field has been completely processed, that is, whether L is 0. If the field has not been completely processed, the flow enters step 44; otherwise it goes to step 50. For the example field, L = 4, so the flow enters step 44.
In step 44, it is judged whether the string currently being processed is a single character, that is, whether L is 1. If L is 1, no fewest-words processing is carried out and the flow returns to step 42. In this example L = 4, so the flow enters step 45.
In step 45, the variable r is set to 2, that is, subsequent processing starts from the character after the one the current pointer points to.
In step 46, it is judged whether the current character can be matched as a word together with the relevant characters before it, that is, whether V(r) equals 1.
If the result of step 46 is yes, steps 47 and 48 are entered to carry out the path computation; otherwise the flow goes to step 49.
Before explaining steps 47 and 48, we first introduce the structure of buffer. The variable buffer stores all the paths to be processed (including W, although W is a special path, as explained below); let n be the number of paths stored in it, and let temp[0], temp[1], ..., temp[n-1] denote the n stored paths.
We now express the processing of steps 47 and 48 in pseudo-code.
FOR (i = 0; i < n; i++)    // search all paths in buffer
    IF (temp[i][0] + temp[i][1] + … + temp[i][k]) == I
       && temp[i][k] == 1    // search for the variable k
    THEN
        replace the r elements of temp[i] starting from the k-th element with r, generating a new path;
        store this new path in buffer in place of temp[i];
    ENDIF
In short, the paths stored in buffer are searched for a variable k that corresponds to the current I, and the r elements of the found path starting from the k-th element are then replaced by r, generating a new path. Note that although W is one of the paths in buffer, it is a special path: it is used mainly for comparison when searching for k, and it is not overwritten by the new path.
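One possible Python rendering of steps 47 and 48 is given below. It is an interpretation of the pseudo-code, not the patent's code: the function name `update_paths`, the use of tuples for paths, and keeping W outside the buffer (so that it is read but never overwritten) are assumptions.

```python
def update_paths(buffer, W, I, r):
    """Steps 47-48 for a word of length r that starts at character I (1-based)."""

    def merged(path):
        total = 0
        for k, length in enumerate(path):
            total += length
            # Found k: the prefix up to k sums to I and element k is a single character.
            if total == I and length == 1 and k + r <= len(path):
                # Replace the r elements starting at index k by the single value r.
                return path[:k] + (r,) + path[k + r:]
            if total >= I:
                break
        return None

    result = []
    for path in buffer:                 # ordinary paths are replaced in place
        new = merged(path)
        result.append(new if new is not None else path)
    fresh = merged(W)                   # W is only read, never overwritten
    if fresh is not None and fresh not in result:
        result.append(fresh)
    return result


W = (1, 1, 1, 1, 1)
print(update_paths([], W, 1, 2))  # [(2, 1, 1, 1)]
```

Called with an empty buffer, I = 1 and r = 2, the function creates the path (2, 1, 1, 1) from W, which is the first update in the worked example for the five-character field.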
In step 49, it is judged whether all elements of V have been processed. If so, the flow goes to step 42; otherwise steps 46, 47 and 48 are repeated.
The processing of steps 46, 47 and 48 is now illustrated with the example field "乒乓球拍卖" above.
When the loop of steps 46, 47 and 48 is entered for the first time for the field "乒乓球拍卖", I = 1, W = (1,1,1,1,1), and buffer contains only the path W. Because V(2) = 1, the test of step 46 succeeds. According to the pseudo-code above, only k = 0 satisfies the condition of the IF, so the two elements of W starting from k = 0 are replaced by r = 2; because W itself cannot be updated, a new path (2,1,1,1) is created.
Since L = 4, r is then incremented to 3, which is less than L, so the elements of V have not all been processed and the flow returns to step 46. Because V(3) = 1, steps 47 and 48 are repeated; the result is k = 0, which produces the second path (3,1,1). Then r is incremented to 4, which is still not greater than L, and the flow returns to step 46 once more; because V(4) = 1, steps 47 and 48 are entered again. Here k = 0 is found, which produces the third path (4,1).
Next, because r is incremented to 5, which is greater than L, the flow returns to step 42. After step 42 is executed, I = 2 (operating on "乓"), L = 1 and V = (1). Because L = 1 means that this is a single character, the flow returns to step 42 after step 44.
After step 42 is executed once more, I = 3 (operating on "球"), L = 2 and V = (1,1). The flow again enters steps 45, 46, 47 and 48 and operates on V, buffer and W in exactly the same way as described above, which is not repeated here. After I = 3 is processed, four paths are stored in buffer: (1,1,2,1), (2,2,1), (3,1,1), (4,1). The flow then returns to step 42. After I = 4 is processed, five paths are stored in buffer: (1,1,1,2), (1,1,2,1), (2,2,1), (3,2), (4,1). The flow then returns to step 42 again.
For I = 5, L = 1 and V = (1); because L = 1, the character is a single character and no path processing is carried out, and the flow returns to step 42.
For I = 6, after Ma(I+t, L, V) is executed, L = 0, so the flow goes from step 43 to step 50.
In step 50, all the shortest paths in buffer are selected and stored in W. From the explanation above, buffer holds five paths, of which two are shortest: (3,2) and (4,1).
Therefore, after steps 50 and 51, W holds two paths:
Path 1: W = (3,2), N[1] = 5, N[2] = 2;
Path 2: W = (4,1), N[1] = 5, N[2] = 2.
Here W = (3,2) means that the field is divided into two words of lengths 3 and 2, and W = (4,1) means that the field is divided into two words of lengths 4 and 1.
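The whole walk-through for one field can be reproduced with a compact sketch. This is an interpretation of the Fig. 4 procedure under stated assumptions, not the patent's code: the function names, the tuple representation of paths, the return shape `(N, [W, ...])` and the word bank are hypothetical, and the Chinese field is a reconstruction of the translated example.

```python
def match_lengths(field, i, lexicon, max_word_len=4):
    """Word lengths that match starting at character i (1-based) of the field."""
    start = i - 1
    return [k for k in range(1, min(max_word_len, len(field) - start) + 1)
            if k == 1 or field[start:start + k] in lexicon]


def merge(path, i, r):
    """Replace r single-character entries starting at character i by one entry r."""
    total = 0
    for k, length in enumerate(path):
        total += length
        if total == i and length == 1 and k + r <= len(path):
            return path[:k] + (r,) + path[k + r:]
        if total >= i:
            break
    return None


def shortest_paths(field, lexicon):
    """Return (N, list of shortest W) for one field."""
    base = (1,) * len(field)              # the special all-singles path m
    buffer = []
    for i in range(1, len(field) + 1):    # steps 42 to 49
        for r in match_lengths(field, i, lexicon):
            if r == 1:
                continue                  # single characters need no merging
            buffer = [merge(p, i, r) or p for p in buffer]
            fresh = merge(base, i, r)     # base is read but never replaced
            if fresh is not None and fresh not in buffer:
                buffer.append(fresh)
    candidates = buffer or [base]
    fewest = min(len(p) for p in candidates)            # step 50
    best = sorted(p for p in candidates if len(p) == fewest)
    return (len(field), fewest), best


# Hypothetical word bank consistent with the match vectors in the walk-through.
LEXICON = {"乒乓", "乒乓球", "乒乓球拍", "球拍", "拍卖"}
print(shortest_paths("乒乓球拍卖", LEXICON))  # ((5, 2), [(3, 2), (4, 1)])
```

With this word bank the sketch passes through the same intermediate buffers as the text (three paths after I = 1, four after I = 3, five after I = 4) and ends with the two shortest paths (3,2) and (4,1).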
Word-likelihood quantification principle
After step 23, the selected shortest segmentation path may still not be unique. For example, after step 23 the sentence "乒乓球拍卖完了" can be cut as "乒乓球 ∥ 拍卖 ∥ 完 ∥ 了" or as "乒乓球拍 ∥ 卖 ∥ 完 ∥ 了". Consider the string "乒乓球拍卖": after F(1, N, W) is executed, N[1] equals 5 and N[2] equals 2, and W has two elements, but one value of W is {3, 2} and the other is {4, 1}. Both segmentations are therefore shortest paths, and the word-frequency weighted value must be calculated in step 25 to determine the best shortest segmentation path.
To determine the best shortest segmentation path, the invention uses word-frequency bank 16 to quantify the likelihood of words, taking word frequency and word length as the measures of likelihood. Suppose the character string of the sentence is S = X_1 X_2 X_3 … X_n, where X_1, X_2, X_3, ..., X_n are the n characters of the sentence, and that after step 23 the shortest segmentation path A cuts the string S into k words W_1 W_2 W_3 … W_k, whose frequencies are P_1, P_2, P_3, ..., P_k and whose word lengths are L_1, L_2, L_3, ..., L_k, with L_1 + L_2 + L_3 + ... + L_k = n. A word-frequency weighting function g(S, A) is then defined as follows:
g(S, A) = f(L_1, P_1) + f(L_2, P_2) + f(L_3, P_3) + … + f(L_k, P_k).
This formula is represented the word frequency weighting function with the shortest participle path A cutting word string S, and f (L P) is called word frequency weighting function about word frequency P and the long L of speech.According to the inventive method, word frequency weighting function f (L P) is defined as: if L=1, and f (L, P)=P; If L>1f (L, P)=CP, C is a constant, preferably the positive integer more than 5.Therefore, as long as determine suitable word frequency weighting function f (L 1, P 1), f (L 2, P 2), f (L 3, P 3) ... or f (L k, P k) etc., can calculate the word frequency weighting numerical value of each cutting word, again with the word frequency weighting numerical value addition of each word, just can may the cutting results carry out the ordering of possibility to various, select in the shortest participle path possibility the highest in view of the above.Therefore, step 25 is selected best in the shortest path one according to the word word frequency, and according to preferred embodiment of the present invention, (S A) be of maximum to the word frequency weighting function g that selects exactly to obtain after the word frequency weighting numerical value addition with each word.
Take the sentence mentioned above, "The table tennis bats are sold out", as an example. After the processing of step 23 there are two candidate segmentations: "table tennis ∥ auction ∥ finish ∥ (particle) ∥" and "table tennis bat ∥ sell ∥ finish ∥ (particle) ∥". If the word frequencies of "table tennis", "auction", "finish" and the sentence-final particle are 0.00080, 0.00019, 0.03425 and 1.81942 respectively, and the constant C equals 7, then the word-frequency weighted value of the segmentation "table tennis ∥ auction ∥ finish ∥ (particle) ∥" is:
$$g_1 = 0.00080 \times 7 + 0.00019 \times 7 + 0.03425 + 1.81942 = 1.8606.$$
If the word frequencies of "table tennis bat", "sell", "finish" and the sentence-final particle are 0.00012, 0.01127, 0.03425 and 1.81942 respectively, and the constant C equals 7, then the word-frequency weighted value of the segmentation "table tennis bat ∥ sell ∥ finish ∥ (particle) ∥" is:
$$g_2 = 0.00012 \times 7 + 0.01127 + 0.03425 + 1.81942 = 1.86578.$$
Because $g_2 > g_1$, the segmentation result "table tennis bat ∥ sell ∥ finish ∥ (particle) ∥" is selected for output.
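The arithmetic of this example can be checked with a few lines of Python. The sketch below is illustrative only: the words are written as English glosses, and the boolean flag marking a word as longer than one character is an assumption inferred from which terms are multiplied by C in the two sums above.

```python
C = 7  # weighting constant used in the example

# Each entry: (English gloss, longer than one character?, word frequency).
# The flag is inferred from which terms the example multiplies by C.
path_1 = [
    ("table tennis", True, 0.00080),
    ("auction", True, 0.00019),
    ("finish", False, 0.03425),
    ("(particle)", False, 1.81942),
]
path_2 = [
    ("table tennis bat", True, 0.00012),
    ("sell", False, 0.01127),
    ("finish", False, 0.03425),
    ("(particle)", False, 1.81942),
]


def score(path, c=C):
    """Sum of weighted frequencies: C * P for multi-character words, P otherwise."""
    return sum(c * p if multi else p for _, multi, p in path)


g1 = score(path_1)
g2 = score(path_2)
best = max((path_1, path_2), key=score)  # the path with the larger weighted value
```

Here `best` is `path_2`, in agreement with the comparison of the two weighted values in the text.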
Of course, when the word-frequency weighted value is used as the basis for selecting the best shortest segmentation path, the accuracy depends on the frequency data of each word in the word frequency base. That data comes from statistics on actual language use and is independent of the method of the invention, so how to build the word frequency base is not described here.
In summary, the Chinese sentence segmentation method of the present invention adopts two principles: that word segmentation is independent of the semantics of the sentence, and that the segmentation with the fewest words takes priority. Candidate segmentations are ranked by the running time that cutting requires, and the one with the shortest running time is preferred, which yields the shortest segmentation path. If the shortest segmentation path obtained is not unique, the likelihood of the words is quantified, with word frequency and word length as the parameters that measure likelihood, and the word-frequency weighted value of each shortest segmentation path is calculated. In other words, the shortest paths are ranked by likelihood and the best one is selected. The method of the invention therefore combines a short running time with high segmentation accuracy.
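The overall flow summarized above (match words, keep the paths with the fewest words, then break ties by the word-frequency weighted value) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the names `all_paths`, `weighted_score`, `segment` and `LEXICON` are hypothetical, the exhaustive recursive enumeration is chosen for brevity rather than speed, and the Latin letters in the toy lexicon merely stand in for the characters of the example sentence, with the frequencies taken from the worked example.

```python
from typing import Dict, List


def all_paths(s: str, lexicon: Dict[str, float]) -> List[List[str]]:
    """Enumerate every way to cut s into words that appear in the lexicon."""
    if not s:
        return [[]]
    paths = []
    for end in range(1, len(s) + 1):
        word = s[:end]
        if word in lexicon:
            paths.extend([word] + rest for rest in all_paths(s[end:], lexicon))
    return paths


def weighted_score(path: List[str], lexicon: Dict[str, float], c: int = 7) -> float:
    """g(S, A): P for single-character words, C * P for longer words."""
    return sum(lexicon[w] if len(w) == 1 else c * lexicon[w] for w in path)


def segment(s: str, lexicon: Dict[str, float], c: int = 7) -> List[str]:
    paths = all_paths(s, lexicon)
    if not paths:
        raise ValueError("no segmentation found")
    if len(paths) == 1:  # unique path: done
        return paths[0]
    fewest = min(len(p) for p in paths)
    shortest = [p for p in paths if len(p) == fewest]
    if len(shortest) == 1:  # unique shortest path: done
        return shortest[0]
    # Several shortest paths: pick the one with the largest weighted value.
    return max(shortest, key=lambda p: weighted_score(p, lexicon, c))


# Toy lexicon (hypothetical): letters stand in for the seven characters of the example.
LEXICON = {
    "abc": 0.00080, "de": 0.00019, "f": 0.03425, "g": 1.81942,
    "abcd": 0.00012, "e": 0.01127,
}
```

With this toy lexicon, `segment("abcdefg", LEXICON)` finds two four-word paths and resolves the tie by weighted value, mirroring the example sentence.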
Although the present invention has been described by way of a preferred embodiment, that embodiment is not intended to limit the invention. Those of ordinary skill in the art may make various modifications and variations without departing from the spirit and scope of the invention, so the scope of protection of the invention shall be defined by the appended claims.

Claims (8)

1. A Chinese sentence segmentation method, comprising the following steps:
(a) inputting a Chinese sentence, the Chinese sentence being a character string composed of a plurality of characters;
(b) performing word matching on the character string according to a word storage device, adopting the principle that words are independent of the semantics of the sentence;
(c) judging whether the segmentation path obtained after matching is unique; if so, ending the Chinese sentence segmentation process; otherwise,
(d) selecting the shortest of the segmentation paths, adopting the principle that the fewest words take priority;
(e) judging whether the shortest segmentation path is unique; if so, ending the Chinese sentence segmentation process; otherwise,
(f) according to a word frequency storage device, selecting the most likely of the shortest segmentation paths, the likelihood being determined by a word-frequency weighted value; and
(g) outputting the most likely shortest segmentation path.
2. The method as claimed in claim 1, wherein the word-frequency weighted value is determined by the word length and the word frequency of each word obtained by matching in the shortest segmentation path.
3. The method as claimed in claim 2, wherein, if the word is a single-character word, the word-frequency weighted value consists only of the corresponding word frequency; if the word is not a single-character word, the word-frequency weighted value is the product of the corresponding word frequency and a weighting constant.
4. The method as claimed in claim 3, wherein the weighting constant is a positive integer greater than 5.
5. A Chinese sentence segmentation system, comprising:
a word storage device and a word frequency storage device, which respectively provide the words and the word frequency data required for matching during segmentation processing;
an input device for inputting a Chinese sentence;
a segmentation processing device, which receives the Chinese sentence and, according to the word data in the word storage device, performs in sequence word matching, shortest-path selection and word-frequency weighted value calculation, adopting in the word matching the principle that words are independent of the semantics of the sentence, and in the shortest-path selection the principle that the fewest words take priority; if, during the word matching, the shortest-path selection or the word-frequency weighted value calculation, the resulting segmentation path is unique, that path is taken as the segmentation result; and
an output device for outputting the segmentation result.
6. The Chinese sentence segmentation system as claimed in claim 5, wherein the word-frequency weighted value is determined by the word length and the word frequency of each word obtained by matching in the shortest segmentation path.
7. The Chinese sentence segmentation system as claimed in claim 6, wherein, if the word is a single-character word, the word-frequency weighted value consists only of the corresponding word frequency; if the word is not a single-character word, the word-frequency weighted value is the product of the corresponding word frequency and a weighting constant.
8. The Chinese sentence segmentation system as claimed in claim 7, wherein the weighting constant is a positive integer greater than 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN98118413A CN1086821C (en) 1998-08-13 1998-08-13 Method for chinese sentence segmentation and its system

Publications (2)

Publication Number Publication Date
CN1204811A CN1204811A (en) 1999-01-13
CN1086821C true CN1086821C (en) 2002-06-26

Family

ID=5226041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN98118413A Expired - Fee Related CN1086821C (en) 1998-08-13 1998-08-13 Method for chinese sentence segmentation and its system

Country Status (1)

Country Link
CN (1) CN1086821C (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101379128B1 (en) * 2012-02-28 2014-03-27 라쿠텐 인코포레이티드 Dictionary generation device, dictionary generation method, and computer readable recording medium storing the dictionary generation program
CN103559178A (en) * 2013-05-31 2014-02-05 武汉中文百科网络有限公司 System and method for switching between simplified Chinese characters and traditional Chinese characters on Internet
CN110221707A (en) * 2018-03-01 2019-09-10 北京搜狗科技发展有限公司 A kind of English input method, device and electronic equipment
CN109582972B (en) * 2018-12-27 2023-05-16 信雅达科技股份有限公司 Optical character recognition error correction method based on natural language recognition
CN110705261B (en) * 2019-09-26 2023-03-24 浙江蓝鸽科技有限公司 Chinese text word segmentation method and system thereof

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1152749A (en) * 1996-01-30 1997-06-25 陈肇雄 Fully automatic system for separating Chinese words from sentences

Also Published As

Publication number Publication date
CN1204811A (en) 1999-01-13

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20020626

Termination date: 20100813