CN101071421A - Chinese word cutting method and device - Google Patents

Publication number
CN101071421A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200710102082
Other languages
Chinese (zh)
Inventor
王启明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN 200710102082 priority Critical patent/CN101071421A/en
Publication of CN101071421A publication Critical patent/CN101071421A/en
Pending legal-status Critical Current

Abstract

The invention discloses a Chinese word segmentation method, comprising: assigning a weight to each segmented word in a segmented-word set, the segmented words in the set being sorted by their position in the sentence; starting from the last segmented word in the set, recording the sum of the weight of the current segmented word and the distance from its prior segmented word to the end of the sentence as the distance from the current segmented word to the end of the sentence, and marking the splicing relation between the prior segmented word and the current segmented word, until the distance to the end of the sentence and the splicing relation have been obtained for the first segmented word in the set; starting from the first segmented word in the set, selecting the sentence-initial segmented word with the shortest distance to the end of the sentence, the first character of which is the first character of the sentence; and, starting from that sentence-initial segmented word, successively obtaining the prior segmented words marked in the splicing relations until the sentence ends. The invention also discloses a Chinese word segmentation device. The method and device reduce the complexity of word segmentation.

Description

Chinese word segmentation method and device
Technical field
The present invention relates to the field of Chinese information processing, and in particular to a Chinese word segmentation method and device used in that field.
Background art
In Chinese, the smallest meaningful linguistic unit that can be used independently is the word. A word consists of one or more characters; two-character words are the most common, followed by single-character words, and there are also some multi-character words (such as idioms and proper nouns). Written Chinese, however, uses the character as its basic unit, and there is no symbol comparable to the space in English to mark the boundaries between words. Segmenting each sentence in a Chinese text, that is, having a machine automatically identify the word boundaries in the sentence, is therefore the first problem to be solved in Chinese text analysis and processing.
At present, commonly used segmentation methods include the full segmentation method, the maximum matching method, and the shortest path method. The basic idea of the shortest path method is as follows: according to a dictionary, find all possible words in the sentence and construct a directed acyclic graph from them, in which each word corresponds to one directed edge, and assign each directed edge a length (also called a weight). The segmentation problem is thereby converted into the problem of finding the shortest path from the start node to the end node.
If the end node can be reached from the start node of the graph, a path is said to exist between the two nodes. In general there may be several paths from the start node to the end node, and the number of edges on each path is not necessarily the same. The length of a path equals the sum of the weights of the edges on it; the path with the smallest length between the start and end nodes is called the shortest path, and its length is called the shortest path length. The shortest path problem can be solved with an existing algorithm such as Dijkstra's algorithm.
Once the shortest path has been found, the words corresponding to the edges along it are the segmentation result. Note that the shortest path method may yield more than one shortest path.
The basic principle of the shortest path method has been outlined above; an example follows.
Suppose the sentence S = C_1 C_2 … C_i … C_n, where C_i (i = 1, 2, …, n) is a single character, n is the length of S, and n ≥ 1. A directed acyclic graph with n+1 nodes is built, as shown in Figure 1, with the nodes numbered V_0, V_1, V_2, …, V_n in order.
All possible directed edges of the directed acyclic graph are established by the following two rules:
(1) Between adjacent nodes V_{k-1} and V_k, establish a directed edge <V_{k-1}, V_k> with weight L_k; the word corresponding to the edge defaults to C_k (k = 1, 2, …, n);
(2) If w = C_i C_{i+1} … C_j is a word in the dictionary, where 0 < i < j ≤ n, establish a directed edge <V_{i-1}, V_j> between nodes V_{i-1} and V_j with weight L_w; the word corresponding to the edge is w. In this way, all words contained in S correspond one-to-one to edges of the directed acyclic graph. Using the edge weights marked in the graph, an algorithm for the shortest path problem is invoked to compute the shortest path from the start node V_0 to the end node V_n, which gives the segmentation result for S.
Suppose S is 他说的确实在理 ("what he says really makes sense"). According to the dictionary, all possible words in this sentence are: 他, 说, 的, 的确, 确实, 实在, 在理, 理. The directed acyclic graph formed by these words is shown in Figure 2. For ease of calculation, assume that the weight of every edge is 1. There are then the following paths from start node 0 to end node 7:
Path 1: through nodes 0, 1, 2, 3, 4, 5, 6, 7; this path contains 7 directed edges and its length is 7;
Path 2: through nodes 0, 1, 2, 4, 5, 6, 7; this path contains 6 directed edges and its length is 6;
Path 3: through nodes 0, 1, 2, 4, 6, 7; this path contains 5 directed edges and its length is 5;
Path 4: through nodes 0, 1, 2, 3, 5, 6, 7; this path contains 6 directed edges and its length is 6;
Path 5: through nodes 0, 1, 2, 3, 5, 7; this path contains 5 directed edges and its length is 5;
Comparing the above paths shows that path 3 and path 5 are the shortest. The segmentation result of the sentence is therefore 他/说/的确/实在/理 or 他/说/的/确实/在理.
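The graph construction and shortest-path search described above can be sketched in Python as follows. This is a minimal illustration, not the prior-art implementation: the example sentence is taken here to be 他说的确实在理 in the original Chinese, the names `build_edges` and `shortest_path_length` are hypothetical, and a simple backward relaxation over the already ordered nodes stands in for Dijkstra's algorithm.

```python
SENTENCE = "他说的确实在理"
# Multi-character dictionary words assumed for this example.
DICTIONARY = {"的确", "确实", "实在", "在理"}

def build_edges(sentence, dictionary):
    """Return edges[i] = list of (j, word) for each directed edge <V_i, V_j>."""
    n = len(sentence)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        # Rule (1): adjacent nodes are always joined by a single-character edge.
        edges[i].append((i + 1, sentence[i]))
        # Rule (2): add an edge for every longer dictionary word starting at i.
        for j in range(i + 2, n + 1):
            if sentence[i:j] in dictionary:
                edges[i].append((j, sentence[i:j]))
    return edges

def shortest_path_length(edges, n):
    """Fewest unit-weight edges from node 0 to node n."""
    dist = [0] * (n + 1)  # dist[i] = shortest length from node i to node n
    for i in range(n - 1, -1, -1):
        dist[i] = 1 + min(dist[j] for j, _ in edges[i])
    return dist[0]

edges = build_edges(SENTENCE, DICTIONARY)
print(shortest_path_length(edges, len(SENTENCE)))  # 5
```

With unit weights the shortest length is 5, which agrees with the lengths given for path 3 and path 5.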
As the above analysis shows, the existing shortest path method must convert the original segmented-word set obtained from the dictionary into a directed acyclic graph and must then invoke a shortest path algorithm, which makes the segmentation process complicated and hard to implement.
Summary of the invention
The technical problem to be solved by the present invention is to provide a simple Chinese word segmentation method.
To solve the above technical problem, the object of the invention is achieved through the following technical solutions:
A Chinese word segmentation method, comprising:
assigning weights to the segmented words in a segmented-word set, the segmented words in the set being sorted by their position in the sentence;
starting from the last segmented word in the set, recording the sum of the weight of the current segmented word and the distance from its prior segmented word to the end of the sentence as the distance from the current segmented word to the end of the sentence, and marking the splicing relation between that prior segmented word and the current segmented word, until the distance from the first segmented word in the set to the end of the sentence and the splicing relation between the first segmented word and its prior segmented word have been obtained; wherein the prior segmented word is, among all prior segmented words of the current segmented word, the one with the shortest distance to the end of the sentence;
starting from the first segmented word in the set, selecting the sentence-initial segmented word with the shortest distance to the end of the sentence, the first character of the sentence-initial segmented word being the first character of the sentence;
starting from the sentence-initial segmented word, successively obtaining, according to the splicing relations, the prior segmented word marked in each splicing relation, until the sentence ends.
Preferably, the method further comprises:
performing word segmentation on the sentence to obtain the segmented-word set of the sentence;
sorting the segmented words in the set according to their position in the sentence to obtain the sorted segmented-word set.
Preferably, if at least two prior segmented words of the current segmented word have equal and shortest distances to the end of the sentence, the prior segmented word whose sorting position in the set is earlier is selected from among them, and one splicing relation is marked between the current segmented word and that prior segmented word.
Preferably, if at least two sentence-initial segmented words have equal and shortest distances to the end of the sentence, the sentence-initial segmented word whose sorting position in the set is earlier is selected from among them.
Preferably, the weights assigned to the segmented words in the set are equal or unequal.
A Chinese word segmentation device, comprising:
an assignment unit, configured to assign weights to the segmented words in a segmented-word set, the segmented words in the set being sorted by their position in the sentence;
a recording unit, configured to, starting from the last segmented word in the sorted set, record the sum of the weight of the current segmented word and the distance from its prior segmented word to the end of the sentence as the distance from the current segmented word to the end of the sentence, and mark the splicing relation between that prior segmented word and the current segmented word, until the distance from the first segmented word in the set to the end of the sentence and the splicing relation between the first segmented word and its prior segmented word have been obtained; wherein the prior segmented word is, among all prior segmented words of the current segmented word, the one with the shortest distance to the end of the sentence;
a sentence-initial segmented word selection unit, configured to select, starting from the first segmented word in the set, the sentence-initial segmented word with the shortest distance to the end of the sentence, the first character of the sentence-initial segmented word being the first character of the sentence;
a segmented word selection unit, configured to, starting from the sentence-initial segmented word, successively obtain, according to the splicing relations, the prior segmented word marked in each splicing relation, until the sentence ends.
Preferably, the device further comprises:
a sentence rough segmentation unit, configured to perform word segmentation on the sentence to obtain the segmented-word set of the sentence;
a sorting unit, configured to sort the segmented words in the set according to their position in the sentence to obtain the sorted segmented-word set.
Preferably, if at least two prior segmented words of the current segmented word have equal and shortest distances to the end of the sentence, the recording unit is configured to select from among them the prior segmented word whose sorting position in the set is earlier, and to mark one splicing relation between the current segmented word and that prior segmented word.
Preferably, if at least two sentence-initial segmented words have equal and shortest distances to the end of the sentence, the sentence-initial segmented word selection unit is configured to select from among them the sentence-initial segmented word whose sorting position in the set is earlier.
Preferably, the weights that the assignment unit assigns to the segmented words in the set are equal or unequal.
As can be seen from the above technical solutions, in the segmentation method provided by the embodiments of the invention, the splicing relation between each segmented word in the set and its prior segmented word is obtained in turn, starting from the last segmented word in the set, where the prior segmented word is, among all prior segmented words of the current segmented word, the one with the shortest distance to the end of the sentence. Then, starting from the first segmented word in the set, the prior segmented words marked in the splicing relations are obtained in turn according to those relations, which gives the segmentation result. The method therefore obtains the segmentation result with two loops, which simplifies the segmentation process and reduces its complexity.
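The two loops summarised above might look like the following Python sketch. It rests on several assumptions that are not in the source: each entry of the set is represented as a (start index, text) pair, the function name `segment` is hypothetical, the following words are found by a simple scan, and words from which the end of the sentence cannot be reached are skipped. The example sentence is taken to be 他说的确实在理 in the original Chinese.

```python
def segment(words, sentence_length, weight=lambda text: 1):
    """words: (start, text) pairs sorted by position; returns one segmentation."""
    n = len(words)
    dist = [None] * n  # distance from each word to the end of the sentence
    link = [None] * n  # index of the chosen following word (None at sentence end)

    # Loop 1: walk the sorted set backwards, so following words are already scored.
    for i in range(n - 1, -1, -1):
        start, text = words[i]
        end = start + len(text)
        if end == sentence_length:
            dist[i] = weight(text)
            continue
        best = None
        for j in range(i + 1, n):
            if words[j][0] == end and dist[j] is not None:
                if best is None or dist[j] < dist[best]:
                    best = j  # strict '<' keeps the earlier-sorted word on ties
        if best is not None:
            dist[i] = weight(text) + dist[best]
            link[i] = best

    # Loop 2: pick the best sentence-initial word, then follow the recorded links.
    heads = [i for i in range(n) if words[i][0] == 0 and dist[i] is not None]
    i = min(heads, key=lambda k: dist[k])  # min() returns the first minimum
    result = []
    while i is not None:
        result.append(words[i][1])
        i = link[i]
    return result

WORDS = [(0, "他"), (0, "他说"), (1, "说"), (1, "说的"), (2, "的"), (2, "的确"),
         (3, "确"), (3, "确实"), (4, "实"), (4, "实在"), (5, "在"), (5, "在理"),
         (6, "理")]
print(segment(WORDS, 7))
```

Passing a different `weight` function is one way to model the unequal weights that the text allows.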
Description of drawings
Figure 1 shows the directed acyclic graph used in the shortest path method;
Figure 2 shows the directed acyclic graph obtained for the sentence 他说的确实在理;
Figure 3 is a flow chart of the Chinese word segmentation method provided by an embodiment of the invention;
Figure 4 is a schematic diagram of the composition of the Chinese word segmentation device provided by an embodiment of the invention.
Embodiment
The Chinese word segmentation method provided by the embodiments of the invention selects m words from a segmented-word set containing n words, m ≤ n, such that the m words joined end to end form the complete sentence with no superfluous characters. The segmented-word set therefore generally refers to a fairly detailed set that contains redundancy. Each word in the set may also be called a segmented word.
Here, a fairly detailed segmented-word set containing redundancy generally means the set obtained by applying the full segmentation method to a sentence. The full segmentation method extracts every word in the sentence that matches a dictionary entry.
For example, for the sentence 他说的确实在理, the segmented-word set obtained by the full segmentation method may be: 他, 说, 的, 的确, 确, 确实, 实, 实在, 在, 在理, 理. This set contains essentially every word of the sentence that appears in the dictionary, so it is a fairly detailed set. Moreover, because 确 can combine with the adjacent 的 and 实 to form the two words 的确 and 确实, 确 can be called a redundant segmented word. It follows that the set obtained by the full segmentation method is generally fairly detailed and contains redundancy.
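A minimal Python sketch of the full segmentation described above follows. The function name `full_segmentation`, the small dictionary, and the choice to always keep single characters are assumptions made for illustration; the example sentence is taken to be 他说的确实在理 in the original Chinese.

```python
def full_segmentation(sentence, dictionary):
    """Return every (start, word) found in the sentence, ordered by position."""
    words = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            piece = sentence[start:end]
            # Assumption: single characters are always kept;
            # longer pieces must appear in the dictionary.
            if end - start == 1 or piece in dictionary:
                words.append((start, piece))
    return words

DICTIONARY = {"的确", "确实", "实在", "在理"}  # hypothetical dictionary entries
result = full_segmentation("他说的确实在理", DICTIONARY)
print([word for _, word in result])
```

Because the outer loop runs over start positions, the returned set is already sorted by position in the sentence.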
The method provided by the embodiments operates on such a fairly detailed, redundant segmented-word set. For ease of description, every segmented-word set mentioned below in the embodiments refers to such a set.
The method provided by the embodiments is described in detail below with reference to the drawings and specific embodiments.
An embodiment of the invention provides a Chinese word segmentation method which, as shown in Figure 3, comprises:
Step 301: assign a weight to each segmented word in the segmented-word set; the segmented words in the set are sorted by their position in the sentence;
In a specific implementation, unequal weights may be assigned to the segmented words or, for ease of calculation, equal weights may be assigned; neither choice affects the implementation of the embodiment.
The segmented-word set in step 301 is the fairly detailed set containing redundancy, and the segmented words in it must be sorted by their position in the sentence.
If the segmented words in the set are already sorted by their position in the sentence, the method need not include a step that sorts the set. For example, for the sentence 他说的确实在理, the set obtained from the dictionary is 他, 说, 的, 的确, 确实, 实在, 在理, 理, which is already ordered by the position of each segmented word in the sentence.
If the segmented words in the set are not sorted by their position in the sentence, the following step must be performed before step 301:
Sorting step: sort the segmented words in the set according to the position of each segmented word in the sentence.
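The sorting step can be sketched in Python as below. The (start index, word) representation, the function name `sort_by_position`, and the secondary rule that a shorter word precedes a longer one starting at the same character are assumptions; the text only requires ordering by position in the sentence.

```python
# Hypothetical unsorted set: (start index in the sentence, word) pairs.
unsorted_set = [(5, "在理"), (0, "他"), (3, "确实"), (1, "说"), (6, "理"),
                (2, "的确"), (2, "的"), (4, "实在")]

def sort_by_position(word_set):
    # Primary key: start position. Secondary key (an assumption): shorter word first.
    return sorted(word_set, key=lambda item: (item[0], len(item[1])))

sorted_set = sort_by_position(unsorted_set)
print([word for _, word in sorted_set])
```

The secondary key was chosen because it reproduces the order in which the example sets in the text are listed.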
Step 302: starting from the last segmented word in the set, obtain the distance to the end of the sentence of every prior segmented word of the current segmented word, and choose from among them the prior segmented word with the shortest distance to the end of the sentence;
Here, a prior segmented word is a segmented word that, in the written order of the sentence, comes after the current segmented word and is immediately adjacent to it; because the traversal starts from the end of the sentence, it has already been processed.
Because the segmented words in the set are sorted by their position in the sentence, the last segmented word in the set referred to in the embodiments is the segmented word that ends the sentence.
For example, the segmented-word set of the sentence 他说的确实在理 is 他, 说, 的, 的确, 确实, 实在, 在理, 理, sorted by position in the sentence. The last segmented word in this set is clearly 理, and 理 is the sentence-final segmented word of the sentence.
As a further example, suppose the segmented-word set of the sentence 他说的确实在理 is 他, 说, 的, 的确, 确实, 实在, 在, 在理, 理. Then the prior segmented word of 他 is 说; the prior segmented word of 的确 is 实在; and 确实 has two prior segmented words, 在 and 在理. 在理 and 理 are sentence-final words of the sentence and therefore have no prior segmented word; in a specific implementation, the prior segmented word of a sentence-final word can be defined as the end-of-sentence marker.
As the above example shows, the current segmented word may have several prior segmented words, so step 302 may also select several prior segmented words that share the shortest distance to the end of the sentence.
Step 303: add the weight of the current segmented word to the distance to the end of the sentence of the prior segmented word selected in step 302, which gives the distance from the current segmented word to the end of the sentence; record this distance, and mark the splicing relation between the current segmented word and its prior segmented word with the shortest distance to the end of the sentence;
The splicing relation in the embodiments is the adjacency relation between two segmented words, that is, the relative positions that the current segmented word and its prior segmented word occupy when the sentence is reconstructed. For example, let the current segmented word be 确实, which has two prior segmented words, 在理 and 在. Suppose the distance from 在理 to the end of the sentence is less than that from 在. The distance from 确实 to the end of the sentence equals its weight plus the distance from 在理 to the end of the sentence; suppose this distance is 2. The splicing relation between 确实 and 在理 can then be marked as [确实 2]-在理. From this relation one can read the distance from 确实 to the end of the sentence and also see that, when the sentence is reconstructed, 在理 comes immediately after 确实.
In the above example the distance from the current segmented word to the end of the sentence is recorded in the splicing relation. In other embodiments of the invention the distance can be recorded separately; where the distance is recorded does not affect the implementation of the embodiment.
As steps 302 and 303 show, the distance from the current segmented word to the end of the sentence equals the weight of the current segmented word plus the distance from its prior segmented word to the end of the sentence. Calculating the distance for the current segmented word therefore requires the distance for its prior segmented word to be known, so the calculation must start from the last segmented word in the set, that is, from the sentence-final segmented word. Because a sentence-final segmented word has no prior segmented word, the distance from the last segmented word in the set to the end of the sentence equals its own weight. It follows that the distance from the current segmented word to the end of the sentence is in fact the accumulated weight of several segmented words.
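The relationship described above can be written as a recurrence. The notation is introduced here for illustration only: d(w) is the distance from word w to the end of the sentence, L_w is the weight of w (as in the background section), and P(w) is the set of words that immediately follow w in the sentence.

```latex
d(w) =
\begin{cases}
  L_w, & \text{if } w \text{ ends the sentence},\\[4pt]
  L_w + \min\limits_{p \in P(w)} d(p), & \text{otherwise}.
\end{cases}
```

With unit weights, a word followed by two candidates that are each at distance 2 from the end of the sentence is therefore at distance 1 + min(2, 2) = 3.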
If step 302 selects several prior segmented words that share the shortest distance to the end of the sentence, step 303 may proceed in one of three ways: mark one splicing relation between the current segmented word and each of these prior segmented words; or select any one of them and mark one splicing relation between the current segmented word and it; or, to improve segmentation accuracy, select the one whose sorting position in the set is earliest and mark one splicing relation between the current segmented word and it.
Step 304: determine whether the current segmented word in step 303 is the first segmented word in the set; if so, go to step 305; if not, return to step 302;
The execution of steps 302 to 304 is illustrated as follows. Suppose that, according to the dictionary, the fairly detailed, redundant segmented-word set obtained for the sentence 他说的确实在理 is 他, 他说, 说, 说的, 的, 的确, 确, 确实, 实, 实在, 在, 在理, 理, sorted by position in the sentence, and that step 301 assigns every segmented word the same weight, 1.
The last segmented word in the set is 理, so step 302 starts from 理, which is now the current segmented word. Because 理 is a sentence-final word, it has no prior segmented word; its distance to the end of the sentence therefore equals its own weight, 1, and its splicing relation can be marked as [理 1]-end. Because the current segmented word 理 is not the first segmented word in the set, return to step 302;
The current segmented word is now 在理. 在理 is also a sentence-final word and has no prior segmented word, so its distance to the end of the sentence equals its own weight, 1, and its splicing relation can be marked as [在理 1]-end. Likewise, return to step 302;
The current segmented word is now 在. Its only prior segmented word is 理, so its distance to the end of the sentence equals its weight plus the distance from 理 to the end of the sentence, which is 2. The splicing relation between 在 and 理 can be marked as [在 2]-理. Return to step 302;
The current segmented word is now 实在. Its only prior segmented word is 理, so its distance to the end of the sentence equals its weight plus the distance from 理 to the end of the sentence, which is 2. The splicing relation between 实在 and 理 can be marked as [实在 2]-理. Return to step 302;
The current segmented word is now 实, which has two prior segmented words, 在理 and 在. From the earlier calculation, the distance from 在理 to the end of the sentence is 1 and the distance from 在 is 2. Because the distance for 在理 is smaller, the distance from 实 to the end of the sentence equals its weight plus the distance for 在理, which is 2. The splicing relation between 实 and 在理 can be marked as [实 2]-在理. Return to step 302;
The current segmented word is now 确实, which has two prior segmented words, 在理 and 在. Because the distance for 在理 is smaller than that for 在, the distance from 确实 to the end of the sentence equals its weight plus the distance for 在理, which is 2. The splicing relation between 确实 and 在理 can be marked as [确实 2]-在理. Return to step 302;
The current segmented word is now 确, which has two prior segmented words, 实 and 实在. From the earlier calculation, both are at distance 2 from the end of the sentence, so the distance from 确 equals its weight plus the distance for 实 or 实在, which is 3. If only one splicing relation is to be marked for the current segmented word, then to improve segmentation accuracy the one that sorts earlier in the set, 实, is chosen, and the splicing relation can be marked as [确 3]-实; return to step 302. Alternatively, if the number of splicing relations is not limited, two can be marked for 确: [确 3]-实 and [确 3]-实在; then return to step 302;
The current segmented word is now 的确, which has two prior segmented words, 实 and 实在, at equal distance from the end of the sentence. The distance from 的确 equals its weight plus the distance for 实 or 实在, which is 3. Likewise, if only one splicing relation is to be marked for 的确, then to improve segmentation accuracy the one that sorts earlier in the set, 实, is chosen, and the splicing relation can be marked as [的确 3]-实; then return to step 302. If the number of splicing relations for 的确 is not limited, two can be marked: [的确 3]-实 and [的确 3]-实在; then return to step 302;
Similarly, the splicing relation between 的 and its prior segmented word is [的 3]-确实; for 说的 it is [说的 3]-确实; for 说 it is [说 4]-的; for 他说 it is [他说 4]-的; and for 他 it is [他 4]-说的. Once the distance from 他 to the end of the sentence and the splicing relation between 他 and its prior segmented word have been obtained, step 304 determines that the current segmented word 他 of step 303 is the first segmented word in the set, and the method proceeds to step 305;
Step 305: first participle from minute set of words begins, and obtains the participle that can be used as the beginning of the sentence participle, therefrom chooses the shortest participle of the distance of a tail as the beginning of the sentence participle, and first word of beginning of the sentence participle is first word of sentence;
Wherein, the participle that can be used as the beginning of the sentence participle has a plurality of, so in a plurality of participles that can be used as the beginning of the sentence participle, to the sentence tail the shortest participle of distance may also have a plurality of, in the time only need exporting a word segmentation result, in order to improve the accuracy of participle, can be from above-mentioned a plurality of to the shortest participle of the distance of sentence tail, be chosen in a forward participle of sorting position in the branch set of words as the beginning of the sentence participle, perhaps select a participle arbitrarily, do not influence the realization of the embodiment of the invention as the beginning of the sentence participle; If can export a plurality of word segmentation result, then can select above-mentioned a plurality of the shortest participle of distance to the sentence tail.
Still with the branch set of words of sentence " he say really reason " really " he, he say, say, say,, really, really, really, in fact, really,, resonable, manage " be example, the participle that can be used as the beginning of the sentence participle has two: " he " and " he says ", wherein, " he " is 4 to the distance of sentence tail, " he says " also is 4 to the distance of sentence tail, if only need to export a word segmentation result this moment, and need to guarantee the accuracy of participle, then be chosen in the forward participle of ordering in the branch set of words as the beginning of the sentence participle, promptly select " he " as the beginning of the sentence participle; If allow a plurality of word segmentation result of output, then " he " and " he says " all can be chosen.
Step 306: Starting from the sentence-initial segment obtained in step 305, follow the splicing relations recorded in step 303, obtaining in turn the preceding segment marked in each splicing relation, until the end of the sentence is reached.
Continuing with the segment set above, suppose the sentence-initial segment selected in step 305 is "他" ("he"). The splicing relation recorded for "他" in step 303 is [他 4]-说的, so the preceding segment of "他" is "说的" ("say"). The splicing relation of "说的" is [说的 3]-确实, so the preceding segment of "说的" is "确实" ("really"). The splicing relation of "确实" is [确实 2]-在理, so the preceding segment of "确实" is "在理" ("reasonable"). The splicing relation of "在理" is [在理 1]-end, so "在理" is followed by the end of the sentence. The segmentation result finally obtained is therefore: 他-说的-确实-在理 (he-say-really-reasonable).
If both "他" and "他说" ("he says") are selected as sentence-initial segments in step 305, two segmentation results are obtained: 他-说的-确实-在理 and 他说-的-确实-在理.
As these results show, the segments in a segmentation result, joined end to end, form exactly "他说的确实在理", with no redundant characters.
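Steps 305 and 306 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Record` type and the functions `pick_start` and `trace` are hypothetical names, the Chinese tokens are an assumed reading of the example sentence (他说的确实在理, glossed he-say-really-reasonable), and the distances and links are the thirteen splicing relations given for this example with every weight equal to 1.

```python
from typing import List, NamedTuple, Optional

class Record(NamedTuple):
    text: str            # the segment
    start: int           # offset of its first character in the sentence
    distance: int        # recorded distance to the end of the sentence
    link: Optional[int]  # index of the marked preceding segment; None = end of sentence

# Segment set in sorted order, with the relations recorded for the example.
RECORDS = [
    Record("他", 0, 4, 3), Record("他说", 0, 4, 4), Record("说", 1, 4, 4),
    Record("说的", 1, 3, 7), Record("的", 2, 3, 7), Record("的确", 2, 3, 8),
    Record("确", 3, 3, 8), Record("确实", 3, 2, 11), Record("实", 4, 2, 11),
    Record("实在", 4, 2, 12), Record("在", 5, 2, 12), Record("在理", 5, 1, None),
    Record("理", 6, 1, None),
]

def pick_start(records: List[Record]) -> int:
    """Step 305: among segments that begin the sentence, take the shortest distance.

    min() returns the first minimum it sees, so a tie goes to the segment
    that ranks earlier in the sorted segment set.
    """
    candidates = [i for i, r in enumerate(records) if r.start == 0]
    return min(candidates, key=lambda i: records[i].distance)

def trace(records: List[Record], i: Optional[int]) -> List[str]:
    """Step 306: follow the marked links until the end of the sentence."""
    result = []
    while i is not None:
        result.append(records[i].text)
        i = records[i].link
    return result

print("-".join(trace(RECORDS, pick_start(RECORDS))))  # 他-说的-确实-在理
```

Calling `trace(RECORDS, 1)` instead starts from "他说" and yields the second result mentioned above, 他说-的-确实-在理.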
An embodiment of the present invention also provides a Chinese word segmentation device which, as shown in Figure 4, comprises:
An assignment unit 401, configured to assign weights to the segments in a segment set, the segments in the set being sorted according to their positions in the sentence;
The weights assigned to the segments in the segment set may be equal or unequal. The segment set used by the assignment unit may be a relatively fine-grained, redundant segment set obtained by segmenting the sentence with some segmentation method, such as maximum matching or the shortest-path method. In general, the segment set is obtained by looking up in a dictionary every word that occurs in the sentence.
A recording unit 402, configured to, starting from the last segment in the segment set, record the sum of the weight of the current segment and the distance from its preceding segment to the end of the sentence as the distance from the current segment to the end of the sentence, and mark the splicing relation between the current segment and its preceding segment, until the distance from the first segment in the segment set to the end of the sentence and the splicing relation between that segment and its preceding segment have been obtained. Here, the preceding segment is the one, among all preceding segments of the current segment, with the shortest distance to the end of the sentence;
If the current segment has at least two preceding segments whose distances to the end of the sentence are equal and shortest, the recording unit 402 may mark a splicing relation between the current segment and each of those preceding segments;
Alternatively, the recording unit 402 is configured to select any one of the preceding segments with the shortest distance to the end of the sentence, and mark one splicing relation between the current segment and that preceding segment;
Alternatively, to improve segmentation accuracy, the recording unit 402 selects, from the preceding segments with the shortest distance to the end of the sentence, the one that ranks earliest in the segment set, and marks one splicing relation between the current segment and that preceding segment. This choice does not affect the implementation of this embodiment.
A sentence-initial segment selection unit 403, configured to, starting from the first segment in the segment set, obtain the segments that can serve as the sentence-initial segment, and choose from them the one with the shortest distance to the end of the sentence as the sentence-initial segment;
Here, the first character of a segment that can serve as the sentence-initial segment is the first character of the sentence;
If at least two candidate sentence-initial segments have equal and shortest distances to the end of the sentence, the sentence-initial segment selection unit 403 may select all of them;
Alternatively, the sentence-initial segment selection unit 403 selects any one of the candidates with the shortest distance to the end of the sentence as the sentence-initial segment;
Alternatively, to improve segmentation accuracy, the sentence-initial segment selection unit 403 may select, from the candidates with the shortest distance to the end of the sentence, the one that ranks earliest in the segment set as the sentence-initial segment. This choice does not affect the implementation of this embodiment.
A segment selection unit 404, configured to, starting from the sentence-initial segment selected by the sentence-initial segment selection unit 403, follow the splicing relations recorded by the recording unit 402, obtaining in turn the preceding segment marked in each splicing relation, until the end of the sentence is reached.
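The backward pass performed by the recording unit 402 can be sketched as below. The sketch assumes that each segment is represented as a `(start, end)` pair of character offsets (end exclusive), that a "preceding segment" of a segment is any segment beginning where it ends, and that ties are broken in favour of the segment ranked earlier in the set. The function name `record_relations` and the span representation are hypothetical; the offsets correspond to an assumed reading of the example sentence, 他说的确实在理.

```python
from typing import List, Optional, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

def record_relations(
    spans: Sequence[Span],
    sentence_len: int,
    weights: Optional[Sequence[float]] = None,
) -> Tuple[List[Optional[float]], List[Optional[int]]]:
    """Return (distance to end of sentence, index of marked preceding segment).

    `spans` must be sorted by position in the sentence. A link of None
    means the segment is followed directly by the end of the sentence.
    """
    n = len(spans)
    weights = weights or [1] * n            # equal weights by default
    dist: List[Optional[float]] = [None] * n
    link: List[Optional[int]] = [None] * n
    for i in range(n - 1, -1, -1):          # last segment first
        _, end = spans[i]
        if end == sentence_len:             # sentence-final segment
            dist[i] = weights[i]
            continue
        # Candidates begin where this segment ends; because the set is sorted
        # by position they were all handled earlier in this backward loop.
        candidates = [j for j in range(n)
                      if spans[j][0] == end and dist[j] is not None]
        if not candidates:                  # cannot reach the end of the sentence
            continue
        best = min(candidates, key=lambda j: dist[j])  # first minimum wins ties
        dist[i] = weights[i] + dist[best]
        link[i] = best
    return dist, link

# 他, 他说, 说, 说的, 的, 的确, 确, 确实, 实, 实在, 在, 在理, 理
SPANS = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (3, 5), (4, 5), (4, 6), (5, 6), (5, 7), (6, 7)]
DIST, LINK = record_relations(SPANS, sentence_len=7)
print(DIST)  # [4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1]
```

Passing a `weights` sequence with unequal values covers the case in which the assignment unit 401 gives segments different weights; the loop itself is unchanged.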
The operation of the above Chinese word segmentation device is illustrated below with an example.
For example, take the sentence "他说的确实在理" ("what he said is really reasonable"), and suppose its segment set is: 他, 说, 的, 的确, 确, 确实, 在, 在理, 理. Then the preceding segment of "他" ("he") is "说" ("say"), a preceding segment of "的" is "确", and "确实" ("really") has two preceding segments, "在" and "在理" ("reasonable"); that is, a preceding segment begins where the current segment ends. "在理" and "理" ("reason") are the sentence-final segments of this sentence. They therefore have no preceding segment, and their preceding segment can be defined as the end-of-sentence marker.
For ease of calculation, suppose the assignment unit 401 assigns every segment in the segment set the same weight, equal to 1.
If only one splicing relation is marked for each segment, the distance from each segment to the end of the sentence obtained by the recording unit 402, and the splicing relation between each segment and its preceding segment, are:
[理 1]-end; [在理 1]-end; [在 2]-理; [实在 2]-理; [实 2]-在理; [确实 2]-在理; [确 3]-实; [的确 3]-实; [的 3]-确实; [说的 3]-确实; [说 4]-的; [他说 4]-的; [他 4]-说的.
Two segments can serve as the sentence-initial segment: "他" ("he") and "他说" ("he says"). The distance from "他" to the end of the sentence is 4, and the distance from "他说" to the end of the sentence is also 4. If only one segmentation result needs to be output and segmentation accuracy must be ensured, the sentence-initial segment selection unit 403 chooses, from "他" and "他说", the one that ranks earlier in the segment set, namely "他".
If the sentence-initial segment selected by the sentence-initial segment selection unit 403 is "他", the segment selection unit 404 proceeds from the splicing relation of "他" recorded in the recording unit 402, [他 4]-说的, so the preceding segment of "他" is "说的" ("say"). The splicing relation of "说的" is [说的 3]-确实, so the preceding segment of "说的" is "确实" ("really"). The splicing relation of "确实" is [确实 2]-在理, so the preceding segment of "确实" is "在理" ("reasonable"). The splicing relation of "在理" is [在理 1]-end, so "在理" is followed by the end of the sentence. The segmentation result finally obtained is therefore: 他-说的-确实-在理 (he-say-really-reasonable).
As this result shows, the segments in the segmentation result, joined end to end, form exactly "他说的确实在理", with no redundant characters.
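When every tie is kept instead of one link per segment (the option in which the recording unit marks a relation to each of the shortest preceding segments, and the selection unit keeps every shortest sentence-initial segment), several segmentations come out, and each should still join back into the original sentence. The sketch below enumerates them. It is an illustration only: it works over character offsets rather than segment indices, assumes equal weights of 1, and uses a hypothetical function name `all_shortest` with an assumed reading of the example sentence and its segment set.

```python
from functools import lru_cache
from typing import FrozenSet, List, Tuple

def all_shortest(sentence: str, words: FrozenSet[str]) -> List[List[str]]:
    """Enumerate every segmentation that uses the fewest segments."""
    n = len(sentence)

    @lru_cache(maxsize=None)
    def best(pos: int) -> Tuple[float, Tuple[Tuple[str, ...], ...]]:
        # Returns (fewest segments needed from pos to the end, all such tails).
        if pos == n:
            return 0, ((),)
        options = []
        for end in range(pos + 1, n + 1):
            word = sentence[pos:end]
            if word in words:
                cost, tails = best(end)
                if tails:                       # the end is reachable from `end`
                    options.append((cost + 1, word, tails))
        if not options:
            return float("inf"), ()
        shortest = min(cost for cost, _, _ in options)
        kept = tuple((word,) + tail
                     for cost, word, tails in options if cost == shortest
                     for tail in tails)         # keep every tied continuation
        return shortest, kept

    return [list(path) for path in best(0)[1]]

SENTENCE = "他说的确实在理"
WORDS = frozenset(["他", "他说", "说", "说的", "的", "的确", "确",
                   "确实", "实", "实在", "在", "在理", "理"])
RESULTS = all_shortest(SENTENCE, WORDS)
for r in RESULTS:
    print("-".join(r), "".join(r) == SENTENCE)
```

With all ties kept, this example yields more than the two results obtained when each segment carries a single link, because "他说" can also continue through "的确". Which of these candidates to output is a design choice the description leaves open.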
If the device is used to segment a sentence directly, it further comprises:
A rough segmentation unit, configured to perform word segmentation on the sentence to obtain the segment set of the sentence;
A sorting unit, configured to sort the segments in the segment set according to their positions in the sentence, to obtain the sorted segment set.
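The rough segmentation unit and the sorting unit can be sketched as a dictionary lookup followed by a sort. The toy dictionary, the function names `rough_segment` and `sort_segments`, and the sort key (start offset first, then end offset, so that a shorter segment precedes a longer one with the same start) are assumptions chosen to reproduce the order in which the example lists its segment set; the source does not specify them.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

def rough_segment(sentence: str, dictionary: Iterable[str]) -> List[Span]:
    """Collect every occurrence of every dictionary word in the sentence."""
    spans: List[Span] = []
    for word in dictionary:
        pos = sentence.find(word)
        while pos != -1:
            spans.append((pos, pos + len(word)))
            pos = sentence.find(word, pos + 1)  # look for later occurrences
    return spans

def sort_segments(spans: List[Span]) -> List[Span]:
    """Order segments by where they start, then by where they end."""
    return sorted(spans, key=lambda span: (span[0], span[1]))

SENTENCE = "他说的确实在理"
# Hypothetical dictionary, deliberately out of order to show the sort.
TOY_DICTIONARY = ["在理", "他", "确实", "的", "说的", "理", "实在",
                  "他说", "确", "在", "的确", "实", "说"]
SEGMENT_SET = [SENTENCE[s:e]
               for s, e in sort_segments(rough_segment(SENTENCE, TOY_DICTIONARY))]
print(", ".join(SEGMENT_SET))
```

The sorted spans are the form of input that a backward pass over the segment set expects, since every segment that can follow a given segment then appears later in the list.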
The Chinese word segmentation method and device provided by the present invention have been described in detail above. Specific examples have been used herein to explain the principle and embodiments of the invention, and the description of the above embodiments is intended only to aid understanding of the method of the invention and its core idea. At the same time, persons of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and to the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims (10)

1. A Chinese word segmentation method, characterized by comprising:
Assigning weights to the segments in a segment set, the segments in said segment set being sorted according to their positions in the sentence;
Starting from the last segment in said segment set, recording the sum of the weight of the current segment and the distance from its preceding segment to the end of the sentence as the distance from the current segment to the end of the sentence, and marking the splicing relation between that preceding segment and the current segment, until the distance from the first segment in the segment set to the end of the sentence and the splicing relation between said first segment and its preceding segment have been obtained; wherein said preceding segment is the one, among all preceding segments of said current segment, with the shortest distance to the end of the sentence;
Starting from the first segment of said segment set, selecting the sentence-initial segment with the shortest distance to the end of the sentence, the first character of said sentence-initial segment being the first character of said sentence;
Starting from said sentence-initial segment, following said splicing relations and obtaining in turn the preceding segment marked in each splicing relation, until the end of the sentence is reached.
2. The method of claim 1, characterized in that said method further comprises:
Performing word segmentation on the sentence to obtain the segment set of said sentence;
Sorting the segments in said segment set according to their positions in the sentence, to obtain the sorted segment set.
3. The method of claim 1 or 2, characterized in that, if at least two preceding segments of the current segment have equal and shortest distances to the end of the sentence, then
the preceding segment that ranks earliest in the segment set is selected from among them, and one splicing relation is marked between the current segment and said preceding segment.
4. The method of claim 3, characterized in that, if at least two sentence-initial segments have equal and shortest distances to the end of the sentence, then
the sentence-initial segment that ranks earliest in the segment set is selected from among them.
5. The method of claim 4, characterized in that the weights assigned to the segments in the segment set are equal or unequal.
6. A Chinese word segmentation device, characterized by comprising:
An assignment unit, configured to assign weights to the segments in a segment set, the segments in said segment set being sorted according to their positions in the sentence;
A recording unit, configured to, starting from the last segment in said sorted set, record the sum of the weight of the current segment and the distance from its preceding segment to the end of the sentence as the distance from the current segment to the end of the sentence, and mark the splicing relation between that preceding segment and the current segment, until the distance from the first segment in the segment set to the end of the sentence and the splicing relation between said first segment and its preceding segment have been obtained; wherein said preceding segment is the one, among all preceding segments of said current segment, with the shortest distance to the end of the sentence;
A sentence-initial segment selection unit, configured to, starting from the first segment in the segment set, select the sentence-initial segment with the shortest distance to the end of the sentence, the first character of said sentence-initial segment being the first character of said sentence;
A segment selection unit, configured to, starting from said sentence-initial segment, follow said splicing relations and obtain in turn the preceding segment marked in each splicing relation, until the end of the sentence is reached.
7. The device of claim 6, characterized in that said device further comprises:
A sentence rough segmentation unit, configured to perform word segmentation on the sentence to obtain the segment set of said sentence;
A sorting unit, configured to sort the segments in said segment set according to their positions in the sentence, to obtain the sorted segment set.
8. The device of claim 6 or 7, characterized in that, if at least two preceding segments of the current segment have equal and shortest distances to the end of the sentence, then
said recording unit is configured to select from among them the preceding segment that ranks earliest in the segment set, and to mark one splicing relation between the current segment and said preceding segment.
9. The device of claim 8, characterized in that, if at least two sentence-initial segments have equal and shortest distances to the end of the sentence, then
said sentence-initial segment selection unit is configured to select from among them the sentence-initial segment that ranks earliest in the segment set.
10. The device of claim 9, characterized in that the weights that said assignment unit assigns to the segments in the segment set are equal or unequal.
CN 200710102082 2007-05-14 2007-05-14 Chinese word cutting method and device Pending CN101071421A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200710102082 CN101071421A (en) 2007-05-14 2007-05-14 Chinese word cutting method and device

Publications (1)

Publication Number Publication Date
CN101071421A true CN101071421A (en) 2007-11-14

Family

ID=38898648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200710102082 Pending CN101071421A (en) 2007-05-14 2007-05-14 Chinese word cutting method and device

Country Status (1)

Country Link
CN (1) CN101071421A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101158969B (en) * 2007-11-23 2010-06-02 腾讯科技(深圳)有限公司 Whole sentence generating method and device
CN101246473B (en) * 2008-03-28 2010-09-15 腾讯科技(深圳)有限公司 Segmentation system evaluating method and segmentation evaluating system
CN104252542A (en) * 2014-09-29 2014-12-31 南京航空航天大学 Dynamic-planning Chinese words segmentation method based on lexicons
CN108364632A (en) * 2017-12-22 2018-08-03 东南大学 A kind of Chinese text voice synthetic method having emotion
CN108364632B (en) * 2017-12-22 2021-09-10 东南大学 Emotional Chinese text voice synthesis method
CN108664468A (en) * 2018-05-02 2018-10-16 武汉烽火普天信息技术有限公司 A kind of name recognition methods and device based on dictionary and semantic disambiguation
CN111353281A (en) * 2020-02-24 2020-06-30 百度在线网络技术(北京)有限公司 Text conversion method and device, electronic equipment and storage medium
CN112784577A (en) * 2021-01-26 2021-05-11 鲁巧巧 Sentence association learning system for English teaching
CN112784577B (en) * 2021-01-26 2022-11-18 鲁巧巧 Sentence association learning system for English teaching

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20071114