CN101082908A - Method and system for dividing Chinese sentences - Google Patents
- Publication number
- CN101082908A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Character Discrimination (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a Chinese word segmentation method and system in the field of Chinese information processing. The method comprises the following steps: A. performing atomic segmentation on the input Chinese text and building an initial segmentation word graph from the resulting atom sequence; B. performing dictionary-word segmentation and specific-word recognition separately on the atom sequence, and adding each independent segmentation result to the word graph; C. generating an optimal segmentation path from the segmentation results in the word graph and outputting the combined segmentation result along that path. The invention improves the accuracy of Chinese word segmentation, and because each kind of specific word can be recognized selectively as circumstances require, it also improves efficiency.
Description
Technical field
The present invention relates to the field of Chinese information processing and, more particularly, to a Chinese word segmentation method and system.
Background technology
Chinese information processing technology is now widely used in computing fields such as computer networks, database technology and software engineering. Automatic Chinese word segmentation is a fundamental task in Chinese information processing, and many applications involve it, including machine translation, automatic summarization, automatic classification, full-text retrieval of Chinese document collections, and search engines. Because Chinese text is written continuously, with no spaces between words, segmentation is the first problem encountered when processing it, and correct segmentation is a prerequisite for any further processing of Chinese text.
Chinese word segmentation algorithms fall into three broad classes: methods based on string matching, methods based on understanding, and methods based on statistics.
(1) Segmentation based on string matching, also called mechanical segmentation, matches the Chinese character string under analysis against the entries of a sufficiently large machine dictionary according to some strategy. If a string is found in the dictionary, the match succeeds and a word is identified.
(2) Segmentation based on understanding performs syntactic and semantic analysis during segmentation and uses syntactic and semantic information to resolve ambiguity. It simulates the way a person understands a sentence and requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is broad and complex, it is difficult to organize the various kinds of linguistic information into a form a machine can read directly, so segmentation based on understanding is not yet mature.
(3) Segmentation based on statistics rests on the observation that, formally, a word is a stable combination of characters: the more often adjacent characters occur together in context, the more likely they are to form a word. The frequency with which adjacent characters co-occur in a text can therefore be counted, that is, the adjacent co-occurrence probability between the characters of a character group can be computed, and when this probability exceeds a threshold the group can be regarded as a possible word. Every word produced by a statistical method carries probability information, and the segmentation with the highest probability is finally selected from all possible segmentations. This method has the advantage of disambiguating automatically and is currently the mainstream approach to segmentation.
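The statistical idea in item (3) can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the toy corpus and the threshold are assumptions chosen for illustration, and the "probability" here is simply the relative frequency of each adjacent character pair.

```python
from collections import Counter

def cooccurrence_probs(corpus):
    """Relative frequency of every adjacent character pair in the corpus."""
    pairs = Counter()
    for sentence in corpus:
        # zip(sentence, sentence[1:]) yields each pair of neighbouring characters.
        pairs.update(zip(sentence, sentence[1:]))
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {a + b: n / total for (a, b), n in pairs.items()}

def candidate_words(corpus, threshold):
    """Character pairs whose co-occurrence probability exceeds the threshold."""
    return {pair for pair, p in cooccurrence_probs(corpus).items() if p > threshold}

# Hypothetical toy corpus and threshold, for illustration only.
toy_corpus = ["研究生物", "研究方法", "生物研究"]
candidates = candidate_words(toy_corpus, threshold=0.2)
```

A real statistical segmenter would use a far larger corpus and a more careful association measure; the sketch only shows the counting step.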
In one existing segmentation method the steps are as follows: A. split the input text into individual atoms; B. identify, in the resulting atom sequence, the words contained in the dictionary (dictionary words for short); C. disambiguate the identified dictionary words using a table of common ambiguous segmentations stored in the system; D. on the basis of the disambiguated result, further recognize specific words that follow regular patterns (such as time words, numerals, personal names and place names), and finally output the segmentation result.
The prior-art method above has the following drawback: dictionary-word segmentation, disambiguation of ambiguous words, and specific-word recognition are separate stages, so an error made in an earlier stage propagates to the later stages and cannot be corrected in time, which makes the final segmentation result wrong. For example, suppose the input sentence is "王芳是研究生物的。" (roughly, "Wang Fang studies biology."). Dictionary-word segmentation (for which the forward maximum matching algorithm may be used) yields "王/芳/是/研究生/物/的/。", which contains the segmentation error "研究生/物" ("postgraduate / thing"). The disambiguation stage relies mainly on the table of common ambiguous segmentations stored in the system, and this table has a critical flaw: it can never cover all ambiguity phenomena in the language and does not extend well. If the table has no entry for "研究生物" in this example, the error "研究生/物" cannot be repaired. In the next stage, specific words are recognized using the time/numeral table, the surname table and the place-name suffix table, and the final output is "王芳/是/研究生/物/的/。". This result still carries the error introduced in the dictionary-word segmentation stage.
A new Chinese word segmentation method is therefore needed to improve the accuracy of Chinese word segmentation.
Summary of the invention
An object of the present invention is to provide a Chinese word segmentation system that addresses the low accuracy of existing Chinese word segmentation methods.
A further object of the present invention is to provide a Chinese word segmentation method that better solves the above problems of the prior art.
To achieve the object of the invention, the Chinese word segmentation system comprises an input/output unit, an atomic segmentation unit, a dictionary-word segmentation unit and a specific-word recognition unit, and further comprises a segmentation word graph unit and a segmentation path generation unit;
The segmentation word graph unit is connected to the atomic segmentation unit, the dictionary-word segmentation unit and the specific-word recognition unit, and stores the independent segmentation results of these three units in the segmentation word graph;
The segmentation path generation unit is connected to the segmentation word graph unit, generates an optimal segmentation path from the independent segmentation results in the word graph, and outputs the combined segmentation result according to that optimal path.
Preferably, the specific-word recognition unit comprises a time/numeral recognition module, a personal-name recognition module and a place-name recognition module;
The time/numeral recognition module has a time/numeral table, is used to recognize time words and numerals, and saves them in the segmentation word graph;
The personal-name recognition module has a surname table, is used to recognize personal names, and saves them in the segmentation word graph;
The place-name recognition module has a place-name suffix table, is used to recognize place names, and saves them in the segmentation word graph.
Preferably, the specific-word recognition unit further comprises a startup configuration module connected to each of the time/numeral recognition module, the personal-name recognition module and the place-name recognition module;
The startup configuration module is used to start the time/numeral recognition module, the personal-name recognition module and the place-name recognition module selectively.
Preferably, the segmentation path generation unit is further used to compute, from the probability information of each independent segmentation result, the word arc probabilities at each node of the segmentation word graph, and to take the segmentation path with the largest product of word arc probabilities as the optimal segmentation path.
To better achieve the object of the invention, the Chinese word segmentation method is based on the Chinese word segmentation system described above and comprises the following steps:
A. performing atomic segmentation on the input Chinese text, and building an initial segmentation word graph from the resulting atom sequence;
B. performing dictionary-word segmentation and specific-word recognition separately on the atom sequence, and adding each independent segmentation result to the segmentation word graph;
C. generating an optimal segmentation path from the independent segmentation results in the segmentation word graph, and outputting the combined segmentation result according to that optimal path.
Preferably, before step A the method further comprises: storing a dictionary and specific-word tables in the Chinese word segmentation system;
The dictionary contains common words;
The specific-word tables comprise a time/numeral table, a surname table and a place-name suffix table.
Preferably, the dictionary-word segmentation in step B comprises using the forward maximum matching algorithm to compare the atom sequence with the words contained in the dictionary, and treating every matched word as a dictionary word.
Preferably, the specific-word recognition in step B comprises performing at least one of the following three operations:
Recognizing time words and numerals using the time/numeral table;
Recognizing personal names using the surname table;
Recognizing place names using the place-name suffix table.
Preferably, step C further comprises: computing, from the probability information of each independent segmentation result, the word arc probabilities at each node of the segmentation word graph, and taking the segmentation path with the largest product of word arc probabilities as the optimal segmentation path.
Preferably, the probability information of the independent segmentation results comprises the probability information of dictionary words and the probability information of specific words;
The probability information of a dictionary word is the word-formation probability of the corresponding dictionary entry;
The probability information of a specific word is the initial probability, emission probability and transition probability of a hidden Markov model.
The present invention performs dictionary-word segmentation and specific-word recognition separately on the atom sequence of the Chinese text, generates an optimal segmentation path from the respective independent segmentation results, and outputs the combined segmentation result according to that path, thereby improving the accuracy of Chinese word segmentation. In addition, recognition of each kind of specific word can be enabled selectively as circumstances require, which improves the efficiency of Chinese word segmentation.
Description of drawings
Fig. 1 is a structural diagram of the Chinese word segmentation system of the present invention;
Fig. 2 is an internal structural diagram of the specific-word recognition unit of the system shown in Fig. 1 in one embodiment;
Fig. 3 is an internal structural diagram of the specific-word recognition unit of the system shown in Fig. 1 in another embodiment;
Fig. 4 is a flowchart of the Chinese word segmentation method of the present invention;
Fig. 5 is a flowchart of the Chinese word segmentation method in one embodiment of the present invention;
Fig. 6 is a schematic diagram of the segmentation word graph after atomic segmentation in one embodiment of the present invention;
Fig. 7 is a schematic diagram of the segmentation word graph after specific-word recognition in one embodiment of the present invention.
Embodiment
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The present invention performs atomic segmentation on the input Chinese text, performs dictionary-word segmentation and specific-word recognition separately on the atom sequence, adds each independent segmentation result to the segmentation word graph, generates an optimal segmentation path from the independent results in the word graph, and finally outputs the combined segmentation result according to that path. Because the technical solution considers the processing of all stages together, it avoids the propagation of errors that occurs when the stages are carried out in sequence, and thereby improves the accuracy of Chinese word segmentation.
Fig. 1 shows the structure of the Chinese word segmentation system of the present invention. The system can be applied in many kinds of terminal devices, for example a personal computer (PC), a personal digital assistant (PDA) or a mobile phone (MP), so the system should not be regarded as limited to any particular type of terminal device.
The Chinese word segmentation system comprises an input/output unit 100, an atomic segmentation unit 200, a dictionary-word segmentation unit 300, a specific-word recognition unit 400, a segmentation word graph unit 500 and a segmentation path generation unit 600, and these functional units exchange information with one another. It should be noted that in all the drawings of the present invention the connections between devices are drawn to explain clearly how information is exchanged and controlled. They should therefore be regarded as logical connections and not as limited to physical connections. Specifically:
(1) The input/output unit 100 mainly performs the following functions: receiving the original Chinese text as input, and outputting the final segmentation result.
(2) The atomic segmentation unit 200 is connected to the input/output unit 100 and also to the dictionary-word segmentation unit 300, the specific-word recognition unit 400 and the segmentation word graph unit 500. It performs atomic segmentation on the original Chinese text supplied by the input/output unit 100 to obtain an atom sequence, builds the initial segmentation word graph from that sequence, and saves it in the segmentation word graph unit 500. The term "atom" as used in the present invention is explained as follows: every piece of Chinese text (a word, a phrase, a complete sentence, and so on) contains a number of nodes, and the character string between two adjacent nodes is one atom. For example, if the original input text is "他是一名教师。" ("He is a teacher."), the text contains 8 nodes (denoted by the symbol "●") and 7 atoms, and the result of atomic segmentation is "●他●是●一●名●教●师●。●", which is exactly the initial segmentation word graph built in the segmentation word graph unit 500.
In this word graph there is a word arc between each pair of adjacent nodes (as shown in Fig. 6), and every word arc carries probability information. For a dictionary word, the probability information is the word-formation probability of the corresponding dictionary entry; for a specific word, it is the initial probability, emission probability and transition probability of a hidden Markov model (HMM).
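The atom sequence and initial word graph described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes every character is one atom (the patent does not state how digits or Latin letters are grouped), it assumes the example sentence reads "他是一名教师。", and the names `atom_segment`, `build_initial_graph` and the placeholder arc probability are hypothetical.

```python
def atom_segment(text):
    """Split text into atoms; in this sketch every character is one atom."""
    return list(text)

def build_initial_graph(atoms, default_prob=1.0):
    """Build the initial word graph as {(start_node, end_node): (word, prob)}.

    Each atom becomes one arc between adjacent nodes. default_prob is a
    placeholder; the patent attaches real probability information to arcs.
    """
    return {(i, i + 1): (atom, default_prob) for i, atom in enumerate(atoms)}

atoms = atom_segment("他是一名教师。")
graph = build_initial_graph(atoms)
node_count = len(atoms) + 1  # n atoms are delimited by n + 1 nodes
```

Storing arcs keyed by their start and end nodes makes it easy for later stages to add longer arcs (dictionary words, names) that span several atoms without disturbing the single-atom arcs.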
(3) The dictionary-word segmentation unit 300 is connected to the atomic segmentation unit 200 and the segmentation word graph unit 500. It stores a dictionary containing a large number of common words, performs dictionary-word segmentation on the atom sequence produced by the atomic segmentation unit 200, identifies all words that appear in the dictionary, and adds them to the segmentation word graph.
In one exemplary scheme, the dictionary-word segmentation unit 300 uses the forward maximum matching algorithm. The procedure is as follows: first set a maximum word length N for lookup (for example 10 Chinese characters); then scan forward from the first character of the sentence, comparing against the words in the dictionary, to find the longest matching word; then continue the lookup from the character following that word, and repeat until the end of the sentence. For example, the forward maximum matching result for the sentence "王芳是研究生物的。" is "王/芳/是/研究生/物/的/。".
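The forward maximum matching procedure can be sketched in Python as below. The function name, the toy dictionary and the Chinese rendering of the example sentence ("王芳是研究生物的。") are assumptions made for illustration; only the algorithm itself (longest match first, maximum length N, single characters as fallback) follows the description above.

```python
def forward_max_match(atoms, dictionary, max_len=10):
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    words = []
    i = 0
    while i < len(atoms):
        # Try the longest candidate first and shrink until a dictionary hit;
        # a single atom is always accepted so the scan cannot stall.
        for length in range(min(max_len, len(atoms) - i), 0, -1):
            candidate = "".join(atoms[i:i + length])
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Hypothetical toy dictionary, not the patent's dictionary.
toy_dictionary = {"研究", "研究生", "生物", "是", "的"}
fmm_result = forward_max_match(list("王芳是研究生物的。"), toy_dictionary)
```

With this toy dictionary the greedy scan picks "研究生" before it can consider "研究" followed by "生物", which reproduces the segmentation error the patent uses as its motivating example.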
(4) The specific-word recognition unit 400 is connected to the atomic segmentation unit 200 and the segmentation word graph unit 500 and is used to recognize various specific words, including time words, numerals, personal names and place names. It holds a corresponding set of specific-word tables, which it compares against the atom sequence to recognize each kind of specific word.
In one exemplary scheme, as shown in Fig. 2, the specific-word recognition unit 400 further comprises a time/numeral recognition module 401, a personal-name recognition module 402 and a place-name recognition module 403. Of these: (1) the time/numeral recognition module 401 holds a time/numeral table, which it compares against the time words and numerals in the atom sequence; (2) the personal-name recognition module 402 holds a surname table and a Chinese-character role probability table (giving the probability that a character serves as each component of a word not registered in the dictionary). Taking the text "王芳是研究生物的" as an example: the module scans forward from the first character of the sentence; on reaching "王" it consults the surname table and finds that this is a surname; it then checks the two characters following "王", and if the probability of each character being part of a given name exceeds a threshold, the string is regarded as a personal name. In this example the probability of "芳" as a given-name character exceeds the threshold while that of "是" is below it, so "王芳" is recognized as a personal name; (3) the place-name recognition module 403 is similar to the personal-name recognition module 402: it holds a place-name suffix table and a Chinese-character role probability table and recognizes place names in the same way. The present invention can of course also recognize other kinds of specific words, so the specific-word recognition unit 400 is not limited to the modules listed above.
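The personal-name check can be sketched in Python as follows. This is a simplified reading of the description, not the patent's implementation: it accepts the run of characters after a surname (at most two) whose given-name probability exceeds the threshold. The function name, the probability values and the threshold are hypothetical, and the sketch ignores the HMM probabilities the patent attaches to specific words.

```python
def recognize_names(atoms, surnames, given_prob, threshold=0.5, max_given=2):
    """Return (start_node, end_node, name) for each surname followed by
    one or two characters that are likely given-name characters."""
    found = []
    for i, atom in enumerate(atoms):
        if atom not in surnames:
            continue
        end = i + 1
        # Extend over at most max_given characters that pass the threshold.
        while (end < len(atoms)
               and end - i <= max_given
               and given_prob.get(atoms[end], 0.0) > threshold):
            end += 1
        if end > i + 1:  # at least one given-name character was accepted
            found.append((i, end, "".join(atoms[i:end])))
    return found

# Hypothetical tables for illustration only.
toy_surnames = {"王"}
toy_given_prob = {"芳": 0.8, "是": 0.01}
names = recognize_names(list("王芳是研究生物的。"), toy_surnames, toy_given_prob)
```

The returned node indices let the result be added directly to the word graph as an arc spanning nodes 0 to 2.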
In another exemplary scheme, as shown in Fig. 3, the specific-word recognition unit 400 comprises, in addition to the time/numeral recognition module 401, the personal-name recognition module 402 and the place-name recognition module 403, a startup configuration module 404 that is connected to each of those three modules and is used to start them selectively as circumstances require. In practice it is not always necessary to recognize every kind of specific word, and recognizing them selectively improves the efficiency of Chinese word segmentation. In one embodiment of this scheme, the implementation is as follows. First, a configuration file is provided that is read during the initialization phase:
<?xml version="1.0" encoding="GB2312"?>
<TseSegment>
<!-- Whether to perform time/numeral recognition: 1 = yes, 0 = no -->
<NumTime>1</NumTime>
<!-- Whether to perform personal-name recognition: 1 = yes, 0 = no -->
<Person>1</Person>
<!-- Whether to perform place-name recognition: 1 = yes, 0 = no -->
<Location>1</Location>
</TseSegment>
This configuration file has three configuration items, NumTime, Person and Location, which stand for time/numeral, personal name and place name respectively. A global variable can be added for each to indicate whether the corresponding module is needed, as follows:
// configuration switches
bool g_bIsNumTime;
bool g_bIsPerson;
bool g_bIsLocation;
These variables are assigned when the program initializes, and during segmentation the value of each variable is tested: if the value is 1, the operation of the corresponding module is performed; otherwise the module is skipped.
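The configuration-driven switching can be sketched in Python as below. The sketch assumes the three element names from the configuration file shown in the source (NumTime, Person, Location); the helper names `load_switches` and `run_enabled`, the inline XML string (given without the XML declaration for simplicity) and the stub recognizers are hypothetical. It uses the standard-library `xml.etree.ElementTree` parser.

```python
import xml.etree.ElementTree as ET

CONFIG_XML = """<TseSegment>
  <NumTime>1</NumTime>
  <Person>0</Person>
  <Location>1</Location>
</TseSegment>"""

SWITCH_TAGS = ("NumTime", "Person", "Location")

def load_switches(xml_text):
    """Read each switch element; a module is enabled only if its value is 1."""
    root = ET.fromstring(xml_text)
    return {tag: root.findtext(tag, default="0").strip() == "1"
            for tag in SWITCH_TAGS}

def run_enabled(atoms, switches, recognizers):
    """Call only the recognizers whose switch is on and collect their results."""
    results = []
    for tag, recognizer in recognizers.items():
        if switches.get(tag, False):
            results.extend(recognizer(atoms))
    return results

switches = load_switches(CONFIG_XML)
```

A dictionary of switches plays the role of the three global variables in the source; a missing element is treated as disabled, which is a design choice of this sketch rather than something the patent specifies.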
In the present invention, the operations performed by the dictionary-word segmentation unit 300 and the specific-word recognition unit 400 are independent of each other and have no required order. They can be processed in parallel, each producing its own independent segmentation result and sending it to the segmentation word graph unit 500.
(5) The segmentation word graph unit 500 is connected to the atomic segmentation unit 200, the dictionary-word segmentation unit 300 and the specific-word recognition unit 400, and stores the independent segmentation results of these three units in the segmentation word graph.
(6) The segmentation path generation unit 600 is connected to the segmentation word graph unit 500, generates an optimal segmentation path from the independent segmentation results in the word graph, and outputs the combined segmentation result according to that optimal path.
In one exemplary scheme, the optimal segmentation path is the segmentation path with the largest product of word arc probabilities. With reference to Fig. 7, the process of generating the optimal segmentation path is described in detail below:
The segmentation word graph of Fig. 7 has 7 nodes, numbered 0 to 6. All nodes are scanned from left to right. Let the probability value of the current node be m, the maximum probability value of a predecessor node be a, and the probability value of the arc from that predecessor to the current node be b; then for that predecessor m = a*b. The values computed for all predecessors are compared, and the largest one is kept together with the corresponding predecessor node. This process is repeated, and by the time the last node has been processed every node has stored its best predecessor, so tracing back from the last node yields an optimal segmentation path. For example, suppose the node currently being processed is node 5 and its predecessors are node 2 and node 4. There are then two kinds of segmentation path from node 0 to node 5: (1) "node 0 - node 2 - node 5", in which the arc from node 2 to node 5 represents the word "Zhang Huipeng"; (2) "node 0 - node 4 - node 5", in which the arc from node 4 to node 5 is the single character "Peng". From the word probabilities and word arc probabilities it can be computed that the maximum probability from node 0 to node 2 multiplied by the probability of "Zhang Huipeng" is greater than the maximum probability from node 0 to node 4 multiplied by the probability of "Peng". Therefore, on the maximum-probability segmentation path from node 0 to node 5, the predecessor of node 5 is node 2.
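The left-to-right scan with back-pointers described above can be sketched in Python as follows. It is a minimal illustration under stated assumptions: the word graph is a dictionary of arcs keyed by (start node, end node), the arc probabilities are invented toy values, and the example uses the sentence "王芳是研究生物的。" rather than the Fig. 7 example, whose characters are not given in the text. The name `best_path` is hypothetical.

```python
def best_path(node_count, arcs):
    """Find the path from node 0 to the last node with the largest
    product of arc probabilities.

    arcs maps (start, end) -> (word, probability) with start < end.
    """
    best_prob = [0.0] * node_count
    back = [None] * node_count  # best predecessor and word for each node
    best_prob[0] = 1.0
    for end in range(1, node_count):
        for (s, e), (word, p) in arcs.items():
            # m = a * b: predecessor's best probability times arc probability.
            if e == end and best_prob[s] * p > best_prob[end]:
                best_prob[end] = best_prob[s] * p
                back[end] = (s, word)
    # Trace back from the last node to recover the words in order.
    words = []
    node = node_count - 1
    while node != 0:
        if back[node] is None:
            raise ValueError("last node is not reachable from node 0")
        node, word = back[node]
        words.append(word)
    return list(reversed(words)), best_prob[-1]

# Toy word graph: single-character arcs plus a few longer arcs (invented values).
sentence = "王芳是研究生物的。"
toy_arcs = {(i, i + 1): (ch, 0.01) for i, ch in enumerate(sentence)}
toy_arcs.update({
    (0, 2): ("王芳", 0.3),    # added by personal-name recognition
    (3, 5): ("研究", 0.2),    # dictionary words
    (3, 6): ("研究生", 0.1),
    (5, 7): ("生物", 0.2),
})
best_words, best_probability = best_path(len(sentence) + 1, toy_arcs)
```

Because all candidate arcs sit in one graph, the path through "研究" and "生物" can outscore the path through "研究生", which is how the combined search avoids the error that forward maximum matching alone would make. For long sentences, summing log-probabilities instead of multiplying raw probabilities would avoid numerical underflow.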
Fig. 4 shows the flow of the Chinese word segmentation method of the present invention. The method is based on the system structures shown in Fig. 1, Fig. 2 and Fig. 3, and proceeds as follows:
Before any step of the present invention is carried out, the dictionary-word segmentation unit 300 holds a dictionary containing a large number of common words, and the specific-word recognition unit 400 holds a set of specific-word tables, including the time/numeral table, the surname table, the place-name suffix table and the Chinese-character role probability table.
In step S401, the atomic segmentation unit 200 performs atomic segmentation on the input Chinese text to obtain an atom sequence, builds the initial segmentation word graph from that sequence, and saves it in the segmentation word graph unit 500. The term "atom" as used in the present invention is explained as follows: every piece of Chinese text (a word, a phrase, a complete sentence, and so on) contains a number of nodes, and the character string between two adjacent nodes is one atom. For example, if the original input text is "他是一名教师。" ("He is a teacher."), the text contains 8 nodes (denoted by the symbol "●") and 7 atoms, and the result of atomic segmentation is "●他●是●一●名●教●师●。●", which is exactly the initial segmentation word graph built in the segmentation word graph unit 500.
In this word graph there is a word arc between each pair of adjacent nodes (as shown in Fig. 6), and every word arc carries probability information. For a dictionary word, the probability information is the probability stored for that word in the dictionary; for a specific word, it is the initial probability, emission probability and transition probability of a hidden Markov model (HMM).
In step S402, the dictionary-word segmentation unit 300 and the specific-word recognition unit 400 perform dictionary-word segmentation and specific-word recognition respectively on the atom sequence, and add their independent segmentation results to the segmentation word graph. It should be noted that in the present invention the operations performed by the dictionary-word segmentation unit 300 and the specific-word recognition unit 400 are independent of each other and have no required order. They can be processed in parallel, each producing its own independent segmentation result and sending it to the segmentation word graph unit 500.
The dictionary-word segmentation unit 300 stores a dictionary containing a large number of common words, so it can perform dictionary-word segmentation on the atom sequence, identify all words that appear in the dictionary, and add them to the segmentation word graph.
The specific-word recognition unit 400 holds a set of specific-word tables, which it compares against the atom sequence to recognize specific words such as time words, numerals, personal names and place names.
In step S403, an optimal segmentation path is generated from the independent segmentation results in the segmentation word graph, and the combined segmentation result is output according to that optimal path.
In one exemplary scheme of this step, the segmentation path generation unit 600 computes, from the probability information of each independent segmentation result, the word arc probabilities at each node of the segmentation word graph, and takes the segmentation path with the largest product of word arc probabilities as the optimal segmentation path. The probability information of the independent segmentation results referred to in the present invention comprises the probability information of dictionary words and that of specific words: the former is the word-formation probability of the corresponding dictionary entry, and the latter is the initial probability, emission probability and transition probability of a hidden Markov model. In one embodiment of this scheme, as shown in Fig. 7, the segmentation word graph has 7 nodes, numbered 0 to 6. All nodes are scanned from left to right. Let the probability value of the current node be m, the maximum probability value of a predecessor node be a, and the probability value of the arc from that predecessor to the current node be b; then for that predecessor m = a*b. The values computed for all predecessors are compared, and the largest one is kept together with the corresponding predecessor node. This process is repeated, and by the time the last node has been processed every node has stored its best predecessor, so tracing back from the last node yields an optimal segmentation path.
Fig. 5 shows the flow of the Chinese word segmentation method in one embodiment of the present invention. The method is based on the system structures shown in Fig. 1, Fig. 2 and Fig. 3, and proceeds as follows:
Before any step of the present invention is carried out, the dictionary-word segmentation unit 300 holds a dictionary containing a large number of common words, and the specific-word recognition unit 400 holds a set of specific-word tables, including the time/numeral table, the surname table, the place-name suffix table and the Chinese-character role probability table.
In step S501, the original Chinese text is input through the input/output unit 100.
In step S502, the atomic segmentation unit 200 performs atomic segmentation on the input Chinese text to obtain an atom sequence, builds the initial segmentation word graph from that sequence, and saves it in the segmentation word graph unit 500. The term "atom" as used in the present invention is explained as follows: every piece of Chinese text (a word, a phrase, a complete sentence, and so on) contains a number of nodes, and the character string between two adjacent nodes is one atom. For example, if the original input text is "他是一名教师。" ("He is a teacher."), the text contains 8 nodes (denoted by the symbol "●") and 7 atoms, and the result of atomic segmentation is "●他●是●一●名●教●师●。●", which is exactly the initial segmentation word graph built in the segmentation word graph unit 500.
In this word graph there is a word arc between each pair of adjacent nodes (as shown in Fig. 6), and every word arc carries probability information. For a dictionary word, the probability information is the probability stored for that word in the dictionary; for a specific word, it is the initial probability, emission probability and transition probability of a hidden Markov model (HMM).
In step S503, dictionary word cutting unit 300 identifies dictionary word in the text based on atomic series, and adds among the segmenting word figure.Because dictionary word cutting unit 300 stores the dictionary of having included a large amount of common wordss, thereby can carry out the dictionary word cutting based on atomic series, identifies the speech that all are included in dictionary, and adds among the segmenting word figure.Its specific implementation process is similar to prior art.
In one exemplary scheme, the dictionary word segmentation unit 300 uses the forward maximum matching algorithm for dictionary word segmentation. The forward maximum matching algorithm referred to in the present invention proceeds as follows: first, a maximum word length N is set (for example, 10 Chinese characters); the sentence is then scanned forward from its first character and compared against the entries in the dictionary to find the longest matching word; the search continues from the character following that word, and the process is repeated until the end of the sentence. For example, the forward maximum matching result for the sentence "Wang Fang studies biology." is "Wang / Fang / is / postgraduate / thing / [particle] / .".
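The forward maximum matching procedure described above can be sketched as follows, under the assumption that the input is already an atom sequence and the dictionary is a plain set of strings. The function name forwardMaxMatch and the three-entry toy dictionary in the usage note are hypothetical, and the Chinese sentence there is an assumed reconstruction of the translated example.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Greedy forward maximum matching over an atom sequence.
// At each position the longest dictionary entry of at most maxLen atoms wins;
// if nothing matches, the single atom is emitted on its own.
std::vector<std::string> forwardMaxMatch(const std::vector<std::string>& atoms,
                                         const std::set<std::string>& dict,
                                         std::size_t maxLen = 10) {
    std::vector<std::string> words;
    std::size_t i = 0;
    while (i < atoms.size()) {
        const std::size_t limit = std::min(maxLen, atoms.size() - i);
        std::size_t matched = 1;                      // fallback: one atom
        for (std::size_t len = limit; len > 1; --len) {
            std::string candidate;
            for (std::size_t k = 0; k < len; ++k) candidate += atoms[i + k];
            if (dict.count(candidate) != 0) { matched = len; break; }
        }
        std::string word;
        for (std::size_t k = 0; k < matched; ++k) word += atoms[i + k];
        words.push_back(word);
        i += matched;                                 // continue after the match
    }
    return words;
}
```

With the atoms of 王芳是研究生物的。 and a dictionary holding 研究, 研究生 and 生物, the greedy match picks 研究生 at the fourth atom and leaves 物 on its own, which reproduces the seven-part result given in the text.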
In step S504, the specific word recognition unit 400 identifies the specific words in the text on the basis of the atom sequence and adds them to the word segmentation graph. Because the unit holds several specific word tables, it can compare them against the atom sequence and identify the various kinds of specific words, such as time words, numerals, person names and place names.
In one exemplary scheme, the above step is based on the specific word recognition unit 400 shown in Figure 2, which comprises a time and numeral recognition module 401, a person name recognition module 402 and a place name recognition module 403, where: (1) the time and numeral recognition module 401 holds a time and numeral table, which it compares against the atom sequence to find time words and numerals; (2) the person name recognition module 402 holds a surname table and a Chinese character role probability table (which gives the probability of a Chinese character acting as each part of a word not recorded in the dictionary). Taking the Chinese text "Wang Fang studies biology" as an example: the sentence is scanned forward from its first character; when the character "Wang" is reached, the surname table shows that it is a surname, so the two characters following "Wang" are checked. If the probability of a character being part of a given name is above a threshold, the character is taken to belong to the name. In this example the probability of "Fang" as a given-name character is above the threshold, whereas that of "is" is below it, so "Wang Fang" is recognized as a person name; (3) the place name recognition module 403 is similar to the person name recognition module 402: it holds a place-name suffix table and a Chinese character role probability table, and recognizes place names in the same way. The present invention can of course also recognize other kinds of specific words, so the specific word recognition unit 400 is not limited to the modules listed above.
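The person name rule in item (2) can be sketched as below. Everything named here is hypothetical: the function matchPersonName, the shape of the two tables, and the probability and threshold values in the usage note are illustrative assumptions rather than values from the patent.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Returns how many atoms, starting at `start`, form a person name
// (surname plus one or two given-name characters), or 0 if none is found.
std::size_t matchPersonName(const std::vector<std::string>& atoms,
                            std::size_t start,
                            const std::set<std::string>& surnames,
                            const std::map<std::string, double>& givenNameProb,
                            double threshold) {
    if (start >= atoms.size() || surnames.count(atoms[start]) == 0) return 0;
    std::size_t given = 0;
    // Check up to two characters after the surname, stopping at the first
    // one whose given-name probability does not exceed the threshold.
    while (given < 2 && start + 1 + given < atoms.size()) {
        const auto it = givenNameProb.find(atoms[start + 1 + given]);
        const double p = (it == givenNameProb.end()) ? 0.0 : it->second;
        if (p <= threshold) break;
        ++given;
    }
    return given == 0 ? 0 : 1 + given;
}
```

For the atoms 王, 芳, 是 with 王 in the surname table, an assumed probability of 0.6 for 芳 and 0.01 for 是, and a threshold of 0.1, the function returns 2, so the first two atoms are taken as the name.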
In another exemplary scheme, the above step is based on the specific word recognition unit 400 shown in Figure 3. In addition to the time and numeral recognition module 401, the person name recognition module 402 and the place name recognition module 403, this unit comprises a startup configuration module 404, which is connected to each of the three modules and starts them selectively as the situation requires. In practice it is not always necessary to recognize every kind of specific word, so recognizing selectively can improve the efficiency of Chinese word segmentation. In one embodiment of this scheme, the implementation is as follows: first, a configuration file is provided for use in the initialization phase:
<?xml version="1.0" encoding="GB2312"?>
<TseSegment>
<!-- Whether to perform time and numeral recognition: 1 = yes, 0 = no -->
<NumTime>1</NumTime>
<!-- Whether to perform person name recognition: 1 = yes, 0 = no -->
<Person>1</Person>
<!-- Whether to perform place name recognition: 1 = yes, 0 = no -->
<Location>1</Location>
</TseSegment>
This configuration file has three items, NumTime, Person and Location, which stand for time and numeral, person name and place name respectively. A global variable can be added for each item to indicate whether the corresponding module is needed, as follows:
// Configuration switches
bool g_bIsNumTime;   // time and numeral recognition
bool g_bIsPerson;    // person name recognition
bool g_bIsLocation;  // place name recognition
These variables are assigned when the program initializes. During segmentation the value of each variable is checked: if the value is 1, the corresponding module is run; otherwise the module is skipped.
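A minimal sketch of this initialization is given below. It assumes the configuration text has already been read into a string and uses a naive substring search instead of a real XML parser; the helper names readSwitch, initSwitches and enabledModules are hypothetical, and enabledModules merely lists the modules that would run.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Configuration switches, as in the description
bool g_bIsNumTime = false;
bool g_bIsPerson = false;
bool g_bIsLocation = false;

// Reads <tag>1</tag> or <tag>0</tag>; a missing tag counts as 0.
// Naive search, assumed adequate only for a flat file like the one shown.
bool readSwitch(const std::string& config, const std::string& tag) {
    const std::string open = "<" + tag + ">";
    const std::size_t pos = config.find(open);
    if (pos == std::string::npos) return false;
    const std::size_t valuePos = pos + open.size();
    return valuePos < config.size() && config[valuePos] == '1';
}

// Called once at program initialization.
void initSwitches(const std::string& config) {
    g_bIsNumTime = readSwitch(config, "NumTime");
    g_bIsPerson = readSwitch(config, "Person");
    g_bIsLocation = readSwitch(config, "Location");
}

// During segmentation only the enabled modules are run; this stand-in
// returns their names instead of calling real recognizers.
std::vector<std::string> enabledModules() {
    std::vector<std::string> modules;
    if (g_bIsNumTime) modules.push_back("NumTime");
    if (g_bIsPerson) modules.push_back("Person");
    if (g_bIsLocation) modules.push_back("Location");
    return modules;
}
```

Reading the switches once and testing plain booleans afterwards keeps the per-sentence cost of a disabled module close to zero, which is the efficiency gain the scheme aims for.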
In step S505, the segmentation path generation unit 600 generates an optimal segmentation path from the word segmentation graph.
In one exemplary scheme of the above step, the segmentation path generation unit 600 uses the probability information of each independent segmentation result to calculate the word arc probability at each node of the word segmentation graph, and takes the segmentation path with the largest product of word arc probabilities as the optimal segmentation path. The probability information of the independent segmentation results referred to in the present invention comprises the probability information of dictionary words and that of specific words: the former is the word-formation probability of the entries in the dictionary, and the latter is the initial probability, emission probability and transition probability of the hidden Markov model. In one embodiment of this scheme, as shown in Figure 7, the word segmentation graph has 7 nodes, numbered 0 to 6. All nodes are scanned from left to right. Let m be the probability value of the current node, a the maximum probability value of a predecessor node, and b the probability value of the arc joining that predecessor to the current node; the probability value of the current node is then m = a × b. The values calculated for all predecessors are compared, and the largest one is kept together with its predecessor node. This process is repeated; by the time the last node has been processed, every node has stored its best predecessor, so tracing back from the last node yields the optimal segmentation path.
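The left-to-right scan and the backward trace can be sketched as a small dynamic program. This is an illustration under stated assumptions: the names Arc and bestPath are hypothetical, node 0 is given probability 1, every arc is assumed to run from a lower-numbered node to a higher-numbered one, and the probabilities in the usage note are invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Arc {
    std::size_t from;
    std::size_t to;
    std::string word;
    double prob;
};

// Returns the words along the path from node 0 to the last node whose
// product of arc probabilities is largest, or an empty list if no path exists.
std::vector<std::string> bestPath(std::size_t nodeCount, const std::vector<Arc>& arcs) {
    if (nodeCount == 0) return {};
    std::vector<double> best(nodeCount, 0.0);   // best probability per node
    std::vector<int> via(nodeCount, -1);        // index of the best incoming arc
    best[0] = 1.0;
    for (std::size_t node = 1; node < nodeCount; ++node) {
        for (std::size_t k = 0; k < arcs.size(); ++k) {
            if (arcs[k].to != node || arcs[k].from >= node) continue;
            const double m = best[arcs[k].from] * arcs[k].prob;   // m = a * b
            if (m > best[node]) {
                best[node] = m;
                via[node] = static_cast<int>(k);
            }
        }
    }
    std::vector<std::string> words;
    std::size_t node = nodeCount - 1;
    while (node != 0) {                          // trace back from the last node
        if (via[node] < 0) return {};            // last node is unreachable
        const Arc& arc = arcs[static_cast<std::size_t>(via[node])];
        words.push_back(arc.word);
        node = arc.from;
    }
    std::reverse(words.begin(), words.end());
    return words;
}
```

For a 7-node graph with single-atom arcs "I" (0.5), "am" (0.5), "Zhang" (0.01), "Hui" (0.01), "Peng" (0.01) and "." (0.9), plus a name arc "ZhangHuiPeng" (0.05) from node 2 to node 5, the name arc wins because 0.05 is larger than 0.01 × 0.01 × 0.01. In a real system, summing logarithms instead of multiplying probabilities would avoid underflow on long sentences; that change is a suggestion and is not taken from the description.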
In step S506, the input/output unit 100 outputs the segmentation result according to the optimal segmentation path. In the foregoing embodiment, if the original Chinese text initially input is "I am Zhang Huipeng.", the segmentation result output along the optimal segmentation path in Figure 7 is "I / am / Zhang Huipeng.".
The above is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A Chinese word segmentation system comprising an input/output unit, an atom segmentation unit, a dictionary word segmentation unit and a specific word recognition unit, characterized in that the system further comprises a word segmentation graph unit and a segmentation path generation unit;
The word segmentation graph unit is connected to the atom segmentation unit, the dictionary word segmentation unit and the specific word recognition unit, and is used to store the independent segmentation results of the atom segmentation unit, the dictionary word segmentation unit and the specific word recognition unit in the word segmentation graph;
The segmentation path generation unit is connected to the word segmentation graph unit, and is used to generate an optimal segmentation path from the independent segmentation results in the word segmentation graph and to output a combined segmentation result according to the optimal segmentation path.
2. The Chinese word segmentation system according to claim 1, characterized in that the specific word recognition unit comprises a time and numeral recognition module, a person name recognition module and a place name recognition module;
The time and numeral recognition module has a time and numeral table, and is used to recognize time words and numerals and to save them in the word segmentation graph;
The person name recognition module has a surname table, and is used to recognize person names and to save them in the word segmentation graph;
The place name recognition module has a place-name suffix table, and is used to recognize place names and to save them in the word segmentation graph.
3. The Chinese word segmentation system according to claim 2, characterized in that the specific word recognition unit further comprises a startup configuration module connected to each of the time and numeral recognition module, the person name recognition module and the place name recognition module;
The startup configuration module is used to start the time and numeral recognition module, the person name recognition module and the place name recognition module selectively.
4. The Chinese word segmentation system according to claim 1, characterized in that the segmentation path generation unit is further used to calculate, from the probability information of each independent segmentation result, the word arc probability at each node of the word segmentation graph, and to take the segmentation path with the largest product of word arc probabilities as the optimal segmentation path.
5. A Chinese word segmentation method based on the Chinese word segmentation system according to claim 1, characterized in that the method comprises the following steps:
A. performing atom segmentation on the input Chinese text, and building an initial word segmentation graph from the resulting atom sequence;
B. performing dictionary word segmentation and specific word recognition separately on the basis of the atom sequence, and adding each independent segmentation result to the word segmentation graph;
C. generating an optimal segmentation path from the independent segmentation results in the word segmentation graph, and outputting a combined segmentation result according to the optimal segmentation path.
6. The Chinese word segmentation method according to claim 5, characterized in that, before step A, the method further comprises: storing a dictionary and specific word tables in the Chinese word segmentation system;
The dictionary contains common words;
The specific word tables comprise a time and numeral table, a surname table and a place-name suffix table.
7. The Chinese word segmentation method according to claim 6, characterized in that the dictionary word segmentation in step B comprises: using the forward maximum matching algorithm to compare the atom sequence against the entries in the dictionary, and determining the matched entries to be dictionary words.
8. The Chinese word segmentation method according to claim 6, characterized in that the specific word recognition in step B comprises performing at least one of the following three operations:
Recognizing time words and numerals using the time and numeral table;
Recognizing person names using the surname table;
Recognizing place names using the place-name suffix table.
9. The Chinese word segmentation method according to any one of claims 5 to 8, characterized in that step C further comprises: calculating, from the probability information of each independent segmentation result, the word arc probability at each node of the word segmentation graph, and taking the segmentation path with the largest product of word arc probabilities as the optimal segmentation path.
10. The Chinese word segmentation method according to claim 9, characterized in that the probability information of the independent segmentation results comprises the probability information of dictionary words and the probability information of specific words;
The probability information of a dictionary word is the word-formation probability of the entry in the dictionary;
The probability information of a specific word is the initial probability, emission probability and transition probability of the hidden Markov model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200710076131 CN101082908A (en) | 2007-06-26 | 2007-06-26 | Method and system for dividing Chinese sentences |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200710076131 CN101082908A (en) | 2007-06-26 | 2007-06-26 | Method and system for dividing Chinese sentences |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101082908A true CN101082908A (en) | 2007-12-05 |
Family
ID=38912484
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 200710076131 Pending CN101082908A (en) | 2007-06-26 | 2007-06-26 | Method and system for dividing Chinese sentences |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101082908A (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102033879A (en) * | 2009-09-27 | 2011-04-27 | 腾讯科技(深圳)有限公司 | Method and device for identifying Chinese name |
CN102298585A (en) * | 2010-06-24 | 2011-12-28 | 高德软件有限公司 | Address splitting and level marking method and device |
CN102314415A (en) * | 2010-07-08 | 2012-01-11 | 盛乐信息技术(上海)有限公司 | Discriminant word segmentation system and method using idiom knowledge |
CN102385700A (en) * | 2010-09-01 | 2012-03-21 | 汉王科技股份有限公司 | Off-line handwriting recognizing method and device |
CN102799676A (en) * | 2012-07-18 | 2012-11-28 | 上海语天信息技术有限公司 | Recursive and multilevel Chinese word segmentation method |
CN103559178A (en) * | 2013-05-31 | 2014-02-05 | 武汉中文百科网络有限公司 | System and method for switching between simplified Chinese characters and traditional Chinese characters on Internet |
WO2015035821A1 (en) * | 2013-09-16 | 2015-03-19 | Tencent Technology (Shenzhen) Company Limited | Methods and systems for query segmentation in a search |
CN104462058A (en) * | 2014-10-24 | 2015-03-25 | 腾讯科技(深圳)有限公司 | Character string identification method and device |
CN105373529A (en) * | 2015-10-28 | 2016-03-02 | 甘肃智呈网络科技有限公司 | Intelligent word segmentation method based on hidden Markov model |
CN106507321A (en) * | 2016-11-22 | 2017-03-15 | 新疆农业大学 | The bilingual GSM message breath voice conversion broadcasting system of a kind of dimension, the Chinese |
CN106610947A (en) * | 2016-08-25 | 2017-05-03 | 四川用联信息技术有限公司 | New Chinese automatic word segmentation algorithm |
CN106610936A (en) * | 2016-09-12 | 2017-05-03 | 四川用联信息技术有限公司 | Improved automatic Chinese word segmentation algorithm |
CN107305630A (en) * | 2016-04-25 | 2017-10-31 | 腾讯科技(深圳)有限公司 | Text sequence recognition methods and device |
CN107423288A (en) * | 2017-07-05 | 2017-12-01 | 达而观信息科技(上海)有限公司 | A kind of Chinese automatic word-cut and method based on unsupervised learning |
CN107608965A (en) * | 2017-09-14 | 2018-01-19 | 掌阅科技股份有限公司 | Extracting method, electronic equipment and the storage medium of books the names of protagonists |
CN108197116A (en) * | 2018-01-31 | 2018-06-22 | 天闻数媒科技(北京)有限公司 | A kind of method, apparatus, participle equipment and the storage medium of Chinese text participle |
CN109033085A (en) * | 2018-08-02 | 2018-12-18 | 北京神州泰岳软件股份有限公司 | The segmenting method of Chinese automatic word-cut and Chinese text |
CN109918665A (en) * | 2019-03-05 | 2019-06-21 | 湖北亿咖通科技有限公司 | Segmenting method, device and the electronic equipment of text |
CN110619122A (en) * | 2019-09-19 | 2019-12-27 | 中国联合网络通信集团有限公司 | Word segmentation processing method, device and equipment and computer readable storage medium |
CN110751234A (en) * | 2019-10-09 | 2020-02-04 | 科大讯飞股份有限公司 | OCR recognition error correction method, device and equipment |
CN115759087A (en) * | 2022-11-25 | 2023-03-07 | 成都赛力斯科技有限公司 | Chinese word segmentation method and device and electronic equipment |
- 2007-06-26: CN application CN 200710076131, published as CN101082908A (status: Pending)
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102033879A (en) * | 2009-09-27 | 2011-04-27 | 腾讯科技(深圳)有限公司 | Method and device for identifying Chinese name |
CN102033879B (en) * | 2009-09-27 | 2015-02-18 | 深圳市世纪光速信息技术有限公司 | Method and device for identifying Chinese name |
CN102298585A (en) * | 2010-06-24 | 2011-12-28 | 高德软件有限公司 | Address splitting and level marking method and device |
CN102314415A (en) * | 2010-07-08 | 2012-01-11 | 盛乐信息技术(上海)有限公司 | Discriminant word segmentation system and method using idiom knowledge |
CN102385700A (en) * | 2010-09-01 | 2012-03-21 | 汉王科技股份有限公司 | Off-line handwriting recognizing method and device |
CN102385700B (en) * | 2010-09-01 | 2013-05-29 | 汉王科技股份有限公司 | Off-line handwriting recognizing method and device |
CN102799676A (en) * | 2012-07-18 | 2012-11-28 | 上海语天信息技术有限公司 | Recursive and multilevel Chinese word segmentation method |
CN102799676B (en) * | 2012-07-18 | 2015-02-18 | 上海语天信息技术有限公司 | Recursive and multilevel Chinese word segmentation method |
CN103559178A (en) * | 2013-05-31 | 2014-02-05 | 武汉中文百科网络有限公司 | System and method for switching between simplified Chinese characters and traditional Chinese characters on Internet |
WO2015035821A1 (en) * | 2013-09-16 | 2015-03-19 | Tencent Technology (Shenzhen) Company Limited | Methods and systems for query segmentation in a search |
US10061844B2 (en) | 2013-09-16 | 2018-08-28 | Tencent Technology (Shenzhen) Company Limited | Methods and systems for query segmentation in a search |
US11003700B2 (en) | 2013-09-16 | 2021-05-11 | Tencent Technology (Shenzhen) Company Limited | Methods and systems for query segmentation in a search |
CN104462058A (en) * | 2014-10-24 | 2015-03-25 | 腾讯科技(深圳)有限公司 | Character string identification method and device |
CN104462058B (en) * | 2014-10-24 | 2018-10-02 | 腾讯科技(深圳)有限公司 | Character string identification method and device |
CN105373529A (en) * | 2015-10-28 | 2016-03-02 | 甘肃智呈网络科技有限公司 | Intelligent word segmentation method based on hidden Markov model |
CN105373529B (en) * | 2015-10-28 | 2018-04-20 | 甘肃智呈网络科技有限公司 | A kind of Word Intelligent Segmentation method based on Hidden Markov Model |
CN107305630A (en) * | 2016-04-25 | 2017-10-31 | 腾讯科技(深圳)有限公司 | Text sequence recognition methods and device |
CN107305630B (en) * | 2016-04-25 | 2021-03-19 | 腾讯科技(深圳)有限公司 | Text sequence identification method and device |
CN106610947A (en) * | 2016-08-25 | 2017-05-03 | 四川用联信息技术有限公司 | New Chinese automatic word segmentation algorithm |
CN106610936A (en) * | 2016-09-12 | 2017-05-03 | 四川用联信息技术有限公司 | Improved automatic Chinese word segmentation algorithm |
CN106507321A (en) * | 2016-11-22 | 2017-03-15 | 新疆农业大学 | The bilingual GSM message breath voice conversion broadcasting system of a kind of dimension, the Chinese |
CN107423288A (en) * | 2017-07-05 | 2017-12-01 | 达而观信息科技(上海)有限公司 | A kind of Chinese automatic word-cut and method based on unsupervised learning |
CN107608965B (en) * | 2017-09-14 | 2018-10-19 | 掌阅科技股份有限公司 | Extracting method, electronic equipment and the storage medium of books the names of protagonists |
CN107608965A (en) * | 2017-09-14 | 2018-01-19 | 掌阅科技股份有限公司 | Extracting method, electronic equipment and the storage medium of books the names of protagonists |
CN108197116B (en) * | 2018-01-31 | 2021-05-28 | 天闻数媒科技(北京)有限公司 | Method and device for segmenting Chinese text, segmentation equipment and storage medium |
CN108197116A (en) * | 2018-01-31 | 2018-06-22 | 天闻数媒科技(北京)有限公司 | A kind of method, apparatus, participle equipment and the storage medium of Chinese text participle |
CN109033085A (en) * | 2018-08-02 | 2018-12-18 | 北京神州泰岳软件股份有限公司 | The segmenting method of Chinese automatic word-cut and Chinese text |
CN109918665A (en) * | 2019-03-05 | 2019-06-21 | 湖北亿咖通科技有限公司 | Segmenting method, device and the electronic equipment of text |
CN109918665B (en) * | 2019-03-05 | 2021-11-02 | 湖北亿咖通科技有限公司 | Word segmentation method and device for text and electronic equipment |
CN110619122A (en) * | 2019-09-19 | 2019-12-27 | 中国联合网络通信集团有限公司 | Word segmentation processing method, device and equipment and computer readable storage medium |
CN110619122B (en) * | 2019-09-19 | 2023-08-22 | 中国联合网络通信集团有限公司 | Word segmentation processing method, device, equipment and computer readable storage medium |
CN110751234A (en) * | 2019-10-09 | 2020-02-04 | 科大讯飞股份有限公司 | OCR recognition error correction method, device and equipment |
CN110751234B (en) * | 2019-10-09 | 2024-04-16 | 科大讯飞股份有限公司 | OCR (optical character recognition) error correction method, device and equipment |
CN115759087A (en) * | 2022-11-25 | 2023-03-07 | 成都赛力斯科技有限公司 | Chinese word segmentation method and device and electronic equipment |
CN115759087B (en) * | 2022-11-25 | 2024-02-20 | 重庆赛力斯凤凰智创科技有限公司 | Chinese word segmentation method and device and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101082908A (en) | Method and system for dividing Chinese sentences | |
CN100504851C (en) | Chinese character word distinguishing method and system | |
Dell’Orletta | Ensemble system for Part-of-Speech tagging | |
CN102135814B (en) | A kind of character and word input method and system | |
CN108595696A (en) | A kind of human-computer interaction intelligent answering method and system based on cloud platform | |
CN111444330A (en) | Method, device and equipment for extracting short text keywords and storage medium | |
CN107239512B (en) | A kind of microblogging comment spam recognition methods of combination comment relational network figure | |
CN102999534A (en) | Chinese word segmentation algorithm based on reverse maximum matching | |
EP3483747A1 (en) | Preserving and processing ambiguity in natural language | |
CN107818141A (en) | Incorporate the biomedical event extraction method of structuring key element identification | |
CN101082909A (en) | Method and system for dividing Chinese sentences for recognizing deriving word | |
CN110096599B (en) | Knowledge graph generation method and device | |
CN106383814A (en) | Word segmentation method of English social media short text | |
CN107153469B (en) | Method for searching input data for matching candidate items, database creation method, database creation device and computer program product | |
CN102339294A (en) | Searching method and system for preprocessing keywords | |
CN107256212A (en) | Chinese search word intelligence cutting method | |
CN105814556A (en) | Context sensitive input tools | |
Perez-Cortes et al. | Stochastic error-correcting parsing for OCR post-processing | |
CN105404677A (en) | Tree structure based retrieval method | |
CN106681981A (en) | Chinese part-of-speech tagging method and device | |
CN103729343A (en) | Semantic ambiguity eliminating method based on encyclopedia link co-occurrence | |
JP2009098952A (en) | Information retrieval system | |
Ilievski et al. | Context-enhanced adaptive entity linking | |
CN112560489A (en) | Entity linking method based on Bert | |
Gholami-Dastgerdi et al. | Part of speech tagging using part of speech sequence graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Open date: 20071205 |