CN109992776B

CN109992776B - Chinese word segmentation method

Info

Publication number: CN109992776B
Application number: CN201910231584.6A
Authority: CN
Inventors: 孙晓光; 张海风; 郝玲风
Original assignee: Beijing Bray Tongyun Culture Communication Co ltd
Current assignee: Beijing Bray Tongyun Culture Communication Co ltd
Priority date: 2019-03-26
Filing date: 2019-03-26
Publication date: 2023-07-25
Anticipated expiration: 2039-03-26
Also published as: CN109992776A

Abstract

The invention relates to a Chinese word segmentation method comprising the following steps: generating word data to be recognized according to the current Chinese character; looking up the current word data to be recognized in a Chinese word segmentation tree; when a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, determining whether a next Chinese character after the current Chinese character exists in the text data to be recognized; when the next Chinese character exists, updating the word data to be recognized according to the next Chinese character, and determining whether a Chinese word corresponding to the updated word data to be recognized exists in the Chinese word segmentation tree; when no Chinese word corresponding to the word data to be recognized exists in the Chinese word segmentation tree, generating a first recognized word from the one or more Chinese characters preceding the current Chinese character; and when no next Chinese character after the current Chinese character exists in the text data to be recognized, generating first word segmentation result data according to the first recognized words.

Description

Chinese word segmentation method

Technical Field

The invention relates to the technical field of data processing, in particular to a Chinese word segmentation method.

Background

Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words, that is, recombining a continuous character sequence into a word sequence according to a given specification. Words in Chinese text are not separated by any formal delimiter, so the text cannot simply be split on delimiters, which makes Chinese word segmentation difficult. Moreover, some sentences admit multiple segmentation results, and determining which result best matches the intended semantics remains a challenge in the field of Chinese word segmentation.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a Chinese word segmentation method, which divides sentences in text data to be identified into a plurality of words with complete word meanings according to the character sequence in the text data to be identified through a Chinese word segmentation tree with a tree-shaped hierarchical structure, and obtains a final word segmentation result according to the weights of the words, so that the obtained word segmentation result is more accurate.

In order to achieve the above object, the present invention provides a Chinese word segmentation method, the method comprising:

the semantic processing system receives text data to be identified;

extracting Chinese characters in the text data to be recognized according to the character sequence, and generating word data to be recognized according to the current Chinese characters;

looking up the current word data to be recognized in a Chinese word segmentation tree, and determining whether a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree;

when the Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, determining whether the next Chinese character of the current Chinese character exists in the text data to be recognized according to the character sequence of the current Chinese character in the text data to be recognized;

extracting the next Chinese character in the text data to be recognized according to the character sequence when the next Chinese character exists in the text data to be recognized;

updating the word data to be identified according to the next Chinese character, and determining whether a Chinese word corresponding to the updated word data to be identified exists in the Chinese word segmentation tree;

when no Chinese word corresponding to the word data to be recognized exists in the Chinese word segmentation tree, acquiring the one or more Chinese characters preceding the current Chinese character from the text data to be recognized according to the character sequence of the current Chinese character in the text data to be recognized, and generating a first recognized word according to the one or more Chinese characters preceding the current Chinese character;

and determining whether a next Chinese character of the current Chinese character exists in the text data to be recognized according to the character sequence of the current Chinese character in the text data to be recognized;

and when the next Chinese character of the current Chinese character does not exist in the text data to be recognized, generating first word segmentation result data according to the first recognized word.

Preferably, the Chinese words in the Chinese word segmentation tree comprise word integrity identifications.

Further preferably, when a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, the method further includes:

determining whether the word integrity mark of the current Chinese word is an integrity mark;

and when the word integrity mark of the current Chinese word is the integrity mark, generating a word segmentation position mark for the position of the current Chinese character in the text data to be identified.

Further preferably, after the generating the first word segmentation result data according to the first recognized word, the method further includes:

determining whether the word segmentation position marks exist in the text data to be identified;

when the word segmentation position marks exist in the text data to be identified, new text data to be identified is intercepted from the text data to be identified according to the word segmentation position marks;

performing word matching on the new text data to be recognized according to the Chinese word segmentation tree to obtain a second recognized word;

and generating second word segmentation result data according to the second recognized words.

Further preferably, the Chinese words in the Chinese word segmentation tree include weight information.

Further preferably, the first recognized word and the second recognized word include the weight information.

Further preferably, after the generating of the second word segmentation result data from the second recognized word, the method further comprises:

obtaining a score value of the first word segmentation result data according to the weight information of the first recognized words in the first word segmentation result data, and obtaining a score value of the second word segmentation result data according to the weight information of the second recognized words in the second word segmentation result data;

and determining optimal word segmentation result data from the first word segmentation result data and the second word segmentation result data according to the score value of the first word segmentation result data and the score value of the second word segmentation result data.

Further preferably, before the word data to be recognized is looked up in the Chinese word segmentation tree, the method further includes:

the processor in the semantic processing system trains the Chinese word segmentation tree according to Chinese words in a plurality of preset training databases;

the processor obtains weight information of the Chinese words according to the occurrence frequency of the Chinese words, so that the Chinese words in the Chinese word segmentation tree comprise the weight information.

Further preferably, before the processor obtains the weight information of the Chinese words according to the occurrence frequency of the Chinese words, the method further includes:

and determining the occurrence frequency of the Chinese words according to the application field priority information corresponding to the preset training database.

Preferably, determining whether a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree specifically includes:

and the processor determines whether the Chinese word corresponding to the word data to be identified exists in the Chinese word segmentation tree according to the corresponding relation between each father node and each son node in the Chinese word segmentation tree.

According to the Chinese word segmentation method provided by the embodiment of the invention, through the Chinese word segmentation tree with the tree-shaped hierarchical structure, according to the character sequence in the text data to be identified, sentences in the text data to be identified are split into a plurality of words with complete word meanings, and the final word segmentation result is obtained according to the weights of the words, so that the obtained word segmentation result is more accurate.

Drawings

Fig. 1 is a flowchart of a method for word segmentation in chinese according to an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

The embodiment of the invention firstly provides a Chinese word segmentation method which is realized in a semantic processing system and is used for segmenting a Chinese sentence into individual words, wherein the flow chart of the method is shown in figure 1 and comprises the following steps:

step 101, a semantic processing system receives text data to be identified;

Specifically, a semantic processing system can be understood as a system with sentence input, processing and output functions. The semantic processing system includes a voice converter, an input queue, a poller, and a processor. When the semantic processing system starts, a listener configured in the system output page is started. The listener loads the configuration files for the voice service, the domain classes, the user configuration files corresponding to each domain, and the sentences that the semantic processing system outputs under specific conditions, and at the same time starts the voice converter, the input queue, the poller, and the processor.

When a user wants a piece of Chinese speech or text content recognized, the user inputs the data to be recognized into the semantic processing system. The data to be recognized includes voice data to be recognized and text data to be recognized. The voice converter in the semantic processing system receives the data to be recognized and performs speech recognition on any voice data to obtain the corresponding text data to be recognized. That text data, or the text data to be recognized input directly by the user, is then inserted at the tail of the input queue in the semantic processing system.

The poller in the semantic processing system always monitors whether the input queue has new messages, namely whether text data to be recognized enter the queue, and acquires the text data to be recognized at the tail of the input queue from the input queue.

Step 102, extracting Chinese characters in text data to be recognized according to the character sequence, and generating word data to be recognized according to the current Chinese characters;

specifically, a processor in the semantic processing system firstly extracts a Chinese character which appears first in the text data to be recognized according to the character sequence in the text data to be recognized, and generates word data to be recognized according to the Chinese character which appears first.

In a specific example, if the text data to be recognized input by the user is "the vitiligo patient has a cold", the processor in the semantic processing system first extracts, according to the character sequence, the first Chinese character appearing in the text, "white" (白, the first character of "vitiligo", 白癜风), as the word data to be recognized.

Step 103, the current word data to be identified is carried into a Chinese word segmentation tree, and whether the Chinese word corresponding to the current word data to be identified exists in the Chinese word segmentation tree is determined;

Specifically, the Chinese word segmentation tree may be understood as a rule set with a tree structure, or as a word stock with a tree structure. The Chinese words in the Chinese word segmentation tree include word integrity identifications. An integrity identification may be understood as a flag indicating whether the current word is a word with a complete meaning.
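The tree-structured word stock with integrity identifications can be sketched as a character trie. The following is a minimal illustration, not the patented implementation: the names `Node`, `insert` and `find` are hypothetical, and the Chinese strings are assumed sample words ("white" and "vitiligo") rather than text quoted from the source.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    complete: bool = False  # word integrity identification
    children: dict = field(default_factory=dict)  # next character -> child Node


def insert(root, word):
    """Add `word` to the tree; only its last node is flagged complete."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.complete = True


def find(root, text):
    """Follow `text` from the root; return the node reached, or None."""
    node = root
    for ch in text:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


root = Node()
for word in ("白", "白癜风"):  # assumed sample words: "white", "vitiligo"
    insert(root, word)

print(find(root, "白").complete)    # node exists and is a complete word
print(find(root, "白癜").complete)  # node exists but is only a prefix
print(find(root, "白风"))           # no such path in the tree
```

In this sketch a prefix such as 白癜 exists as a node without being a complete word, which is the distinction the integrity identification is described as capturing.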

Before the word data to be recognized is looked up in the Chinese word segmentation tree, the Chinese word segmentation tree must be trained to determine its tree structure and the weight information of each word.

Further specifically, the processor may be connected to a plurality of preset training databases and trains the Chinese word segmentation tree on the Chinese words in those databases. Each preset training database corresponds to an application field, and each application field has priority information. During training, the processor determines the occurrence frequency of the Chinese words according to the application field priority information set by the user. That is, when the processor acquires Chinese words from the preset training databases, the higher the priority of the application field corresponding to a database, the more often Chinese words are acquired from that database, and therefore the higher the occurrence frequency of those words during training. The processor then determines the weight information of each Chinese word according to its occurrence frequency, so that the Chinese words in the trained Chinese word segmentation tree include weight information.
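One way to read the priority-to-frequency-to-weight relationship is sketched below. The source does not give a formula, so this sketch assumes that a field's priority acts as a multiplier on each occurrence and that the weight equals the resulting count. The function `train_weights`, the field names and the sample words are all hypothetical.

```python
from collections import Counter


def train_weights(databases, priorities):
    """Count word occurrences, sampling higher-priority fields more often.

    Assumption: a field with priority p contributes each of its words p times,
    and a word's weight is simply its total occurrence count.
    """
    frequency = Counter()
    for field_name, words in databases.items():
        repeat = priorities.get(field_name, 1)
        for word in words:
            frequency[word] += repeat
    return dict(frequency)


# Hypothetical training databases, one per application field.
databases = {
    "medical": ["白癜风", "感冒"],
    "sports": ["乒乓球", "感冒"],
}
priorities = {"medical": 3, "sports": 1}  # user-set field priorities (assumed)

weights = train_weights(databases, priorities)
print(weights)
```

Under these assumptions a word that appears in a high-priority field, or in several fields, ends up with a larger weight than a word seen only in a low-priority field.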

After training the Chinese word segmentation tree, the processor brings word data to be identified into the trained Chinese word segmentation tree, and determines whether Chinese words corresponding to the word data to be identified exist in the Chinese word segmentation tree. If there is a chinese word corresponding to the current word data to be recognized in the chinese word segmentation tree, the following step 104 is performed, and if there is no chinese word corresponding to the current word data to be recognized in the chinese word segmentation tree, the following step 105 is performed.

In some preferred embodiments, when determining whether a Chinese word corresponding to the word data to be recognized exists in the Chinese word segmentation tree, the processor searches according to the tree structure, that is, the correspondence between each parent node and its child nodes. If the previous lookup found a Chinese word corresponding to the word data to be recognized, the lookup for the updated word data only needs to search among the child nodes of the node found previously, which saves processing time.
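The time saving described here comes from keeping a cursor on the last matched node instead of restarting from the root for every updated word. A minimal sketch using nested dictionaries as the tree follows; `build` and `step` are hypothetical names, and the sample strings are assumed.

```python
def build(words):
    """Build a nested-dict tree: each key is a character, each value its children."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def step(node, ch):
    """Look for `ch` only among the children of the previously matched node."""
    return node.get(ch)


root = build(["乒乓球拍", "拍卖"])  # assumed sample words

node = root
path = []
for ch in "乒乓球卖":
    child = step(node, ch)
    if child is None:
        break  # no child of the last matched node carries this character
    node = child
    path.append(ch)

matched = "".join(path)
print(matched)
```

Each character costs one dictionary lookup at the current node, so extending the word data by one character does not repeat the work already done for its prefix.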

Step 104, determining the word integrity mark of the current Chinese word;

Specifically, when a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, the processor determines whether the word integrity identification of that Chinese word is a complete identification, that is, whether the current word data to be recognized has a complete meaning. Note that a Chinese word whose integrity identification is a complete identification is not necessarily a leaf node of the Chinese word segmentation tree.

In a specific example, the Chinese word segmentation tree includes the three Chinese words "white" (白) - "Bai Dian" (白癜) - "vitiligo" (白癜风), wherein the integrity identifications of "white" and "vitiligo" are complete identifications and that of "Bai Dian" is an incomplete identification. As another example, the Chinese word segmentation tree also includes the four Chinese words "ping" (乒) - "ping-pong" (乒乓) - "table tennis ball" (乒乓球) - "table tennis racket" (乒乓球拍), wherein the integrity identifications of "ping" and "ping-pong" are incomplete identifications and those of "table tennis ball" and "table tennis racket" are complete identifications. Although both "table tennis ball" and "table tennis racket" carry complete identifications, only "table tennis racket" is a leaf node of the Chinese word segmentation tree.

When the word integrity mark of the current Chinese word is the integrity mark, the processor generates a word segmentation position mark for the position of the current Chinese character in the text data to be identified.
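Generating word segmentation position marks can be sketched as recording every offset at which the walk through the tree reaches a node flagged complete. This is an illustrative reading only: `Node`, `build` and `position_marks` are hypothetical names, marks are represented as character offsets (an assumption), and the Chinese strings are assumed reconstructions of the table tennis example.

```python
class Node:
    def __init__(self):
        self.children = {}     # next character -> child Node
        self.complete = False  # word integrity identification


def build(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.complete = True
    return root


def position_marks(text, root):
    """Offsets in `text` just after each character that ends a complete word."""
    marks, node = [], root
    for i, ch in enumerate(text):
        node = node.children.get(ch)
        if node is None:
            break  # the walk from the start of `text` has left the tree
        if node.complete:
            marks.append(i + 1)
    return marks


root = build(["乒乓球", "乒乓球拍"])  # assumed: "table tennis ball", "table tennis racket"
marks = position_marks("乒乓球拍卖了", root)
print(marks)
```

A mark that falls before the end of the longest match signals that a shorter complete word was passed over, which is what later allows an alternative segmentation to be tried.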

Step 105, generating a first recognized word according to the one or more Chinese characters preceding the current Chinese character;

Specifically, when no Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, the processor obtains the one or more Chinese characters preceding the current Chinese character in the text data to be recognized, according to the character order, and generates a first recognized word from them. Generating a recognized word may be understood as obtaining a Chinese word with a complete meaning from the text data currently being recognized. The first recognized word may be understood as a word recognized by following the character order of the text data to be recognized down the Chinese word segmentation tree to the deepest matching child node.

In a specific example, the semantic processing system receives text data to be recognized, input by the user, asking what a vitiligo patient who has caught a cold should take. After extracting the first Chinese character "white" (白) from the text data to be recognized, the processor obtains the current word data to be recognized, "white", from that character. The processor determines that the current word data to be recognized, "white", exists in the Chinese word segmentation tree and that a next Chinese character, "Dian" (癜), follows the current Chinese character "white" in the text. It therefore extracts "Dian", updates the word data to be recognized to "Bai Dian" (白癜) from the previous character "white" and the current character "Dian", and confirms that the updated word data exists in the tree. It then determines that the next Chinese character "wind" (风) follows "Dian" and, by the same process, updates the word data to be recognized to "vitiligo" (白癜风). However, after the processor extracts the next Chinese character, "suffering" (患), no Chinese word identical to the current word data to be recognized, "vitiligo" plus "suffering" (白癜风患), is found in the Chinese word segmentation tree. The processor therefore takes the Chinese characters preceding the current character "suffering" in the text, namely "vitiligo", as the matched characters and generates the first recognized word "vitiligo" from them.

After the steps 104 and 105 are performed, step 106 is required to determine whether the next chinese character exists in the text data to be recognized, that is, to continue to determine whether all chinese characters in the text data to be recognized have been extracted.

Step 106, determining whether the next Chinese character exists in the text data to be recognized;

specifically, when the chinese word corresponding to the current word data to be recognized exists in the chinese word segmentation tree, the processor determines whether the next chinese character exists in the text data to be recognized according to the character sequence of the current chinese character in the text data to be recognized, which can also be understood as a process of determining whether all the chinese characters in the text data to be recognized have been extracted.

When the next chinese character of the current chinese character exists in the text data to be recognized, the following step 107 is executed to indicate that the chinese character in the text data to be recognized has not been completely extracted, and when the next chinese character of the current chinese character does not exist in the text data to be recognized, the following step 108 is executed to indicate that the chinese character in the text data to be recognized has been completely extracted.

Step 107, extracting the next Chinese character in the text data to be recognized according to the character sequence, and updating the word data to be recognized;

Specifically, when a next Chinese character after the current Chinese character exists in the text data to be recognized, the semantic processing system extracts the next Chinese character according to the character order and updates the word data to be recognized with it. The method then returns to step 103: the updated word data to be recognized is looked up in the Chinese word segmentation tree to determine whether a corresponding Chinese word exists. This loop repeats until step 106 determines that no next Chinese character exists in the text data to be recognized, at which point step 108 is executed.

Step 108, generating first word segmentation result data according to the first recognized words;

specifically, when the next chinese character of the current chinese character does not exist in the text data to be recognized, that is, when the chinese characters in the text data to be recognized have been all extracted, the processor has obtained one or more first recognized words, and then generates first word segmentation result data according to the one or more first recognized words. The first word segmentation result data is a word segmentation result corresponding to the first recognized word.
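Steps 102 to 108 can be sketched as a single forward pass that extends the word data one character at a time and emits a recognized word whenever the tree has no match. The sketch below makes two assumptions the source does not state: matching restarts from the current character after a word is emitted, and a character that does not occur in the tree at all is emitted as a single-character word. The names `build` and `segment` and the sample strings are hypothetical.

```python
def build(words):
    """Nested-dict tree: each key is a character, each value its children."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def segment(text, root):
    """Forward pass producing the first recognized words, in text order."""
    words, current, node = [], "", root
    for ch in text:
        child = node.get(ch)
        if child is None and current:
            # No word matches current + ch: the preceding characters
            # become a recognized word, and matching restarts at ch.
            words.append(current)
            current, node = "", root
            child = node.get(ch)
        if child is None:
            # Assumption: a character unknown to the tree stands alone.
            words.append(ch)
            continue
        current += ch
        node = child
    if current:
        words.append(current)  # text exhausted: flush the last word
    return words


root = build(["白", "白癜风", "患者", "感冒"])  # assumed sample words
result = segment("白癜风患者感冒了", root)
print(result)
```

The returned list plays the role of the first word segmentation result data; this sketch ignores integrity identifications and weights, which are handled in separate steps.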

It should be noted that, in some cases, one piece of text data to be recognized may correspond to more than one word segmentation result. For example, the sentence "table tennis auctioned" (乒乓球拍卖了) can be segmented either as "table tennis ball" / "auctioned" (乒乓球 / 拍卖 / 了) or as "table tennis racket" / "sold" (乒乓球拍 / 卖 / 了). Both results are semantically fluent, but only one of them can be used in subsequent semantic matching. That is, the semantic processing system must not only obtain the word segmentation result corresponding to the first recognized words, but also determine whether the current text data to be recognized has a plurality of word segmentation results, that is, whether a second word segmentation result exists. When several word segmentation results exist, the system selects from among them the one most likely to match the intended semantics.

It is understood that, in contrast to the first recognized word, the second recognized word is a word obtained by re-segmenting the text according to a word segmentation position mark.

Therefore, after the first recognized word is generated, step 109 is also performed.

Step 109, determining whether a word segmentation position mark exists in the text data to be identified;

Specifically, when a word segmentation position mark exists in the text data to be recognized, the processor determines that the current text data to be recognized may correspond to a plurality of word segmentation results, and the following step 110 is executed. When no word segmentation position mark exists in the text data to be recognized, the processor determines that the current text data to be recognized does not correspond to a plurality of word segmentation results, and the following step 113 is executed.

Step 110, new text data to be recognized is intercepted from the text data to be recognized according to the word segmentation position marks;

specifically, when the word segmentation position marks exist in the text data to be identified, the processor intercepts new text data to be identified backwards from the text data to be identified according to the word segmentation position marks.

In a specific example, the Chinese word segmentation tree further comprises the four Chinese words "ping" (乒), "ping-pong" (乒乓), "table tennis ball" (乒乓球) and "table tennis racket" (乒乓球拍), wherein the integrity identifications of "ping" and "ping-pong" are incomplete identifications, and those of "table tennis ball" and "table tennis racket" are complete identifications. When the text data to be recognized is "table tennis auctioned" (乒乓球拍卖了) and the processor has built up the word data to be recognized "table tennis ball", it determines that the integrity identification of "table tennis ball" is a complete identification, generates a word segmentation position mark after "table tennis ball" in the text, and, according to that mark, intercepts the new text data to be recognized that follows "table tennis ball".

Step 111, performing word matching on the new text data to be identified according to the Chinese word segmentation tree to obtain a second identified word;

Specifically, the processor first extracts the Chinese characters in the new text data to be recognized according to the character order and generates word data to be recognized from the current Chinese character. It then looks up the current word data to be recognized in the Chinese word segmentation tree and determines whether a corresponding Chinese word exists. As with the first recognized words in step 105, when no Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, the processor generates a second recognized word from the one or more Chinese characters preceding the current Chinese character.

However, unlike the first recognized words, the second recognized words include not only the words obtained by matching the new text data to be recognized against the Chinese word segmentation tree, but also the word that precedes the new text data to be recognized, that is, the preceding text whose integrity identification is a complete identification. For example, if the original text data to be recognized input by the user is "table tennis auctioned" (乒乓球拍卖了), the processor generates a word segmentation position mark after "table tennis ball" (乒乓球) and intercepts the new text data to be recognized that follows it. Both "table tennis ball", which precedes the new text data to be recognized, and "auction" (拍卖), which is matched within it, are second recognized words.

Step 112, generating second word segmentation result data according to the second recognized words;

specifically, when the chinese characters in the new text data to be recognized have been extracted in their entirety, the processor has obtained one or more second recognized words and then generates second word segmentation result data based on the one or more second recognized words. The second word segmentation result data is a word segmentation result corresponding to the second recognized word.

After this step is performed, it is necessary to return to step 109 to determine whether there are any other word segmentation position markers in the current new text data to be identified. That is, the second word segmentation result data may be plural, and the processor needs to obtain each word segmentation result.
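Steps 109 to 112 can be sketched as follows: run the forward pass once, remember every position where a complete word ended, then re-segment the remainder from each such position and keep any result that differs from those already found. This is one possible reading, not the claimed procedure. The tuple layout of the marks, the recursion and the deduplication are assumptions, and `Node`, `build`, `forward` and `segmentations` are hypothetical names.

```python
class Node:
    def __init__(self):
        self.children = {}     # next character -> child Node
        self.complete = False  # word integrity identification


def build(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.complete = True
    return root


def forward(text, root):
    """Forward pass; returns (words, marks).

    Each mark is (word_index, start, end): text[start:end] is a complete
    word that begins where the word at `word_index` begins.
    """
    words, marks = [], []
    node, start = root, 0
    for i, ch in enumerate(text):
        child = node.children.get(ch)
        if child is None and i > start:
            words.append(text[start:i])  # emit the word built so far
            node, start = root, i
            child = root.children.get(ch)
        if child is None:
            words.append(ch)  # assumption: unknown character stands alone
            start = i + 1
            continue
        node = child
        if node.complete:
            marks.append((len(words), start, i + 1))
    if start < len(text):
        words.append(text[start:])
    return words, marks


def segmentations(text, root):
    """First result plus every distinct alternative reachable from a mark."""
    first, marks = forward(text, root)
    results = [first]
    for index, start, end in marks:
        head = first[:index] + [text[start:end]]
        for tail in segmentations(text[end:], root):
            candidate = head + tail
            if candidate not in results:
                results.append(candidate)
    return results


root = build(["乒乓球", "乒乓球拍", "卖", "拍卖"])  # assumed sample words
results = segmentations("乒乓球拍卖了", root)
for words in results:
    print(" / ".join(words))
```

The recursion mirrors the return to step 109 for each newly intercepted text. It can grow quickly on long inputs, so a practical implementation would likely bound or memoize it.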

Step 113, determining the optimal word segmentation result data and outputting the data;

Specifically, when the semantic processing system determines that the text data to be recognized has a plurality of word segmentation results, the processor determines the optimal word segmentation result among them and outputs it.

Further specifically, when the processor finds the Chinese word corresponding to a first recognized word or a second recognized word in the Chinese word segmentation tree, it determines the weight information of that recognized word from the weight information of the Chinese word in the tree, so that the first recognized words and the second recognized words include weight information. The processor then obtains a score value for the first word segmentation result data from the weight information of the first recognized words in it, and a score value for the second word segmentation result data from the weight information of the second recognized words in it. Comparing the two score values, the processor determines the word segmentation result data with the highest score value to be the optimal word segmentation result data.

It will be appreciated that if there are a plurality of second word segmentation result data, the processor determines a score value for each of them, compares all the word segmentation result data obtained, and determines the word segmentation result data with the highest score value to be the optimal word segmentation result data.

In an example of Chinese word segmentation according to the above steps, the text data to be recognized input by the user is "table tennis auctioned" (乒乓球拍卖了), and the Chinese word segmentation tree includes the Chinese words "ping" (乒) - "ping-pong" (乒乓) - "table tennis ball" (乒乓球) - "table tennis racket" (乒乓球拍), "sell" (卖), and "racket" (拍) - "auction" (拍卖). The weight information of "ping" and "ping-pong" is 1 point, that of "table tennis ball" is 5 points, that of "table tennis racket" is 5 points, that of "sell" is 5 points, that of "racket" is 1 point, and that of "auction" is 4 points. The semantic processing system obtains the first recognized words "table tennis racket" and "sell"; the corresponding first word segmentation result data is "table tennis racket / sold", with a score value of 10 points (5 + 5). The semantic processing system also obtains the second recognized words "table tennis ball" and "auction"; the corresponding second word segmentation result data is "table tennis ball / auctioned", with a score value of 9 points (5 + 4). The processor selects the first word segmentation result data, "table tennis racket / sold", which has the highest score value, as the final word segmentation result.
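The scoring in this example can be reproduced with a short sketch. It assumes that a result's score is the plain sum of its words' weights and that a word with no weight in the tree (here the final particle 了) contributes 0; neither rule is spelled out in the source. The function `score` and the Chinese strings are hypothetical reconstructions of the example.

```python
# Weights from the example, keyed by the assumed Chinese words.
weights = {
    "乒": 1,        # "ping"
    "乒乓": 1,      # "ping-pong"
    "乒乓球": 5,    # "table tennis ball"
    "乒乓球拍": 5,  # "table tennis racket"
    "卖": 5,        # "sell"
    "拍": 1,        # "racket"
    "拍卖": 4,      # "auction"
}


def score(words, weights):
    """Sum the weights of the recognized words; unweighted words count as 0."""
    return sum(weights.get(word, 0) for word in words)


candidates = [
    ["乒乓球拍", "卖", "了"],  # first word segmentation result
    ["乒乓球", "拍卖", "了"],  # second word segmentation result
]

scores = [score(words, weights) for words in candidates]
best = max(candidates, key=lambda words: score(words, weights))

print(scores)
print(" / ".join(best))
```

With these weights the first candidate scores 5 + 5 = 10 and the second 5 + 4 = 9, so the first is selected, matching the outcome described above.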

Those skilled in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps have been described above generally in terms of their function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a user terminal, or in a combination of the two. The software module may reside in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention, and is not intended to limit the scope of the invention to the particular embodiments described; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A method for Chinese word segmentation, the method comprising:

the semantic processing system receives text data to be identified;

when the next Chinese character of the current Chinese character does not exist in the text data to be recognized, generating first word segmentation result data according to the first recognized word;

wherein, the Chinese words comprise word integrity marks; when the Chinese word corresponding to the current word data to be identified exists in the Chinese word segmentation tree, the method further comprises the following steps:

when the word integrity mark of the current Chinese word is the integrity mark, generating a word segmentation position mark for the position of the current Chinese character in the text data to be identified;

wherein after the generating first word segmentation result data from the first recognized word, the method further comprises:

2. The method of claim 1, wherein the Chinese words in the Chinese word segmentation tree include weight information.

3. The Chinese word segmentation method of claim 2, wherein the first recognized word and the second recognized word comprise the weight information.

4. The Chinese word segmentation method of claim 3, wherein, after the generating second word segmentation result data from the second recognized word, the method further comprises:

5. The Chinese word segmentation method of claim 2, wherein, prior to said bringing the word data to be identified into a Chinese word segmentation tree, the method further comprises:

the processor obtains weight information of the Chinese words according to the occurrence frequency of the Chinese words, so that the Chinese words in the Chinese word segmentation tree comprise the weight information.

6. The method of claim 5, wherein before the processor obtains the weight information of the Chinese word according to the occurrence frequency of the Chinese word, the method further comprises:

7. The method of claim 1, wherein determining whether a Chinese word corresponding to the current word data to be identified exists in the Chinese word segmentation tree comprises: