CN109992776B - Chinese word segmentation method - Google Patents

Publication number
CN109992776B
CN109992776B CN201910231584.6A CN201910231584A CN109992776B CN 109992776 B CN109992776 B CN 109992776B CN 201910231584 A CN201910231584 A CN 201910231584A CN 109992776 B CN109992776 B CN 109992776B
Authority
CN
China
Prior art keywords
chinese
word
recognized
word segmentation
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910231584.6A
Other languages
Chinese (zh)
Other versions
CN109992776A (en)
Inventor
孙晓光
张海风
郝玲风
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bray Tongyun Culture Communication Co ltd
Original Assignee
Beijing Bray Tongyun Culture Communication Co ltd
Application filed by Beijing Bray Tongyun Culture Communication Co ltd
Priority to CN201910231584.6A
Publication of CN109992776A
Application granted
Publication of CN109992776B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a Chinese word segmentation method comprising the following steps: generating word data to be recognized from the current Chinese character; looking up the current word data to be recognized in a Chinese word segmentation tree; when a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, determining whether a next Chinese character follows the current Chinese character in the text data to be recognized; when the next Chinese character exists, updating the word data to be recognized with the next Chinese character and determining whether a Chinese word corresponding to the updated word data to be recognized exists in the Chinese word segmentation tree; when no Chinese word corresponding to the word data to be recognized exists in the Chinese word segmentation tree, generating a first recognized word from the one or more Chinese characters preceding the current Chinese character; and when no next Chinese character follows the current Chinese character in the text data to be recognized, generating first word segmentation result data from the first recognized words.

Description

Chinese word segmentation method
Technical Field
The invention relates to the technical field of data processing, in particular to a Chinese word segmentation method.
Background
Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words, that is, recombining a continuous character sequence into a word sequence according to a given specification. Words in Chinese text are not separated by any formal delimiter, so Chinese text cannot be segmented by delimiter, which makes Chinese word segmentation difficult. Moreover, some sentences admit several segmentation results, and choosing the result that best matches the intended meaning as the final result remains a challenge in the field of Chinese word segmentation.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a Chinese word segmentation method. Using a Chinese word segmentation tree with a tree-shaped hierarchical structure, the method splits the sentences in the text data to be recognized, in character order, into words that each carry a complete meaning, and obtains the final word segmentation result according to the weights of those words, so that the result is more accurate.
To achieve the above object, the present invention provides a Chinese word segmentation method, the method comprising:
a semantic processing system receives text data to be recognized;
extracting Chinese characters from the text data to be recognized in character order, and generating word data to be recognized from the current Chinese character;
looking up the current word data to be recognized in a Chinese word segmentation tree, and determining whether a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree;
when a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, determining, according to the position of the current Chinese character in the text data to be recognized, whether a next Chinese character follows the current Chinese character in the text data to be recognized;
when the next Chinese character exists in the text data to be recognized, extracting the next Chinese character from the text data to be recognized in character order;
updating the word data to be recognized with the next Chinese character, and determining whether a Chinese word corresponding to the updated word data to be recognized exists in the Chinese word segmentation tree;
when no Chinese word corresponding to the word data to be recognized exists in the Chinese word segmentation tree, obtaining from the text data to be recognized, according to the position of the current Chinese character, the one or more Chinese characters preceding the current Chinese character, and generating a first recognized word from those characters;
and determining, according to the position of the current Chinese character in the text data to be recognized, whether a next Chinese character follows the current Chinese character in the text data to be recognized;
and when no next Chinese character follows the current Chinese character in the text data to be recognized, generating first word segmentation result data from the first recognized words.
Preferably, each Chinese word in the Chinese word segmentation tree includes a word integrity mark.
Further preferably, when a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, the method further includes:
determining whether the word integrity mark of the current Chinese word is a complete mark;
and when the word integrity mark of the current Chinese word is a complete mark, generating a word segmentation position mark at the position of the current Chinese character in the text data to be recognized.
Further preferably, after generating the first word segmentation result data from the first recognized word, the method further includes:
determining whether a word segmentation position mark exists in the text data to be recognized;
when a word segmentation position mark exists in the text data to be recognized, intercepting new text data to be recognized from the text data to be recognized according to the word segmentation position mark;
performing word matching on the new text data to be recognized according to the Chinese word segmentation tree to obtain a second recognized word;
and generating second word segmentation result data from the second recognized word.
Further preferably, each Chinese word in the Chinese word segmentation tree includes weight information.
Further preferably, the first recognized word and the second recognized word include the weight information.
Further preferably, after generating the second word segmentation result data from the second recognized word, the method further includes:
obtaining a score value of the first word segmentation result data from the weight information of the first recognized words in the first word segmentation result data, and obtaining a score value of the second word segmentation result data from the weight information of the second recognized words in the second word segmentation result data;
and determining optimal word segmentation result data from the first word segmentation result data and the second word segmentation result data according to their score values.
Further preferably, before the word data to be recognized is looked up in the Chinese word segmentation tree, the method further includes:
a processor in the semantic processing system trains the Chinese word segmentation tree on the Chinese words in a plurality of preset training databases;
the processor obtains weight information for each Chinese word from the occurrence frequency of that word, so that the Chinese words in the Chinese word segmentation tree include the weight information.
Further preferably, before the processor obtains the weight information of a Chinese word from its occurrence frequency, the method further includes:
determining the occurrence frequency of the Chinese word according to the application field priority information corresponding to the preset training database.
Preferably, determining whether a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree specifically includes:
the processor determines whether a Chinese word corresponding to the word data to be recognized exists in the Chinese word segmentation tree according to the correspondence between each parent node and each child node in the Chinese word segmentation tree.
The Chinese word segmentation method provided by the embodiment of the invention uses a Chinese word segmentation tree with a tree-shaped hierarchical structure to split the sentences in the text data to be recognized, in character order, into words that each carry a complete meaning, and obtains the final word segmentation result according to the weights of those words, so that the result is more accurate.
Drawings
Fig. 1 is a flowchart of a Chinese word segmentation method according to an embodiment of the present invention.
Detailed Description
The technical solution of the invention is described in further detail below with reference to the drawing and the embodiments.
An embodiment of the invention provides a Chinese word segmentation method, implemented in a semantic processing system, for segmenting a Chinese sentence into individual words. The flowchart of the method is shown in Fig. 1, and the method comprises the following steps:
Step 101, the semantic processing system receives text data to be recognized;
specifically, the semantic processing system may be understood as a system with sentence input, processing, and output functions. It includes a voice converter, an input queue, a poller, and a processor. When the semantic processing system starts, a listener configured in the system output page is started. The listener loads the configuration files for the voice service, the domain classes, the user configuration files corresponding to each domain, and the sentences that the semantic processing system outputs under specific conditions, and at the same time starts the voice converter, the input queue, the poller, and the processor.
When a user wants a piece of Chinese speech or text to be recognized, the user inputs the data to be recognized into the semantic processing system. The data to be recognized includes voice data to be recognized and text data to be recognized. The voice converter in the semantic processing system receives the data to be recognized and performs speech recognition on the voice data in it to obtain the corresponding text data to be recognized. It then inserts that text data, or the text data to be recognized entered directly by the user, at the tail of the input queue in the semantic processing system.
The poller in the semantic processing system continuously monitors the input queue for new messages, that is, for text data to be recognized entering the queue, and acquires the text data to be recognized at the tail of the input queue.
Step 102, extracting Chinese characters from the text data to be recognized in character order, and generating word data to be recognized from the current Chinese character;
specifically, the processor in the semantic processing system first extracts the Chinese character that appears first in the text data to be recognized, according to character order, and generates the word data to be recognized from that character.
In a specific example, if the text data to be recognized entered by the user is "the vitiligo patient has a cold", the processor first extracts, in character order, the first Chinese character of the text, "white" (白, the first character of 白癜风, "vitiligo"), as the word data to be recognized.
Step 103, the current word data to be recognized is looked up in the Chinese word segmentation tree, and it is determined whether a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree;
specifically, the Chinese word segmentation tree may be understood as a rule set with a tree structure, or as a word stock with a tree structure. Each Chinese word in the Chinese word segmentation tree includes a word integrity mark, which may be understood as a flag indicating whether the current word carries a complete meaning.
Before the word data to be recognized is looked up in the Chinese word segmentation tree, the tree must be trained to determine its structure and the weight information of each word.
More specifically, the processor may be connected to a plurality of preset training databases and train the Chinese word segmentation tree on the Chinese words they contain. Each preset training database corresponds to an application field, and each application field has priority information. According to the application field priority information set by the user, the processor determines how frequently the Chinese words of each application field occur during training. That is, the higher the priority of the application field corresponding to a preset training database, the more frequently the processor draws Chinese words from that database, and therefore the higher the occurrence frequency of those words during training. During training, the processor determines the weight information of each Chinese word from its occurrence frequency, so that the Chinese words in the trained Chinese word segmentation tree include the weight information.
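The training step described above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the patent gives no formula, so the sketch assumes that each occurrence of a word in a database counts `priority` times toward its weight. The names `weighted_frequencies` and `build_tree`, the dictionary node layout, and the sample databases are all hypothetical.

```python
from collections import Counter


def weighted_frequencies(databases):
    """Count word occurrences, scaling each count by the database priority.

    `databases` is a list of (priority, words) pairs. Multiplying by the
    priority is an assumed stand-in for "higher priority, higher frequency".
    """
    freq = Counter()
    for priority, words in databases:
        for word in words:
            freq[word] += priority
    return freq


def new_node():
    # One tree node: child nodes keyed by character, an integrity flag,
    # and the weight of the word that ends at this node (0 if none).
    return {"children": {}, "complete": False, "weight": 0}


def build_tree(freq):
    """Insert every word character by character and store its weight."""
    root = new_node()
    for word, weight in freq.items():
        node = root
        for ch in word:
            node = node["children"].setdefault(ch, new_node())
        node["complete"] = True
        node["weight"] = weight
    return root


# Hypothetical training data: a high-priority medical database and a
# low-priority sports database.
medical = (3, ["白癜风", "感冒", "白癜风"])
sports = (1, ["乒乓球", "乒乓球拍", "感冒"])

freq = weighted_frequencies([medical, sports])
tree = build_tree(freq)
```

With these sample databases, a word that appears in the higher-priority database ends up with a larger weight than a word that appears equally often in the lower-priority one.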
After training the Chinese word segmentation tree, the processor looks up the word data to be recognized in the trained tree and determines whether a corresponding Chinese word exists in it. If a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, step 104 below is performed; if not, step 105 below is performed.
In some preferred embodiments, the processor searches for the Chinese word according to the tree structure of the Chinese word segmentation tree, that is, the correspondence between each parent node and its child nodes. If the previous lookup found a Chinese word corresponding to the word data to be recognized, the next lookup, for the updated word data to be recognized, only needs to search the child nodes of the node found previously, which saves processing time.
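The parent-to-child lookup described above behaves like a descent through a prefix tree (trie). The sketch below is a minimal illustration, assuming a `Node` class and helper names (`insert`, `descend`) that are not in the patent; the sample vocabulary is also an assumption. Each new character is looked up only among the children of the node reached so far, rather than from the root.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # character -> child Node
    complete: bool = False                        # word integrity mark


def insert(root, word):
    """Add a word to the tree, creating intermediate nodes as needed."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.complete = True


def descend(node, ch):
    """Look for `ch` only among the children of the node found last time."""
    return node.children.get(ch)


root = Node()
for w in ["白", "白癜风", "患者"]:
    insert(root, w)

# Feed the characters one at a time, continuing from the previous node.
node, matched = root, ""
for ch in "白癜风患":
    nxt = descend(node, ch)
    if nxt is None:  # no child for this character: the match ends here
        break
    node, matched = nxt, matched + ch

print(matched)  # 白癜风
```

Because the search resumes from the previously matched node, each step costs one dictionary lookup regardless of how many words the tree holds.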
Step 104, determining the word integrity mark of the current Chinese word;
specifically, when a Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, the processor determines whether the word integrity mark of the current Chinese word is a complete mark. That is, the processor needs to determine whether the current word data to be recognized carries a complete meaning. Note that a Chinese word whose integrity mark is a complete mark is not necessarily a child node at the very end of the Chinese word segmentation tree.
In a specific example, the Chinese word segmentation tree includes the three-level chain "white" (白) - "white patch" (白癜) - "vitiligo" (白癜风), in which the integrity marks of "white" and "vitiligo" are complete marks and the integrity mark of "white patch" is an incomplete mark. As another example, the tree also includes the four-level chain "ping" (乒) - "ping-pong" (乒乓) - "ping-pong ball" (乒乓球) - "ping-pong racket" (乒乓球拍), in which the integrity marks of "ping" and "ping-pong" are incomplete marks and those of "ping-pong ball" and "ping-pong racket" are complete marks. Although both "ping-pong ball" and "ping-pong racket" carry complete marks, only "ping-pong racket" is a child node at the very end of the Chinese word segmentation tree.
When the word integrity mark of the current Chinese word is a complete mark, the processor generates a word segmentation position mark at the position of the current Chinese character in the text data to be recognized.
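The following sketch illustrates how position marks could be collected while matching. It is an assumption-laden illustration: the `Node` class, the function name `position_marks`, the choice to express a mark as the number of characters consumed so far, and the sample sentence are all hypothetical rather than taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)
    complete: bool = False  # True when the word integrity mark is "complete"


def build_tree(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.complete = True
    return root


def position_marks(root, text, start=0):
    """Return the offsets at which a complete word ends while matching
    from `start`; matching stops at the first character with no child."""
    marks, node = [], root
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.complete:
            marks.append(i + 1)  # mark placed just after the current character
    return marks


root = build_tree(["乒乓球", "乒乓球拍"])
marks = position_marks(root, "乒乓球拍卖完了")
print(marks)  # [3, 4]
```

In this example no mark is produced after 乒 or 乒乓, because those nodes exist in the tree only as incomplete prefixes.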
Step 105, generating a first recognized word from the one or more Chinese characters preceding the current Chinese character;
specifically, when no Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, the processor obtains, according to the position of the current Chinese character in the text data to be recognized, the one or more Chinese characters that precede the current Chinese character, and generates a first recognized word from them. Generating a recognized word may be understood as obtaining a Chinese word with a complete meaning from the text data currently being recognized. The first recognized word may be understood as the word obtained by descending the Chinese word segmentation tree, in the character order of the text data to be recognized, toward the child node at the far end of the tree.
In a specific example, the semantic processing system receives text data to be recognized, entered by the user, that asks what a vitiligo patient with a cold should take. After extracting the first Chinese character "white" (白), the processor obtains the current word data to be recognized, "white". The processor determines that "white" exists in the Chinese word segmentation tree and that a next Chinese character, "Dian" (癜), follows "white" in the text. It extracts "Dian" and updates the word data to be recognized from the previous character "white" and the current character "Dian". It then determines that a next character, "wind" (风), follows "Dian", and by the same process updates the word data to be recognized to "vitiligo" (白癜风). However, after the processor extracts the next character, "suffering from" (患), no Chinese word identical to the current word data to be recognized, "vitiligo" followed by "suffering from", is found in the Chinese word segmentation tree. The processor therefore takes the Chinese characters that precede the current character "suffering from", namely "vitiligo", as the matched characters and generates the first recognized word "vitiligo" from them.
After step 104 or step 105 is performed, step 106 is performed to determine whether a next Chinese character exists in the text data to be recognized, that is, whether all Chinese characters in the text data to be recognized have been extracted.
Step 106, determining whether a next Chinese character exists in the text data to be recognized;
specifically, the processor determines, according to the position of the current Chinese character in the text data to be recognized, whether a next Chinese character follows it. This can also be understood as determining whether all Chinese characters in the text data to be recognized have been extracted.
When a next Chinese character follows the current Chinese character, the Chinese characters in the text data to be recognized have not all been extracted, and step 107 below is performed. When no next Chinese character follows, all Chinese characters have been extracted, and step 108 below is performed.
Step 107, extracting the next Chinese character from the text data to be recognized in character order, and updating the word data to be recognized;
specifically, when a next Chinese character follows the current Chinese character in the text data to be recognized, the semantic processing system extracts it in character order and updates the word data to be recognized with it. The method then returns to step 103: the updated word data to be recognized is looked up in the Chinese word segmentation tree, and it is determined whether a corresponding Chinese word exists. This loop repeats until step 106 determines that no next Chinese character exists in the text data to be recognized, at which point step 108 is performed.
Step 108, generating first word segmentation result data from the first recognized words;
specifically, when no next Chinese character follows the current Chinese character, that is, when all Chinese characters in the text data to be recognized have been extracted, the processor has obtained one or more first recognized words and generates the first word segmentation result data from them. The first word segmentation result data is the word segmentation result corresponding to the first recognized words.
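Steps 102 to 108 together amount to a single forward pass over the text. The sketch below is one possible reading of that loop, not the patent's own code: the `Node` class, the function names, the vocabulary, the sample sentence, and in particular the handling of a character that is absent from the tree (it is emitted as a one-character word) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)
    complete: bool = False


def build_tree(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.complete = True
    return root


def first_pass(root, text):
    """Extend the word data one character at a time (steps 103, 106, 107);
    when the extended data is not in the tree, emit the characters that
    precede the current character as a recognized word (step 105)."""
    words = []
    start, i, node = 0, 0, root
    while i < len(text):
        child = node.children.get(text[i])
        if child is not None:       # word data still found in the tree
            node, i = child, i + 1
            continue
        if i == start:              # assumed: unknown character stands alone
            i += 1
        words.append(text[start:i])
        start, node = i, root       # restart matching at the current character
    if start < len(text):           # text ended in the middle of a match
        words.append(text[start:])
    return words                    # step 108: first word segmentation result


root = build_tree(["白", "白癜风", "患者", "感冒"])
result = first_pass(root, "白癜风患者感冒了")
print(result)  # ['白癜风', '患者', '感冒', '了']
```

Note that the pass keeps extending the word data as long as the tree contains it, so the complete word 白 is passed over in favor of the longer match 白癜风.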
It should be noted that, in some cases, one piece of text data to be recognized may correspond to more than one word segmentation result. For example, the sentence "table tennis auctioned" can be segmented either as "table tennis" followed by "auctioned" or as "table-tennis bat" followed by "sold". Both results are semantically fluent, but only one of them can be matched in subsequent semantic matching. That is, the semantic processing system needs not only to obtain the word segmentation result corresponding to the first recognized words, but also to determine whether the current text data to be recognized has several word segmentation results, that is, whether a second word segmentation result exists. When several results exist, the system selects from them the one most likely to match the intended meaning.
Relative to the first recognized word, the second recognized word may be understood as a word obtained by re-segmenting the text according to a word segmentation position mark.
Therefore, after the first recognized words are generated, step 109 is also performed.
Step 109, determining whether a word segmentation position mark exists in the text data to be recognized;
specifically, when a word segmentation position mark exists in the text data to be recognized, the processor determines that the current text data to be recognized may correspond to several word segmentation results, and step 110 below is performed. When no word segmentation position mark exists in the text data to be recognized, the processor determines that the current text data to be recognized corresponds to only one word segmentation result, and step 113 below is performed.
Step 110, new text data to be recognized is intercepted from the text data to be recognized according to the word segmentation position mark;
specifically, when a word segmentation position mark exists in the text data to be recognized, the processor intercepts the text that follows the mark as new text data to be recognized.
In a specific example, the Chinese word segmentation tree includes the four Chinese words "ping" (乒), "ping-pong" (乒乓), "ping-pong ball" (乒乓球), and "ping-pong racket" (乒乓球拍), in which the integrity marks of "ping" and "ping-pong" are incomplete marks and those of "ping-pong ball" and "ping-pong racket" are complete marks. When the text data to be recognized is "table tennis auctioned" and the processor has extracted the word data to be recognized "ping-pong ball" (table tennis), it determines that the integrity mark of "ping-pong ball" is a complete mark, generates a word segmentation position mark after "ping-pong ball" in the text, and intercepts the text that follows "ping-pong ball" as new text data to be recognized.
Step 111, performing word matching on the new text data to be recognized according to the Chinese word segmentation tree to obtain a second recognized word;
specifically, the processor extracts Chinese characters from the new text data to be recognized in character order, generates word data to be recognized from the current Chinese character, looks up the current word data to be recognized in the Chinese word segmentation tree, and determines whether a corresponding Chinese word exists. When no Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, the processor generates a second recognized word from the one or more Chinese characters preceding the current Chinese character.
Unlike the first recognized words, however, the second recognized words include not only the words obtained by matching the new text data to be recognized against the Chinese word segmentation tree, but also the preceding word whose integrity mark is a complete mark, that is, the word immediately before the new text data to be recognized. For example, if the original text data entered by the user is "table tennis auctioned", the processor generates a word segmentation position mark after "table tennis" and intercepts the text that follows it as new text data to be recognized. Both "table tennis", which precedes the intercepted text, and the words matched within the intercepted text are second recognized words.
Step 112, generating second word segmentation result data according to the second recognized words;
Specifically, once all the Chinese characters in the new text data to be recognized have been extracted, the processor has obtained one or more second recognized words, and it then generates second word segmentation result data from them. The second word segmentation result data is the word segmentation result corresponding to the second recognized words.
After this step is performed, it is necessary to return to step 109 to determine whether there are any other word segmentation position markers in the current new text data to be identified. That is, the second word segmentation result data may be plural, and the processor needs to obtain each word segmentation result.
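One way to picture how a position mark yields an additional word segmentation result is the sketch below. It is a simplification under stated assumptions: it only follows the marks found inside the first word rather than looping back through step 109 for every later mark, unknown characters are emitted as single-character words, and the helper names and Chinese strings are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}     # next character -> child Node
        self.complete = False  # True when the path so far is a complete word


def build_tree(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.complete = True
    return root


def complete_ends(root, text, start):
    """End indices of all complete words that begin at text[start]."""
    ends, node = [], root
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.complete:
            ends.append(i + 1)
    return ends


def forward_segment(root, text, start=0):
    """Greedy longest-match segmentation of text[start:]."""
    words = []
    while start < len(text):
        ends = complete_ends(root, text, start)
        end = ends[-1] if ends else start + 1  # unknown character: emit it alone
        words.append(text[start:end])
        start = end
    return words


def all_results(root, text):
    """First result plus one extra result per earlier position mark."""
    results = [forward_segment(root, text)]
    for mark in complete_ends(root, text, 0)[:-1]:
        # The word before the mark is kept; the remaining text is re-segmented.
        results.append([text[:mark]] + forward_segment(root, text, mark))
    return results


# Hypothetical data: assumed Chinese forms of the words in the running example.
tree = build_tree(["乒乓球", "乒乓球拍", "拍", "卖", "拍卖"])
results = all_results(tree, "乒乓球拍卖")
print(results)  # [['乒乓球拍', '卖'], ['乒乓球', '拍卖']]
```

The first list plays the role of the first word segmentation result data, and the second list plays the role of a second word segmentation result obtained by cutting at the position mark after "table tennis".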
Step 113, determining the optimal word segmentation result data and outputting the data;
Specifically, when the semantic processing system obtains a plurality of word segmentation results for the text data to be recognized, the processor needs to determine the optimal word segmentation result from among them and output it.
More specifically, when the processor finds a Chinese word corresponding to the first recognized word or the second recognized word in the Chinese word segmentation tree, it may determine the weight information of that first or second recognized word from the weight information of the Chinese word in the Chinese word segmentation tree, so that the first recognized word and the second recognized word include weight information. The processor then obtains a score value of the first word segmentation result data according to the weight information of the first recognized words in the first word segmentation result data, obtains a score value of the second word segmentation result data according to the weight information of the second recognized words in the second word segmentation result data, and, by comparing the two score values, determines the word segmentation result data with the highest score value among the first and second word segmentation result data as the optimal word segmentation result data.
It will be appreciated that if there are a plurality of second word segmentation result data, the processor needs to determine a score value for each second word segmentation result data, compare all the obtained word segmentation result data, and determine the word segmentation result data with the highest score value as the optimal word segmentation result data.
In an example of Chinese word segmentation according to the above steps, the text data to be recognized input by the user is "table tennis auctioned", and the Chinese word segmentation tree includes the Chinese words "ping", "ping-pong", "table tennis", "table tennis bat", "sell", "bat" and "auction", where the weight information of "ping" and of "ping-pong" is 1 point, the weight information of "table tennis" is 5 points, the weight information of "table tennis bat" is 5 points, the weight information of "sell" is 5 points, the weight information of "bat" is 1 point, and the weight information of "auction" is 4 points. The semantic processing system obtains the first recognized words "table tennis bat" and "sell"; the corresponding first word segmentation result data is "table tennis bat / sell", and its score value is 10 points. The semantic processing system also obtains the second recognized words "table tennis" and "auction"; the corresponding second word segmentation result data is "table tennis / auction", and its score value is 9 points. The processor therefore selects the first word segmentation result data, "table tennis bat / sell", which has the highest score value, as the final word segmentation result.
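The arithmetic of this example can be checked with a short sketch. The weights are those given in the example, keyed by their English glosses; treating the score value as the plain sum of the word weights is inferred from the totals of 10 and 9, and giving unknown words a weight of 0 is an assumption. The names `score` and `best_result` are hypothetical.

```python
# Weights from the example, keyed by the English glosses used in the text.
weights = {
    "table tennis": 5,
    "table tennis bat": 5,
    "sell": 5,
    "bat": 1,
    "auction": 4,
}

first_result = ["table tennis bat", "sell"]
second_result = ["table tennis", "auction"]


def score(words):
    """Score of a segmentation: sum of the weights of its recognized words.

    Words missing from the weight table count as 0 (an assumption).
    """
    return sum(weights.get(word, 0) for word in words)


def best_result(candidates):
    """Return the candidate with the highest score (the first one wins a tie)."""
    return max(candidates, key=score)


print(score(first_result), score(second_result))   # 10 9
print(best_result([first_result, second_result]))  # ['table tennis bat', 'sell']
```

With more than one second word segmentation result, the same `best_result` call would simply receive a longer list of candidates.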
The Chinese word segmentation method provided by the embodiment of the invention uses a Chinese word segmentation tree with a tree-shaped hierarchical structure to split the sentences in the text data to be identified, following the character sequence, into a plurality of words with complete word meanings, and obtains the final word segmentation result according to the weights of the words, so that the obtained word segmentation result is more accurate.
Those of skill in the art would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a user terminal, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention. It is not intended to limit the scope of the invention or to limit the invention to the particular embodiments, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (7)

1. A method for Chinese word segmentation, the method comprising:
the semantic processing system receives text data to be identified;
extracting Chinese characters in the text data to be recognized according to the character sequence, and generating word data to be recognized according to the current Chinese characters;
bringing the current word data to be identified into a Chinese word segmentation tree, and determining whether a Chinese word corresponding to the current word data to be identified exists in the Chinese word segmentation tree;
when the Chinese word corresponding to the current word data to be recognized exists in the Chinese word segmentation tree, determining whether the next Chinese character of the current Chinese character exists in the text data to be recognized according to the character sequence of the current Chinese character in the text data to be recognized;
extracting the next Chinese character in the text data to be recognized according to the character sequence when the next Chinese character exists in the text data to be recognized;
updating the word data to be identified according to the next Chinese character, and determining whether a Chinese word corresponding to the updated word data to be identified exists in the Chinese word segmentation tree;
when the Chinese word corresponding to the word data to be recognized does not exist in the Chinese word segmentation tree, acquiring one or more Chinese characters of the current Chinese character from the text data to be recognized according to the character sequence of the current Chinese character in the text data to be recognized, and generating a first recognized word according to the one or more Chinese characters of the current Chinese character;
and determining whether a next Chinese character of the current Chinese character exists in the text data to be recognized according to the character sequence of the current Chinese character in the text data to be recognized;
when the next Chinese character of the current Chinese character does not exist in the text data to be recognized, generating first word segmentation result data according to the first recognized word;
wherein, the Chinese words comprise word integrity marks; when the Chinese word corresponding to the current word data to be identified exists in the Chinese word segmentation tree, the method further comprises the following steps:
determining whether the word integrity mark of the current Chinese word is a complete mark;
when the word integrity mark of the current Chinese word is the complete mark, generating a word segmentation position mark for the position of the current Chinese character in the text data to be identified;
wherein after the generating first word segmentation result data from the first recognized word, the method further comprises:
determining whether the word segmentation position marks exist in the text data to be identified;
when the word segmentation position marks exist in the text data to be identified, new text data to be identified is intercepted from the text data to be identified according to the word segmentation position marks;
performing word matching on the new text data to be recognized according to the Chinese word segmentation tree to obtain a second recognized word;
and generating second word segmentation result data according to the second recognized words.
2. The method of claim 1, wherein the Chinese words in the Chinese word segmentation tree include weight information.
3. The Chinese word segmentation method of claim 2, wherein the first recognized word and the second recognized word comprise the weight information.
4. The Chinese word segmentation method of claim 3, wherein, subsequent to the generating second word segmentation result data from the second recognized word, the method further comprises:
obtaining a score value of the first word segmentation result data according to the weight information of the first recognized words in the first word segmentation result data, and obtaining a score value of the second word segmentation result data according to the weight information of the second recognized words in the second word segmentation result data;
and determining optimal word segmentation result data from the first word segmentation result data and the second word segmentation result data according to the score value of the first word segmentation result data and the score value of the second word segmentation result data.
5. The Chinese word segmentation method of claim 2, wherein prior to said bringing the word data to be identified into a Chinese word segmentation tree, the method further comprises:
the processor in the semantic processing system trains the Chinese word segmentation tree according to Chinese words in a plurality of preset training databases;
the processor obtains weight information of the Chinese words according to the occurrence frequency of the Chinese words, so that the Chinese words in the Chinese word segmentation tree comprise the weight information.
6. The method of claim 5, wherein before the processor obtains the weight information of the Chinese word according to the occurrence frequency of the Chinese word, the method further comprises:
and determining the occurrence frequency of the Chinese words according to the application field priority information corresponding to the preset training database.
7. The method of claim 1, wherein determining whether a Chinese word corresponding to the current word data to be identified exists in the Chinese word segmentation tree comprises:
and the processor determines whether the Chinese word corresponding to the word data to be identified exists in the Chinese word segmentation tree according to the corresponding relation between each father node and each son node in the Chinese word segmentation tree.
CN201910231584.6A 2019-03-26 2019-03-26 Chinese word segmentation method Active CN109992776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910231584.6A CN109992776B (en) 2019-03-26 2019-03-26 Chinese word segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910231584.6A CN109992776B (en) 2019-03-26 2019-03-26 Chinese word segmentation method

Publications (2)

Publication Number Publication Date
CN109992776A CN109992776A (en) 2019-07-09
CN109992776B true CN109992776B (en) 2023-07-25

Family

ID=67131447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910231584.6A Active CN109992776B (en) 2019-03-26 2019-03-26 Chinese word segmentation method

Country Status (1)

Country Link
CN (1) CN109992776B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544975B (en) * 2022-12-05 2023-03-10 济南丽阳神州智能科技有限公司 Log format conversion method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458694A (en) * 2008-10-09 2009-06-17 浙江大学 Chinese participle method based on tree thesaurus
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
CN108710671B (en) * 2018-05-16 2020-06-05 北京金堤科技有限公司 Method and device for extracting company name in text
CN108986910B (en) * 2018-07-04 2023-09-05 平安科技(深圳)有限公司 On-line question and answer method, device, computer equipment and storage medium
CN109147793B (en) * 2018-08-17 2020-11-10 南京星邺汇捷网络科技有限公司 Voice data processing method, device and system

Also Published As

Publication number Publication date
CN109992776A (en) 2019-07-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant