CN108845982A - Chinese word segmentation method based on word association characteristics - Google Patents
Chinese word segmentation method based on word association characteristics
- Publication number
- CN108845982A (application CN201711293044.8A, granted as CN108845982B)
- Authority
- CN
- China
- Prior art keywords
- word
- candidate
- words
- corpus
- quaternary
- Prior art date
- 2017-12-08
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The present invention relates to a Chinese word segmentation method based on word association characteristics, belonging to the technical field of information processing. A text to be processed is selected from a text library, and the text library is preprocessed: symbols are removed and sentences are formed, and a corpus is constructed from the sentences after symbol removal. The corpus is then segmented with a front-and-back word splicing segmentation method to form segmentation fragments. Binary, ternary and quaternary candidate lexicons are built with the binary-, ternary- and quaternary-segmentation front-and-back word splicing methods. A word frequency threshold is set for the counted candidate words and each candidate is judged against it; candidates passing the judgement are retained to form a new corpus.
Description
Technical Field
The invention relates to a Chinese word segmentation method based on word association characteristics, and belongs to the technical field of information processing.
Background
Chinese word segmentation technology belongs to the field of natural language processing. For a given sentence, a person can tell from prior knowledge which character strings are words and which are not, but how can a computer do the same? The process that accomplishes this is the word segmentation algorithm.
Existing word segmentation algorithms fall into three major categories: understanding-based methods, character-string-matching-based methods, and traditional statistics-based methods.
The understanding-based word segmentation method recognizes words by having a computer simulate human understanding of a sentence. The basic idea is to analyze syntax and semantics while segmenting, using syntactic and semantic information to resolve ambiguity. Such a system generally comprises three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the word segmentation subsystem obtains syntactic and semantic information about the words and sentences involved to judge segmentation ambiguity; that is, it simulates how a person understands a sentence. This approach requires a large amount of linguistic knowledge and information. Because of the generality and complexity of Chinese, it is difficult to organize the various kinds of linguistic information into a form a machine can read directly, so existing understanding-based segmentation systems remain at the experimental stage.
The word segmentation method based on character string matching, also called the mechanical word segmentation method, matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string-matching segmentation divides into forward matching and reverse matching; by length preference, into maximum (longest) matching and minimum (shortest) matching; and by whether it is combined with part-of-speech tagging, into pure segmentation methods and integrated methods that combine segmentation and tagging. Commonly used mechanical methods include: (1) forward maximum matching (left to right); (2) reverse maximum matching (right to left); (3) least segmentation (minimizing the number of words cut from each sentence). These can also be combined with each other; for example, forward and reverse maximum matching can be combined into a bidirectional matching method. Because single Chinese characters can themselves be words, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching is slightly more precise than forward matching and encounters less ambiguity. In practical segmentation systems, mechanical segmentation serves as an initial pass, and other linguistic information is used to further improve accuracy. One improvement is to refine the scanning scheme, called feature scanning or mark segmentation: words with obvious features are identified and cut out of the string to be analyzed first, and are used as break points that split the original string into smaller strings for mechanical segmentation, reducing the matching error rate. Another improvement combines segmentation with part-of-speech tagging, using rich part-of-speech information to aid segmentation decisions and checking and adjusting the segmentation result during tagging, which greatly improves precision.
In the string-matching (mechanical) methods above, whether forward maximum matching, reverse maximum matching, or least segmentation, the maximum matching approach tries to maximize the match length between each fragment and the dictionary entries. Its advantages are a simple principle and easy implementation; its defect is that the maximum match length is hard to choose: if it is too large, the time complexity increases, and if it is too small, words exceeding the maximum match length cannot be matched, reducing segmentation accuracy. The guiding principle of maximum matching is "long word first". However, existing maximum matching methods, whether forward or reverse, add or remove characters and perform maximum matching only within a local window of the first or last i characters each time, which does not fully embody the "long word first" principle.
The principle of the traditional statistical word segmentation method is that a word is a stable combination of characters, so the more often adjacent characters co-occur in context, the more likely they are to form a word. The frequency or probability of co-occurrence between a character and its neighbours therefore reflects the credibility of a word. The frequencies of adjacent co-occurring character pairs in the corpus can be counted to compute their co-occurrence information: the co-occurrence information of two characters is defined by the adjacent co-occurrence probability of two Chinese characters X and Y. This mutual information embodies the tightness of the combination between the characters; when it exceeds a certain threshold, the character group is considered likely to form a word. Since this method only counts character-group frequencies in the corpus and needs no dictionary, it is called dictionary-free segmentation or statistical word extraction. Its limitation is that frequently co-occurring character groups that are not words, such as "this", "one", "some", "my" and "many", are often extracted, recognition accuracy for common words is poor, and the space-time overhead is large.
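By way of illustration only, the following sketch computes the adjacent-character co-occurrence statistic described above as pointwise mutual information; the function name, threshold value and corpus handling are assumptions of the example, not part of the patent.

```python
# Illustrative sketch (not from the patent): pointwise mutual information
# (PMI) of adjacent character pairs. Pairs whose PMI exceeds a threshold
# are tight combinations that may form words. The threshold is an assumption.
from collections import Counter
from math import log2

def pmi_pairs(corpus: str, threshold: float = 3.0) -> dict:
    """Return adjacent character pairs whose PMI exceeds `threshold`."""
    chars = Counter(corpus)                    # unigram counts
    pairs = Counter(zip(corpus, corpus[1:]))   # adjacent bigram counts
    n = len(corpus)
    result = {}
    for (x, y), nxy in pairs.items():
        # PMI(X, Y) = log2( P(XY) / (P(X) * P(Y)) )
        pmi = log2((nxy / (n - 1)) / ((chars[x] / n) * (chars[y] / n)))
        if pmi > threshold:
            result[x + y] = pmi
    return result
```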
Disclosure of Invention
The invention aims to provide a Chinese word segmentation method based on word association characteristics, which is used for overcoming the defect that words cannot be effectively identified and extracted from a large-scale corpus in the prior art and realizing the effective identification and extraction of the words in the large-scale corpus by a computer system.
The technical scheme of the invention is as follows: a Chinese word segmentation method based on word association characteristics comprises the following steps:
a. selecting a text to be processed from a text library, preprocessing the text library, including removing symbols and forming sentences, and constructing a corpus by using the sentences after symbol removal;
b. performing word segmentation on the corpus in step a by adopting a word segmentation method of splicing words in front and back, to form word segmentation fragments;
c. forming a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank by adopting a binary segmentation pre-post word splicing method, a ternary segmentation pre-post word splicing method and a quaternary segmentation pre-post word splicing method;
d. performing word frequency statistics on binary candidate words, ternary candidate words and quaternary candidate words in the binary candidate word library, the ternary candidate word library and the quaternary candidate word library;
e. setting a word frequency threshold for the candidate words whose frequencies have been counted, judging each candidate word against the threshold, reserving the candidate words that meet the threshold to form a new corpus, and deleting the candidate words that do not;
f. calculating the degree of freedom and the degree of condensation of the candidate words in the corpus processed in step e, giving a uniform threshold of the degree of freedom and the degree of condensation for all the candidate words, judging against it, reserving the candidate words that meet the judgement, and deleting the candidate words that do not;
g. and further filtering the screened ternary candidate words and quaternary candidate words by adopting a word segmentation filtering method to form a new word bank.
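By way of illustration, steps a to g can be arranged as the pipeline sketched below. This is a reconstruction under stated assumptions, not the patent's reference implementation: the helper functions (ngram_segments, freedom, condensation, filter_ternary, filter_quaternary) are sketched later in this description, and the threshold values are placeholders.

```python
# Illustrative pipeline for steps a-g; a sketch under stated assumptions.
import re

def segment_pipeline(raw_text: str, freq_min=5, dof_min=1.0, coh_min=50.0):
    # a. preprocess: keep only Chinese characters, split into sentences
    sentences = [s for s in re.split(r"[^\u4e00-\u9fff]+", raw_text) if s]
    corpus = "".join(sentences)

    # b/c. front-and-back splicing into binary/ternary/quaternary candidates
    candidates = {n: ngram_segments(sentences, n) for n in (2, 3, 4)}

    # d/e. word-frequency statistics and frequency-threshold filtering
    candidates = {n: {w: f for w, f in c.items() if f >= freq_min}
                  for n, c in candidates.items()}

    # f. degree-of-freedom and degree-of-condensation filtering
    kept = {n: [w for w in c
                if freedom(w, corpus) >= dof_min
                and condensation(w, corpus) >= coh_min]
            for n, c in candidates.items()}

    # g. segmentation filtering of ternary and quaternary candidates
    lexicon = set(kept[2])
    lexicon |= {w for w in kept[3] if filter_ternary(w, kept[2], corpus)}
    lexicon |= {w for w in kept[4] if filter_quaternary(w, kept[2])}
    return lexicon
```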
The front-and-back word splicing method continuously cuts fragments starting from the first character of a Chinese sentence, so that every possible word-forming fragment is cut out, specifically as follows:
For the text content contained in a Chinese text, assume the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a character in the text and n ∈ N;
Performing binary segmentation and splicing on the character sequence with the binary-segmentation front-and-back word splicing method yields the binary text fragment set: {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Performing ternary segmentation and splicing with the ternary-segmentation front-and-back word splicing method yields the ternary text fragment set: {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Performing quaternary segmentation and splicing with the quaternary-segmentation front-and-back word splicing method yields the quaternary text fragment set: {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
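A minimal sketch of this splicing step, assuming a simple sliding window over each sentence; the function name ngram_segments and the use of a Counter are illustrative choices:

```python
# Sketch of front-and-back splicing: slide a window of width n over each
# sentence and keep every n-character fragment (n = 2, 3, 4).
from collections import Counter

def ngram_segments(sentences, n: int) -> Counter:
    """Count all n-character fragments across the corpus."""
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1   # fragment (a_i ... a_{i+n-1})
    return counts

# e.g. ngram_segments(["中华人民共和国"], 2) yields
# {'中华': 1, '华人': 1, '人民': 1, '民共': 1, '共和': 1, '和国': 1}
```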
The degree of freedom is defined as follows: when a text fragment appears in many different contexts, it has a left-neighbour set (the set of characters adjacent to the fragment on the left) and a right-neighbour set (the set of characters adjacent to the fragment on the right). The information entropy of each set is computed, and the smaller of the two entropies is taken as the fragment's degree of freedom.
In the obtained text fragment set, when a text fragment can appear in many different contexts with both a left-neighbour set and a right-neighbour set, the information entropy H of the fragment is obtained from the entropies of the left- and right-neighbour sets, i.e. H = min{s′, s″}, where H represents the degree of freedom of the candidate word, s′ represents the right entropy of the candidate word, and s″ is the left entropy of the candidate word; the smaller of the two entropies is taken as the degree of freedom.
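The degree-of-freedom computation can be sketched as below, assuming the corpus is one long string that is scanned for each candidate; all names are illustrative:

```python
# Sketch of the degree of freedom: collect the left- and right-neighbour
# sets of a candidate, compute each set's entropy, take the smaller one,
# i.e. H = min{s', s''}.
from collections import Counter
from math import log2

def _entropy(neighbours: Counter) -> float:
    total = sum(neighbours.values())
    return -sum((c / total) * log2(c / total) for c in neighbours.values())

def freedom(word: str, corpus: str) -> float:
    left, right = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1     # character adjacent on the left
        end = start + len(word)
        if end < len(corpus):
            right[corpus[end]] += 1          # character adjacent on the right
        start = corpus.find(word, start + 1)
    if not left or not right:
        return 0.0
    return min(_entropy(left), _entropy(right))   # H = min{s', s''}
```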
The degree of condensation reflects the fact that, in a text, a new word appears on its own with a higher probability than the product of the probabilities of its component parts, i.e. P(AB) > P(A)·P(B). Let M = P(AB) / (P(A)·P(B)) and take the minimum M over the possible splits as the degree of condensation, where AB represents a new word, P(AB) represents the probability of the new word appearing in the text, A and B respectively represent the component strings, and P(A) and P(B) respectively represent the probabilities of the component strings appearing in the text.
The condensation degree of a candidate word is obtained statistically by calculating the ratio of the candidate word's joint probability to the product of the independent probabilities of its components in the corpus, in the following specific steps:
(1) The condensation degree M_2 of a binary candidate word is obtained as the ratio of the candidate word's probability to the product of its component probabilities:
M_2 = p(i, i+1) / (s_i · s_{i+1}), with s_i = N_i / N, s_{i+1} = N_{i+1} / N, p(i, i+1) = N_{i,i+1} / N,
where M_2 represents the condensation degree of the candidate word, s_i the probability that the first character of the binary candidate appears in the corpus, N_i the number of occurrences of the first character of the binary candidate in the corpus, N_{i+1} the number of occurrences of the second character, N_{i,i+1} the number of occurrences of the binary candidate itself, N the total number of characters in the corpus, s_{i+1} the probability that the second character appears in the corpus, and p(i, i+1) the probability that the binary candidate appears in the corpus;
(2) The condensation degree M_3 of a ternary candidate word is obtained from the ratio of the candidate word's probability to the combined probabilities of its two possible splits, taking the smaller value:
M_3 = min{ P(i, i+1, i+2) / (s_i · s_{i+1,i+2}), P(i, i+1, i+2) / (s_{i,i+1} · s_{i+2}) },
where M_3 represents the condensation degree of the candidate word, s_i the probability that the first character of the ternary candidate appears in the corpus, s_{i+1,i+2} the probability that the last two characters appear together in the corpus, s_{i,i+1} the probability that the first two characters appear together, s_{i+2} the probability that the last character appears in the corpus, N_i the occurrences of the first character, N_{i+2} the occurrences of the third character, N_{i+1,i+2} the occurrences of the last two characters, N_{i,i+1} the occurrences of the first two characters, N_{i,i+1,i+2} the occurrences of the ternary candidate, and P(i, i+1, i+2) the probability that the ternary candidate appears in the corpus;
(3) The condensation degree M_4 of a quaternary candidate word is obtained from the ratio of the candidate word's probability to the combined probabilities of its three possible splits, taking the smallest value:
M_4 = min{ P(i, i+1, i+2, i+3) / (s_i · s_{i+1,i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1} · s_{i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1,i+2} · s_{i+3}) },
where M_4 represents the condensation degree of the candidate word, s_i the probability that the first character of the quaternary candidate appears in the corpus, s_{i+1,i+2,i+3} the probability that the last three characters appear together in the corpus, s_{i,i+1,i+2} the probability that the first three characters appear together, s_{i,i+1} the probability that the first two characters appear together, s_{i+2,i+3} the probability that the last two characters appear together, N_i the occurrences of the first character, N_{i+3} the occurrences of the fourth character, N_{i,i+1} the occurrences of the first two characters, N_{i+2,i+3} the occurrences of the last two characters, N_{i+1,i+2,i+3} the occurrences of the last three characters, N_{i,i+1,i+2} the occurrences of the first three characters, and P(i, i+1, i+2, i+3) the probability that the quaternary candidate appears in the corpus.
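Since the degree of condensation is the minimum probability ratio over the splits of a candidate, M_2, M_3 and M_4 can be realized by one function that iterates over every binary split; counting by substring search is a simplifying assumption of this sketch:

```python
# Sketch of the condensation degree: for every split A|B of the candidate,
# take P(AB) / (P(A) * P(B)) and keep the minimum ratio over all splits.
def condensation(word: str, corpus: str) -> float:
    n = len(corpus)
    p_word = corpus.count(word) / n            # P(candidate)
    ratios = []
    for k in range(1, len(word)):              # every split A|B
        p_a = corpus.count(word[:k]) / n       # P(A), front part
        p_b = corpus.count(word[k:]) / n       # P(B), back part
        ratios.append(p_word / (p_a * p_b))    # P(AB) / (P(A)·P(B))
    return min(ratios)                         # smallest M is the degree
```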
The filtering method of the ternary candidate words is as follows: for a ternary candidate word, if its last two characters exist in the binary candidate lexicon, judge whether its first character forms a word with the character adjacent on the left; if its first two characters exist in the binary candidate lexicon, judge whether its last character forms a word with the character adjacent on the right; this determines whether the ternary candidate is a genuine word;
If (A_i, A_{i+1}) belongs to the binary candidate word set and (A_{i-2}, A_{i-1}) does not, or (A_{i-1}, A_i) belongs to the binary candidate word set and (A_{i+1}, A_{i+2}) does not, then (A_{i-1} A_i A_{i+1}) is taken to be a ternary word;
wherein (A_{i-1} A_i A_{i+1}) is the ternary candidate word, A_{i-2} is the left-neighbour character of the ternary candidate word, A_{i+2} is the right-neighbour character of the ternary candidate word, {A_0 ... A_i ... A_N} is the corpus character set, and {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})} is the binary candidate word set.
The filtering method of the quaternary candidate words is as follows: a quaternary candidate that may form a word is first split into two fragments, the first two characters and the last two characters, and each fragment is matched against the segmented binary lexicon; candidates whose fragments match are kept as pre-selected words. The middle two characters of the quaternary candidate are then matched against the binary lexicon, and only candidates whose middle pair does not match are kept. A candidate satisfying both conditions is taken as a segmentation result:
{(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} ⊂ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} and (A_{i-1} A_i) ∉ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})};
wherein (A_{i-2} A_{i-1} A_i A_{i+1}) denotes the quaternary candidate word, {(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} denotes the front and back character pairs into which the candidate is split, {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} denotes the binary lexicon, and (A_{i-1} A_i) denotes the middle character pair of the quaternary candidate word.
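The two filters can be sketched as follows. The membership tests are reconstructed from the textual conditions above; where the original formulas were rendered as images and lost, the exact tests are assumptions:

```python
# Sketch of the ternary and quaternary segmentation filters.
def filter_ternary(word, binary_lexicon, corpus) -> bool:
    """Keep a 3-char candidate when one end pair is a binary word and the
    outer character does not bind the remaining character into a word."""
    first2, last2 = word[:2], word[1:]
    keep = False
    for pos in _occurrences(word, corpus):
        left = corpus[pos - 1] if pos > 0 else ""            # left neighbour
        right = corpus[pos + 3] if pos + 3 < len(corpus) else ""  # right neighbour
        if last2 in binary_lexicon and (left + word[0]) not in binary_lexicon:
            keep = True
        if first2 in binary_lexicon and (word[2] + right) not in binary_lexicon:
            keep = True
    return keep

def filter_quaternary(word, binary_lexicon) -> bool:
    """Keep a 4-char candidate when its front and back halves are binary
    words and its middle pair is not."""
    return (word[:2] in binary_lexicon
            and word[2:] in binary_lexicon
            and word[1:3] not in binary_lexicon)

def _occurrences(word, corpus):
    pos = corpus.find(word)
    while pos != -1:
        yield pos
        pos = corpus.find(word, pos + 1)
```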
The invention has the following beneficial effects: the method is highly accurate and effective, the system segments words efficiently, and the ternary and quaternary segmentation method with condensation degree and degree of freedom designed by the invention overcomes the problems of traditional statistics-based word segmentation methods.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
Example 1: as shown in fig. 1, a Chinese word segmentation method based on word association features:
a. selecting a text to be processed from a text library, preprocessing the text library, including removing symbols and forming sentences, and constructing a corpus by using the sentences after symbol removal;
b. performing word segmentation on the corpus in step a by adopting a word segmentation method of splicing words in front and back, to form word segmentation fragments;
c. forming a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank by adopting a binary segmentation pre-post word splicing method, a ternary segmentation pre-post word splicing method and a quaternary segmentation pre-post word splicing method;
d. performing word frequency statistics on binary candidate words, ternary candidate words and quaternary candidate words in the binary candidate word library, the ternary candidate word library and the quaternary candidate word library;
e. setting a word frequency threshold for the candidate words whose frequencies have been counted, judging each candidate word against the threshold, reserving the candidate words that meet the threshold to form a new corpus, and deleting the candidate words that do not;
f. calculating the degree of freedom and the degree of condensation of the candidate words in the corpus processed in step e, giving a uniform threshold of the degree of freedom and the degree of condensation for all the candidate words, judging against it, reserving the candidate words that meet the judgement, and deleting the candidate words that do not;
g. and further filtering the screened ternary candidate words and quaternary candidate words by adopting a word segmentation filtering method to form a new word bank.
The front-and-back word splicing method continuously cuts fragments starting from the first character of a Chinese sentence, so that every possible word-forming fragment is cut out, specifically as follows:
For the text content contained in a Chinese text, assume the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a character in the text and n ∈ N;
Performing binary segmentation and splicing on the character sequence with the binary-segmentation front-and-back word splicing method yields the binary text fragment set: {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Performing ternary segmentation and splicing with the ternary-segmentation front-and-back word splicing method yields the ternary text fragment set: {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Performing quaternary segmentation and splicing with the quaternary-segmentation front-and-back word splicing method yields the quaternary text fragment set: {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
The degree of freedom is defined as follows: when a text fragment appears in many different contexts, it has a left-neighbour set (the set of characters adjacent to the fragment on the left) and a right-neighbour set (the set of characters adjacent to the fragment on the right). The information entropy of each set is computed, and the smaller of the two entropies is taken as the fragment's degree of freedom.
In the obtained text fragment set, the left-neighbour set of a text fragment is the set of characters that appear immediately to its left; for example, the fragment (a_{i+1} a_{i+2}) in the character sequence {a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}} has the left-neighbour set {a_i}. The right-neighbour set of a text fragment is the set of characters that appear immediately to its right; the same fragment (a_{i+1} a_{i+2}) has the right-neighbour set {a_{i+3}}.
In the obtained text fragment set, when a text fragment can appear in many different contexts with both a left-neighbour set and a right-neighbour set, the information entropy H of the fragment is obtained from the entropies of the left- and right-neighbour sets, i.e. H = min{s′, s″}, where H represents the degree of freedom of the candidate word, s′ represents the right entropy of the candidate word, and s″ is the left entropy of the candidate word; the smaller of the two entropies is taken as the degree of freedom.
The degree of freedom of the candidate word is obtained by computing the information entropies of the candidate's left- and right-neighbour sets and selecting the smaller entropy:
H = min{s′, s″}
H represents the degree of freedom of the candidate word, and s′ represents the right entropy of the candidate word:
s′ = −Σ_{i=1..k} (n_{b_i} / Σ_{j=1..k} n_{b_j}) · log(n_{b_i} / Σ_{j=1..k} n_{b_j}),
where b_i belongs to the right-neighbour set of the candidate word, n_{b_i} denotes the frequency with which b_i appears on the right side of the candidate word, and k represents the number of distinct characters in the right-neighbour set; s″ is the left entropy of the candidate word:
s″ = −Σ_{i=1..M} (n_{m_i} / Σ_{j=1..M} n_{m_j}) · log(n_{m_i} / Σ_{j=1..M} n_{m_j}),
where m_i belongs to the left-neighbour set of the candidate word, n_{m_i} denotes the frequency with which m_i appears on the left side of the candidate word, and M represents the number of distinct characters in the left-neighbour set.
The degree of condensation reflects the fact that, in a text, a new word appears on its own with a higher probability than the product of the probabilities of its component parts, i.e. P(AB) > P(A)·P(B). Let M = P(AB) / (P(A)·P(B)) and take the minimum M over the possible splits as the degree of condensation, where AB represents a new word, P(AB) represents the probability of the new word appearing in the text, A and B respectively represent the component strings, and P(A) and P(B) respectively represent the probabilities of the component strings appearing in the text.
The condensation degree of a candidate word is obtained statistically by calculating the ratio of the candidate word's joint probability to the product of the independent probabilities of its components in the corpus, in the following specific steps:
(1) The condensation degree M_2 of a binary candidate word is obtained as the ratio of the candidate word's probability to the product of its component probabilities:
M_2 = p(i, i+1) / (s_i · s_{i+1}), with s_i = N_i / N, s_{i+1} = N_{i+1} / N, p(i, i+1) = N_{i,i+1} / N,
where M_2 represents the condensation degree of the candidate word, s_i the probability that the first character of the binary candidate appears in the corpus, N_i the number of occurrences of the first character of the binary candidate in the corpus, N_{i+1} the number of occurrences of the second character, N_{i,i+1} the number of occurrences of the binary candidate itself, N the total number of characters in the corpus, s_{i+1} the probability that the second character appears in the corpus, and p(i, i+1) the probability that the binary candidate appears in the corpus;
(2) The condensation degree M_3 of a ternary candidate word is obtained from the ratio of the candidate word's probability to the combined probabilities of its two possible splits, taking the smaller value:
M_3 = min{ P(i, i+1, i+2) / (s_i · s_{i+1,i+2}), P(i, i+1, i+2) / (s_{i,i+1} · s_{i+2}) },
where M_3 represents the condensation degree of the candidate word, s_i the probability that the first character of the ternary candidate appears in the corpus, s_{i+1,i+2} the probability that the last two characters appear together in the corpus, s_{i,i+1} the probability that the first two characters appear together, s_{i+2} the probability that the last character appears in the corpus, N_i the occurrences of the first character, N_{i+2} the occurrences of the third character, N_{i+1,i+2} the occurrences of the last two characters, N_{i,i+1} the occurrences of the first two characters, N_{i,i+1,i+2} the occurrences of the ternary candidate, and P(i, i+1, i+2) the probability that the ternary candidate appears in the corpus;
(3) The condensation degree M_4 of a quaternary candidate word is obtained from the ratio of the candidate word's probability to the combined probabilities of its three possible splits, taking the smallest value:
M_4 = min{ P(i, i+1, i+2, i+3) / (s_i · s_{i+1,i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1} · s_{i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1,i+2} · s_{i+3}) },
where M_4 represents the condensation degree of the candidate word, s_i the probability that the first character of the quaternary candidate appears in the corpus, s_{i+1,i+2,i+3} the probability that the last three characters appear together in the corpus, s_{i,i+1,i+2} the probability that the first three characters appear together, s_{i,i+1} the probability that the first two characters appear together, s_{i+2,i+3} the probability that the last two characters appear together, N_i the occurrences of the first character, N_{i+3} the occurrences of the fourth character, N_{i,i+1} the occurrences of the first two characters, N_{i+2,i+3} the occurrences of the last two characters, N_{i+1,i+2,i+3} the occurrences of the last three characters, N_{i,i+1,i+2} the occurrences of the first three characters, and P(i, i+1, i+2, i+3) the probability that the quaternary candidate appears in the corpus.
The filtering method of the ternary candidate words is as follows: for a ternary candidate word, if its last two characters exist in the binary candidate lexicon, judge whether its first character forms a word with the character adjacent on the left; if its first two characters exist in the binary candidate lexicon, judge whether its last character forms a word with the character adjacent on the right; this determines whether the ternary candidate is a genuine word;
If (A_i, A_{i+1}) belongs to the binary candidate word set and (A_{i-2}, A_{i-1}) does not, or (A_{i-1}, A_i) belongs to the binary candidate word set and (A_{i+1}, A_{i+2}) does not, then (A_{i-1} A_i A_{i+1}) is taken to be a ternary word;
wherein (A_{i-1} A_i A_{i+1}) is the ternary candidate word, A_{i-2} is the left-neighbour character of the ternary candidate word, A_{i+2} is the right-neighbour character of the ternary candidate word, {A_0 ... A_i ... A_N} is the corpus character set, and {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})} is the binary candidate word set.
The filtering method of the quaternary candidate words is as follows: a quaternary candidate that may form a word is first split into two fragments, the first two characters and the last two characters, and each fragment is matched against the segmented binary lexicon; candidates whose fragments match are kept as pre-selected words. The middle two characters of the quaternary candidate are then matched against the binary lexicon, and only candidates whose middle pair does not match are kept. A candidate satisfying both conditions is taken as a segmentation result:
{(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} ⊂ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} and (A_{i-1} A_i) ∉ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})};
wherein (A_{i-2} A_{i-1} A_i A_{i+1}) denotes the quaternary candidate word, {(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} denotes the front and back character pairs into which the candidate is split, {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} denotes the binary lexicon, and (A_{i-1} A_i) denotes the middle character pair of the quaternary candidate word.
Example 2: as shown in fig. 1, a text to be processed is selected from a text library, and the text library is preprocessed, including removing symbols and forming sentences; a corpus is constructed from the sentences after symbol removal.
The corpus from step a is then segmented using the front-and-back word splicing segmentation method to form segmentation fragments.
And a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank are formed by adopting a binary segmentation front-and-back word splicing method, a ternary segmentation front-and-back word splicing method and a quaternary segmentation front-and-back word splicing method.
A word frequency threshold is set for the candidate words whose frequencies have been counted; each candidate is judged against it, and those passing the judgement are retained to form a new corpus.
The degree of condensation and the degree of freedom of the candidate words are then counted. In this embodiment, the condensation degree of a candidate word may be obtained by calculating the ratio of its joint probability to the product of its independent component probabilities in the corpus; the degree of freedom may be obtained by computing the information entropies of the candidate's left- and right-neighbour sets and selecting the smaller entropy as the degree of freedom.
And comparing the degree of condensation of the candidate words and the degree of freedom of the candidate words with a set threshold value.
And extracting candidate words larger than a threshold value to serve as a candidate word library.
In this example, the text of "Journey to the West", one of the four great classical Chinese novels, was collected. In the counted lexicon, if the word-forming text fragments are sufficiently distributed, their degree of condensation is higher and their degree of freedom larger than those of non-word-forming fragments. If the left and right neighbours of a word are regarded as random variables, the information entropies of the word's left- and right-neighbour sets reflect the randomness of its neighbours: the larger the entropy, the richer the left- or right-neighbour set of the word, and the smaller of the two entropies is taken as the degree of freedom.
In this embodiment, word-forming fragments have a higher degree of condensation, indicating a tighter relationship between their characters; when calculating the degree of condensation, the smaller value over the splits is taken as the final degree of condensation.
The lexicon screened by the degree of freedom and the degree of condensation serves as the candidate lexicon; the ternary and quaternary candidate lexicons are then processed with the ternary and quaternary word segmentation filtering methods, and the final lexicon is obtained.
The ternary and quaternary word segmentation filtering methods remove candidates that are statistically plausible but are not actual words, thereby improving the validity of the ternary and quaternary lexicons.
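Under the same assumptions as the pipeline sketch in the disclosure above, the whole procedure of this embodiment could be invoked as follows (the file name and threshold values are illustrative, not the patent's):

```python
# Illustrative usage of segment_pipeline from the sketch above;
# file name and thresholds are assumptions.
with open("xiyouji.txt", encoding="utf-8") as f:
    lexicon = segment_pipeline(f.read(), freq_min=10, dof_min=1.5, coh_min=100.0)
print(len(lexicon), "entries in the extracted lexicon")
```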
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.
Claims (8)
1. A Chinese word segmentation method based on word association characteristics is characterized in that:
a. selecting a text to be processed from a text library, preprocessing the text library, including removing symbols and forming sentences, and constructing a corpus by using the sentences after symbol removal;
b. performing word segmentation on the corpus in step a by adopting a word segmentation method of splicing words in front and back, to form word segmentation fragments;
c. forming a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank by adopting a binary segmentation pre-post word splicing method, a ternary segmentation pre-post word splicing method and a quaternary segmentation pre-post word splicing method;
d. performing word frequency statistics on binary candidate words, ternary candidate words and quaternary candidate words in the binary candidate word library, the ternary candidate word library and the quaternary candidate word library;
e. setting a word frequency threshold for the candidate words whose frequencies have been counted, judging each candidate word against the threshold, reserving the candidate words that meet the threshold to form a new corpus, and deleting the candidate words that do not;
f. calculating the degree of freedom and the degree of condensation of the candidate words in the corpus processed in step e, giving a uniform threshold of the degree of freedom and the degree of condensation for all the candidate words, judging against it, reserving the candidate words that meet the judgement, and deleting the candidate words that do not;
g. and further filtering the screened ternary candidate words and quaternary candidate words by adopting a word segmentation filtering method to form a new word bank.
2. The method of claim 1, wherein the method comprises: the front-and-back word splicing method continuously cuts fragments starting from the first character of a Chinese sentence, cutting out every possible word-forming fragment, specifically as follows:
For the text content contained in a Chinese text, assume the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a character in the text and n ∈ N;
Performing binary segmentation and splicing on the character sequence with the binary-segmentation front-and-back word splicing method yields the binary text fragment set: {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Performing ternary segmentation and splicing with the ternary-segmentation front-and-back word splicing method yields the ternary text fragment set: {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Performing quaternary segmentation and splicing with the quaternary-segmentation front-and-back word splicing method yields the quaternary text fragment set: {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
3. The method of claim 1, wherein the method comprises: the degree of freedom is defined as follows: when a text fragment appears in many different contexts, it has a left-neighbour set (the set of characters adjacent to the fragment on the left) and a right-neighbour set (the set of characters adjacent to the fragment on the right); the information entropy of each set is computed, and the smaller of the two entropies is taken as the fragment's degree of freedom.
4. The method of claim 3, wherein the method comprises: in the obtained text fragment set, when a text fragment can appear in many different contexts with both a left-neighbour set and a right-neighbour set, the information entropy H of the fragment is obtained from the entropies of the left- and right-neighbour sets, i.e. H = min{s′, s″}, where H represents the degree of freedom of the candidate word, s′ represents the right entropy of the candidate word, and s″ is the left entropy of the candidate word; the smaller of the two entropies is taken as the degree of freedom.
5. The method of claim 1, wherein the method comprises: the degree of condensation reflects that, in a text, a new word appears on its own with a higher probability than the product of the probabilities of its component parts, i.e. P(AB) > P(A)·P(B); let M = P(AB) / (P(A)·P(B)) and take the minimum M over the possible splits as the degree of condensation, where AB represents a new word, P(AB) represents the probability of the new word appearing in the text, A and B respectively represent the component strings, and P(A) and P(B) respectively represent the probabilities of the component strings appearing in the text.
6. The method of claim 1, wherein the method comprises: the condensation degree of a candidate word is obtained statistically by calculating the ratio of the candidate word's joint probability to the product of the independent probabilities of its components in the corpus, in the following specific steps:
(1) obtaining the condensation degree M_2 of the binary candidate words as the ratio of the candidate word's probability to the product of its component probabilities:
M_2 = p(i, i+1) / (s_i · s_{i+1}), with s_i = N_i / N, s_{i+1} = N_{i+1} / N, p(i, i+1) = N_{i,i+1} / N,
wherein M_2 represents the condensation degree of the candidate word, s_i the probability that the first character of the binary candidate appears in the corpus, N_i the number of occurrences of the first character of the binary candidate in the corpus, N_{i+1} the number of occurrences of the second character, N_{i,i+1} the number of occurrences of the binary candidate itself, N the total number of characters in the corpus, s_{i+1} the probability that the second character appears in the corpus, and p(i, i+1) the probability that the binary candidate appears in the corpus;
(2) obtaining the condensation degree M_3 of the ternary candidate words from the ratio of the candidate word's probability to the combined probabilities of its two possible splits, taking the smaller value:
M_3 = min{ P(i, i+1, i+2) / (s_i · s_{i+1,i+2}), P(i, i+1, i+2) / (s_{i,i+1} · s_{i+2}) },
wherein M_3 represents the condensation degree of the candidate word, s_i the probability that the first character of the ternary candidate appears in the corpus, s_{i+1,i+2} the probability that the last two characters appear together in the corpus, s_{i,i+1} the probability that the first two characters appear together, s_{i+2} the probability that the last character appears in the corpus, N_i the occurrences of the first character, N_{i+2} the occurrences of the third character, N_{i+1,i+2} the occurrences of the last two characters, N_{i,i+1} the occurrences of the first two characters, N_{i,i+1,i+2} the occurrences of the ternary candidate, and P(i, i+1, i+2) the probability that the ternary candidate appears in the corpus;
(3) obtaining the condensation degree M_4 of the quaternary candidate words from the ratio of the candidate word's probability to the combined probabilities of its three possible splits, taking the smallest value:
M_4 = min{ P(i, i+1, i+2, i+3) / (s_i · s_{i+1,i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1} · s_{i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1,i+2} · s_{i+3}) },
wherein M_4 represents the condensation degree of the candidate word, s_i the probability that the first character of the quaternary candidate appears in the corpus, s_{i+1,i+2,i+3} the probability that the last three characters appear together in the corpus, s_{i,i+1,i+2} the probability that the first three characters appear together, s_{i,i+1} the probability that the first two characters appear together, s_{i+2,i+3} the probability that the last two characters appear together, N_i the occurrences of the first character, N_{i+3} the occurrences of the fourth character, N_{i,i+1} the occurrences of the first two characters, N_{i+2,i+3} the occurrences of the last two characters, N_{i+1,i+2,i+3} the occurrences of the last three characters, N_{i,i+1,i+2} the occurrences of the first three characters, and P(i, i+1, i+2, i+3) the probability that the quaternary candidate appears in the corpus.
7. The method of claim 1, wherein the method comprises: the filtering method of the ternary candidate words is as follows: for a ternary candidate word, if its last two characters exist in the binary candidate lexicon, judge whether its first character forms a word with the character adjacent on the left; if its first two characters exist in the binary candidate lexicon, judge whether its last character forms a word with the character adjacent on the right; this determines whether the ternary candidate is a genuine word;
If (A_i, A_{i+1}) belongs to the binary candidate word set and (A_{i-2}, A_{i-1}) does not, or (A_{i-1}, A_i) belongs to the binary candidate word set and (A_{i+1}, A_{i+2}) does not, then (A_{i-1} A_i A_{i+1}) is taken to be a ternary word;
wherein (A_{i-1} A_i A_{i+1}) is the ternary candidate word, A_{i-2} is the left-neighbour character of the ternary candidate word, A_{i+2} is the right-neighbour character of the ternary candidate word, {A_0 ... A_i ... A_N} is the corpus character set, and {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})} is the binary candidate word set.
8. The method of claim 1, wherein the method comprises: the filtering method of the quaternary candidate words is as follows: a quaternary candidate that may form a word is first split into two fragments, the first two characters and the last two characters, and each fragment is matched against the segmented binary lexicon; candidates whose fragments match are kept as pre-selected words; the middle two characters of the quaternary candidate are then matched against the binary lexicon, and only candidates whose middle pair does not match are kept; a candidate satisfying both conditions is taken as a segmentation result:
{(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} ⊂ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} and (A_{i-1} A_i) ∉ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})};
wherein (A_{i-2} A_{i-1} A_i A_{i+1}) denotes the quaternary candidate word, {(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} denotes the front and back character pairs into which the candidate is split, {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} denotes the binary lexicon, and (A_{i-1} A_i) denotes the middle character pair of the quaternary candidate word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711293044.8A (granted as CN108845982B) | 2017-12-08 | 2017-12-08 | Chinese word segmentation method based on word association characteristics
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711293044.8A (granted as CN108845982B) | 2017-12-08 | 2017-12-08 | Chinese word segmentation method based on word association characteristics
Publications (2)
Publication Number | Publication Date |
---|---|
CN108845982A (en) | 2018-11-20
CN108845982B (en) | 2021-08-20
Family
ID=64211732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711293044.8A | Chinese word segmentation method based on word association characteristics (granted as CN108845982B, active) | 2017-12-08 | 2017-12-08
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108845982B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858011A (en) * | 2018-11-30 | 2019-06-07 | 平安科技(深圳)有限公司 | Standard dictionary segmenting method, device, equipment and computer readable storage medium |
CN110287493A (en) * | 2019-06-28 | 2019-09-27 | 中国科学技术信息研究所 | Risk phrase chunking method, apparatus, electronic equipment and storage medium |
CN110287488A (en) * | 2019-06-18 | 2019-09-27 | 上海晏鼠计算机技术股份有限公司 | A kind of Chinese text segmenting method based on big data and Chinese feature |
CN110334345A (en) * | 2019-06-17 | 2019-10-15 | 首都师范大学 | New word discovery method |
CN110442861A (en) * | 2019-07-08 | 2019-11-12 | 万达信息股份有限公司 | A method of Chinese technical term and new word discovery based on real world statistics |
CN111125329A (en) * | 2019-12-18 | 2020-05-08 | 东软集团股份有限公司 | Text information screening method, device and equipment |
CN112711944A (en) * | 2021-01-13 | 2021-04-27 | 深圳前瞻资讯股份有限公司 | Word segmentation method and system and word segmentation device generation method and system |
CN116431930A (en) * | 2023-06-13 | 2023-07-14 | 天津联创科技发展有限公司 | Technological achievement conversion data query method, system, terminal and storage medium |
CN116541527A (en) * | 2023-07-05 | 2023-08-04 | 国网北京市电力公司 | Document classification method based on model integration and data expansion |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622341A (en) * | 2012-04-20 | 2012-08-01 | 北京邮电大学 | Domain ontology concept automatic-acquisition method based on Bootstrapping technology |
CN105488098A (en) * | 2015-10-28 | 2016-04-13 | 北京理工大学 | Field difference based new word extraction method |
CN105955950A (en) * | 2016-04-29 | 2016-09-21 | 乐视控股(北京)有限公司 | New word discovery method and device |
CN106126495A (en) * | 2016-06-16 | 2016-11-16 | 北京捷通华声科技股份有限公司 | A kind of based on large-scale corpus prompter method and apparatus |
CN107180025A (en) * | 2017-03-31 | 2017-09-19 | 北京奇艺世纪科技有限公司 | A kind of recognition methods of neologisms and device |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622341A (en) * | 2012-04-20 | 2012-08-01 | 北京邮电大学 | Domain ontology concept automatic-acquisition method based on Bootstrapping technology |
CN105488098A (en) * | 2015-10-28 | 2016-04-13 | 北京理工大学 | Field difference based new word extraction method |
CN105955950A (en) * | 2016-04-29 | 2016-09-21 | 乐视控股(北京)有限公司 | New word discovery method and device |
CN106126495A (en) * | 2016-06-16 | 2016-11-16 | 北京捷通华声科技股份有限公司 | A kind of based on large-scale corpus prompter method and apparatus |
CN107180025A (en) * | 2017-03-31 | 2017-09-19 | 北京奇艺世纪科技有限公司 | A kind of recognition methods of neologisms and device |
Non-Patent Citations (1)
Title |
---|
王惠仙 (Wang Huixian) et al.: "Research on a Chinese word segmentation algorithm based on improved forward maximum matching", Journal of Guizhou University (Natural Science Edition) *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858011A (en) * | 2018-11-30 | 2019-06-07 | 平安科技(深圳)有限公司 | Standard dictionary segmenting method, device, equipment and computer readable storage medium |
CN110334345A (en) * | 2019-06-17 | 2019-10-15 | 首都师范大学 | New word discovery method |
CN110287488A (en) * | 2019-06-18 | 2019-09-27 | 上海晏鼠计算机技术股份有限公司 | A kind of Chinese text segmenting method based on big data and Chinese feature |
CN110287493A (en) * | 2019-06-28 | 2019-09-27 | 中国科学技术信息研究所 | Risk phrase chunking method, apparatus, electronic equipment and storage medium |
CN110442861B (en) * | 2019-07-08 | 2023-04-07 | 万达信息股份有限公司 | Chinese professional term and new word discovery method based on real world statistics |
CN110442861A (en) * | 2019-07-08 | 2019-11-12 | 万达信息股份有限公司 | A method of Chinese technical term and new word discovery based on real world statistics |
CN111125329A (en) * | 2019-12-18 | 2020-05-08 | 东软集团股份有限公司 | Text information screening method, device and equipment |
CN111125329B (en) * | 2019-12-18 | 2023-07-21 | 东软集团股份有限公司 | Text information screening method, device and equipment |
CN112711944A (en) * | 2021-01-13 | 2021-04-27 | 深圳前瞻资讯股份有限公司 | Word segmentation method and system and word segmentation device generation method and system |
CN112711944B (en) * | 2021-01-13 | 2023-03-10 | 深圳前瞻资讯股份有限公司 | Word segmentation method and system, and word segmentation device generation method and system |
CN116431930A (en) * | 2023-06-13 | 2023-07-14 | 天津联创科技发展有限公司 | Technological achievement conversion data query method, system, terminal and storage medium |
CN116541527A (en) * | 2023-07-05 | 2023-08-04 | 国网北京市电力公司 | Document classification method based on model integration and data expansion |
CN116541527B (en) * | 2023-07-05 | 2023-09-29 | 国网北京市电力公司 | Document classification method based on model integration and data expansion |
Also Published As
Publication number | Publication date |
---|---|
CN108845982B (en) | 2021-08-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |