CN108845982B - Chinese word segmentation method based on word association characteristics - Google Patents


Info

Publication number: CN108845982B
Application number: CN201711293044.8A
Authority: CN (China)
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Inventors: 龙华, 李康康, 邵玉斌
Current and original assignee: Kunming University of Science and Technology
Other versions: CN108845982A
Other languages: Chinese (zh)
Application filed by Kunming University of Science and Technology; application granted; publication of CN108845982B.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking


Abstract

The invention relates to a Chinese word segmentation method based on word association characteristics, and belongs to the technical field of information processing. A text to be processed is selected from a text library, the text library is preprocessed by removing symbols and forming sentences, and a corpus is constructed from the symbol-free sentences. The corpus of step a is then segmented with the front-and-back character splicing method to form segmentation fragments: the binary, ternary and quaternary front-and-back splicing methods produce a binary, a ternary and a quaternary candidate lexicon respectively. Word frequencies of the candidate words are counted, a word frequency threshold is set, each candidate is judged against it, and the candidates that pass are retained to form a new corpus.

Description

Chinese word segmentation method based on word association characteristics
Technical Field
The invention relates to a Chinese word segmentation method based on word association characteristics, and belongs to the technical field of information processing.
Background
Chinese word segmentation technology belongs to the field of natural language processing. Given a sentence, a person can tell through knowledge which character sequences are words and which are not, but how can a computer do so? The procedure that performs this task is the word segmentation algorithm.
Existing word segmentation algorithms can be divided into three major categories: understanding-based methods, character-string-matching-based methods and traditional statistics-based methods.
The word segmentation method based on understanding achieves the effect of recognizing words by enabling a computer to simulate human understanding of sentences. The basic idea is to analyze syntax and semantics while segmenting words, and to process ambiguity phenomenon by using syntax information and semantic information. It generally comprises three parts: word segmentation subsystem, syntax semantic subsystem, and master control part. Under the coordination of the master control part, the word segmentation subsystem can obtain syntactic and semantic information of related words, sentences and the like to judge word segmentation ambiguity, namely the word segmentation subsystem simulates the process of understanding sentences by people. This word segmentation method requires the use of a large amount of linguistic knowledge and information. Because of the generality and complexity of Chinese language knowledge, it is difficult to organize various language information into a form that can be directly read by a machine, so that the existing understanding-based word segmentation system is still in a test stage.
A word segmentation method based on character string matching is also called a mechanical word segmentation method: a Chinese character string to be analyzed is matched, according to a certain strategy, against the entries of a sufficiently large machine dictionary, and if a string is found in the dictionary, the match is successful (a word is identified). According to the scanning direction, string matching methods divide into forward matching and reverse matching; according to which length is matched preferentially, into maximum (longest) matching and minimum (shortest) matching; and according to whether they are combined with part-of-speech tagging, into simple word segmentation methods and integrated methods that combine segmentation and tagging. Several mechanical word segmentation methods are commonly used: (1) forward maximum matching (left-to-right direction); (2) reverse maximum matching (right-to-left direction); (3) least segmentation (minimizing the number of words cut from each sentence). These mechanical methods can also be combined; for example, the forward and reverse maximum matching methods can be combined into a bidirectional matching method. Because single Chinese characters can themselves form words, forward minimum matching and reverse minimum matching are rarely used. Generally speaking, the segmentation precision of reverse matching is slightly higher than that of forward matching, and it encounters fewer ambiguities. In practical word segmentation systems, mechanical segmentation is used as an initial segmentation step, and various other linguistic information is used to further improve segmentation accuracy.
One method of improvement is to improve the scanning mode, called feature scanning or mark segmentation, to identify and segment some words with obvious features in the character string to be analyzed, and to use these words as break points, to segment the original character string into smaller strings and then to perform mechanical segmentation, so as to reduce the matching error rate. Another improvement method is to combine word segmentation and part of speech tagging, to utilize rich part of speech information to provide help for word segmentation decision, and to check and adjust the word segmentation result in the tagging process, thereby greatly improving the segmentation accuracy.
In the above character-string-matching, that is mechanical, word segmentation methods, whether forward maximum matching, reverse maximum matching or minimum segmentation, the maximum matching approach tries to maximize the length matched between each word and the dictionary entries. Its advantages are a simple principle and easy implementation; its defect is that the maximum matching length is hard to choose: if it is too large, the time complexity increases, and if it is too small, words longer than the maximum matching length cannot be matched, lowering segmentation accuracy. The evaluation principle of maximum matching is "long word first". However, the existing maximum matching methods, whether forward or reverse, add or remove characters and perform maximum matching only within a local range, that is, each match covers only the first i characters or the last i characters, which does not fully embody the "long word first" principle.
The principle of the conventional statistical word segmentation method is that a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to constitute a word. The frequency or probability of characters co-occurring with their neighbours therefore reflects well the credibility of their forming a word. The frequencies of adjacent co-occurring character pairs in the material can be counted to compute their co-occurrence information, defined as the adjacent co-occurrence probability of two Chinese characters X and Y. This co-occurrence information embodies the closeness of the combining relationship between the characters: when the closeness exceeds a certain threshold, the character group is considered possibly to constitute a word. The method only needs to count character-group frequencies in the corpus and needs no dictionary, so it is called dictionary-free word segmentation or statistical word extraction. However, it has the limitation that common character groups with high co-occurrence frequency that are not words, such as "this", "one", "some", "my" and "many", are often extracted; the recognition accuracy for common words is poor, and the space-time overhead is large.
Disclosure of Invention
The invention aims to provide a Chinese word segmentation method based on word association characteristics, which is used for overcoming the defect that words cannot be effectively identified and extracted from a large-scale corpus in the prior art and realizing the effective identification and extraction of the words in the large-scale corpus by a computer system.
The technical scheme of the invention is as follows: a Chinese word segmentation method based on word association characteristics comprises the following steps:
a. selecting a text to be processed from a text library, preprocessing the text library, including removing symbols and forming sentences, and constructing a corpus by using the sentences after symbol removal;
b. performing word segmentation on the corpus in step a by adopting the word segmentation method of splicing characters front and back to form word segmentation fragments;
c. forming a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank by adopting a binary segmentation pre-post word splicing method, a ternary segmentation pre-post word splicing method and a quaternary segmentation pre-post word splicing method;
d. performing word frequency statistics on binary candidate words, ternary candidate words and quaternary candidate words in the binary candidate word library, the ternary candidate word library and the quaternary candidate word library;
e. setting a word frequency threshold for the candidate words with the counted word frequency, judging the word frequency threshold, reserving the candidate words meeting the threshold to form a new corpus, and deleting the candidate words if the candidate words do not meet the threshold;
f. calculating the degree of freedom and the degree of condensation of the candidate words in the corpus processed in step e, setting a uniform threshold of degree of freedom and degree of condensation for all candidate words, judging, retaining the candidate words that satisfy the judgment, and deleting those that do not;
g. and further filtering the screened ternary candidate words and quaternary candidate words by adopting a word segmentation filtering method to form a new word bank.
The word splicing method comprises continuously cutting and segmenting from the first character of a Chinese sentence so that all word-forming fragments are cut out, specifically as follows:
For the text content contained in a Chinese text, assume the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a character in the text and n ∈ N;
Performing binary segmentation and splicing on the text set with the binary front-and-back splicing method gives the binary text fragment set: {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Performing ternary segmentation and splicing on the text set with the ternary front-and-back splicing method gives the ternary text fragment set: {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Performing quaternary segmentation and splicing on the text set with the quaternary front-and-back splicing method gives the quaternary text fragment set: {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
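The three splicing operations above all amount to sliding an n-character window over the sentence, one character at a time. A minimal Python sketch (the function name `ngram_fragments` is illustrative, not from the patent):

```python
def ngram_fragments(text, n):
    """Front-and-back word splicing: slide an n-character window over
    the sentence, advancing one character at a time, so that every
    overlapping fragment is produced."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "abcde"  # stands in for the characters a_i ... a_{i+4}
binary = ngram_fragments(sentence, 2)      # ab, bc, cd, de
ternary = ngram_fragments(sentence, 3)     # abc, bcd, cde
quaternary = ngram_fragments(sentence, 4)  # abcd, bcde
```

The same function serves for n = 2, 3, 4, producing the binary, ternary and quaternary fragment sets respectively.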
The degree of freedom is defined as follows: when a text fragment appears in various different text sets, it has a left adjacent character set (the set of characters adjacent to it on the left) and a right adjacent character set (the set of characters adjacent to it on the right); the information entropy of each of the two sets is calculated, and the smaller of the two entropies is taken as the degree of freedom of the fragment.
In the obtained text fragment set, when a text fragment can appear in various different text sets and has a left adjacent character set and a right adjacent character set, the degree of freedom H of the fragment is obtained by calculating the information entropies of the left and right adjacent character sets, that is, H = min{s′, s″}, with

s′ = −Σ_{i=1}^{k} p(b_i) log p(b_i), where p(b_i) = n_{b_i} / Σ_{j=1}^{k} n_{b_j}

s″ = −Σ_{i=1}^{M} p(m_i) log p(m_i), where p(m_i) = n_{m_i} / Σ_{j=1}^{M} n_{m_j}

H denotes the degree of freedom of the candidate word; s′ is its right entropy, where b_i ranges over the k distinct characters of the right adjacent character set and n_{b_i} is the frequency with which b_i appears on the right of the candidate; s″ is its left entropy, where m_i ranges over the M distinct characters of the left adjacent character set and n_{m_i} is the frequency with which m_i appears on the left. The smaller of the two entropies is taken as the degree of freedom.
The degree of condensation is based on the fact that the probability of a new word appearing as a whole in a text is higher than the product of the probabilities of its component parts, that is, P(AB) > P(A)P(B). Hence, for every split of the candidate into parts A and B, the ratio

M = P(AB) / (P(A) · P(B))

is computed, and the minimum M over all splits is taken as the degree of condensation, where AB denotes the new word, P(AB) the probability of the new word appearing in the text, A and B the component parts of a split, and P(A) and P(B) the probabilities of those parts appearing in the text.
The degree of condensation of a candidate word is obtained by calculating the ratio of the joint probability of the candidate word in the corpus to the product of the independent probabilities of its parts, with the following specific steps:
(1) the degree of condensation M_2 of a binary candidate word is obtained from the ratio of the probability of the candidate word to the product of the probabilities of its characters:

M_2 = p(i, i+1) / (s_i · s_{i+1})

wherein M_2 denotes the degree of condensation of the candidate word, s_i = N_i / N the probability of the first character of the binary candidate appearing in the corpus, N_i the number of occurrences of the first character in the corpus, N_{i+1} the number of occurrences of the second character, N_{i,i+1} the number of occurrences of the binary candidate itself, N the total number of characters in the corpus, s_{i+1} = N_{i+1} / N the probability of the second character appearing in the corpus, and p(i, i+1) = N_{i,i+1} / N the probability of the binary candidate appearing in the corpus;
(2) the degree of condensation M_3 of a ternary candidate word is obtained from the ratio of the probability of the candidate word to the product of the probabilities of its parts, taking the minimum over its two splits:

M_3 = min{ P(i,i+1,i+2) / (s_i · s_{i+1,i+2}), P(i,i+1,i+2) / (s_{i,i+1} · s_{i+2}) }

wherein M_3 denotes the degree of condensation of the candidate word, s_i = N_i / N the probability of the first character of the ternary candidate appearing in the corpus, s_{i+1,i+2} = N_{i+1,i+2} / N the probability of its last two characters appearing together in the corpus, s_{i,i+1} = N_{i,i+1} / N the probability of its first two characters appearing together, s_{i+2} = N_{i+2} / N the probability of its last character appearing in the corpus, N_{i,i+1,i+2} the number of occurrences of the ternary candidate in the corpus, and P(i,i+1,i+2) = N_{i,i+1,i+2} / N the probability of the ternary candidate appearing in the corpus;
(3) the degree of condensation M_4 of a quaternary candidate word is obtained from the ratio of the probability of the candidate word to the product of the probabilities of its parts, taking the minimum over its three splits:

M_4 = min{ P(i,i+1,i+2,i+3) / (s_i · s_{i+1,i+2,i+3}), P(i,i+1,i+2,i+3) / (s_{i,i+1} · s_{i+2,i+3}), P(i,i+1,i+2,i+3) / (s_{i,i+1,i+2} · s_{i+3}) }

wherein M_4 denotes the degree of condensation of the candidate word, s_i = N_i / N the probability of the first character of the quaternary candidate appearing in the corpus, s_{i+1,i+2,i+3} = N_{i+1,i+2,i+3} / N the probability of its last three characters appearing together in the corpus, s_{i,i+1,i+2} = N_{i,i+1,i+2} / N the probability of its first three characters appearing together, s_{i,i+1} = N_{i,i+1} / N the probability of its first two characters appearing together, s_{i+2,i+3} = N_{i+2,i+3} / N the probability of its last two characters appearing together, N_{i+3} the number of occurrences of its fourth character in the corpus, N_{i,i+1,i+2,i+3} the number of occurrences of the quaternary candidate, and P(i, i+1, i+2, i+3) = N_{i,i+1,i+2,i+3} / N the probability of the quaternary candidate appearing in the corpus.
The filtering method for ternary candidate words is as follows: for a ternary candidate, if its last two characters exist in the binary candidate lexicon, judge whether its first character forms a word with its left-adjacent character; if its first two characters exist in the binary candidate lexicon, judge whether its last character forms a word with its right-adjacent character; thereby determine whether the ternary candidate is retained as a word;
if

(A_i A_{i+1}) ∈ L_2 and (A_{i-2} A_{i-1}) ∉ L_2, or (A_{i-1} A_i) ∈ L_2 and (A_{i+1} A_{i+2}) ∉ L_2

then (A_{i-1} A_i A_{i+1}) belongs to the ternary words;

wherein (A_{i-1} A_i A_{i+1}) is the ternary candidate word, A_{i-2} is the left-adjacent character of the ternary candidate, A_{i+2} is the right-adjacent character of the ternary candidate, {A_0 ... A_i ... A_N} is the corpus character set, and L_2 = {(A_0, A_1) ... (A_i, A_{i+1}) ...} is the binary candidate word set.
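As a rough sketch of this ternary filter (the index bounds and the exact pass condition are one reading of the rule above; the function name and arguments are illustrative, not from the patent):

```python
def is_ternary_word(text, i, binary_lexicon):
    """Filter for the ternary candidate text[i:i+3] = A_{i-1} A_i A_{i+1}:
    keep it when one of its internal bigrams is in the binary lexicon
    while the bigram formed with the adjacent outside character is not."""
    cand = text[i:i + 3]
    first_two, last_two = cand[:2], cand[1:]
    # bigram of the first character with its left-adjacent character
    left_pair = text[i - 1:i + 1] if i >= 1 else ""
    # bigram of the last character with its right-adjacent character
    right_pair = text[i + 2:i + 4] if i + 4 <= len(text) else ""
    if last_two in binary_lexicon and left_pair not in binary_lexicon:
        return True
    if first_two in binary_lexicon and right_pair not in binary_lexicon:
        return True
    return False
```

For example, with a binary lexicon containing only the middle-right pair, the candidate passes; adding the left-spanning pair to the lexicon blocks it.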
The filtering method for quaternary candidate words is as follows: first, a quaternary candidate that may form a word is split into segments, the first two characters forming one segment and the last two characters another; each segment is matched against the segmented binary lexicon, and candidates whose segments match are taken as pre-selected words. Then the middle two characters of the quaternary candidate are matched against the binary lexicon, and candidates whose middle pair does not match are taken as pre-selected words. A candidate satisfying both conditions is taken as a segmentation result:
(A_{i-2} A_{i-1}) ∈ L_2 and (A_i A_{i+1}) ∈ L_2

and

(A_{i-1} A_i) ∉ L_2

wherein (A_{i-2} A_{i-1} A_i A_{i+1}) denotes the quaternary candidate word, {(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} the front and rear segments into which the quaternary candidate is split, L_2 = {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} the binary word lexicon, and (A_{i-1} A_i) the middle pair of the quaternary candidate.
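The two conditions of the quaternary filter can be sketched directly (a reading of the rule above; the function name is illustrative):

```python
def is_quaternary_word(cand, binary_lexicon):
    """Filter for a four-character candidate A B C D: keep it when the
    outer bigrams AB and CD both match the binary lexicon while the
    middle bigram BC does not."""
    first, last, middle = cand[:2], cand[2:], cand[1:3]
    return (first in binary_lexicon
            and last in binary_lexicon
            and middle not in binary_lexicon)
```

So a candidate whose halves are both known binary words, but whose middle pair is not, survives the filter; if the middle pair is also a binary word, the candidate is rejected.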
The beneficial effects of the invention are: the method has high accuracy and effectiveness, the system can segment words efficiently, and the ternary and quaternary segmentation method designed around the degree of condensation and the degree of freedom solves well the problems of the traditional statistics-based word segmentation methods.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
Example 1: as shown in FIG. 1, a Chinese word segmentation method based on word association characteristics:
a. selecting a text to be processed from a text library, preprocessing the text library, including removing symbols and forming sentences, and constructing a corpus by using the sentences after symbol removal;
b. performing word segmentation on the corpus in step a by adopting the word segmentation method of splicing characters front and back to form word segmentation fragments;
c. forming a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank by adopting a binary segmentation pre-post word splicing method, a ternary segmentation pre-post word splicing method and a quaternary segmentation pre-post word splicing method;
d. performing word frequency statistics on binary candidate words, ternary candidate words and quaternary candidate words in the binary candidate word library, the ternary candidate word library and the quaternary candidate word library;
e. setting a word frequency threshold for the candidate words with the counted word frequency, judging the word frequency threshold, reserving the candidate words meeting the threshold to form a new corpus, and deleting the candidate words if the candidate words do not meet the threshold;
f. calculating the degree of freedom and the degree of condensation of the candidate words in the corpus processed in step e, setting a uniform threshold of degree of freedom and degree of condensation for all candidate words, judging, retaining the candidate words that satisfy the judgment, and deleting those that do not;
g. and further filtering the screened ternary candidate words and quaternary candidate words by adopting a word segmentation filtering method to form a new word bank.
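Steps d and e above amount to a frequency count followed by a threshold test. A minimal sketch (the threshold value is illustrative; the patent does not fix one):

```python
from collections import Counter

def threshold_filter(fragments, min_freq):
    """Step d: count candidate fragments; step e: keep only those
    whose frequency meets the word frequency threshold."""
    counts = Counter(fragments)
    return {w: c for w, c in counts.items() if c >= min_freq}

# Fragments produced by the splicing step; only "天气" reaches the threshold of 2.
kept = threshold_filter(["天气", "天气", "今天", "气不"], min_freq=2)
```

Candidates that fail the threshold are simply dropped, which is the deletion described in step e.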
The word splicing method comprises continuously cutting and segmenting from the first character of a Chinese sentence so that all word-forming fragments are cut out, specifically as follows:
For the text content contained in a Chinese text, assume the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a character in the text and n ∈ N;
Performing binary segmentation and splicing on the text set with the binary front-and-back splicing method gives the binary text fragment set: {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Performing ternary segmentation and splicing on the text set with the ternary front-and-back splicing method gives the ternary text fragment set: {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Performing quaternary segmentation and splicing on the text set with the quaternary front-and-back splicing method gives the quaternary text fragment set: {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
The degree of freedom is defined as follows: when a text fragment appears in various different text sets, it has a left adjacent character set (the set of characters adjacent to it on the left) and a right adjacent character set (the set of characters adjacent to it on the right); the information entropy of each of the two sets is calculated, and the smaller of the two entropies is taken as the degree of freedom of the fragment.
In the obtained text fragment set, the left adjacent character set of a text fragment is the set of characters that appear adjacent to its left; for example, the fragment (a_{i+1} a_{i+2}) in the text set {a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}} has the left adjacent character set {a_i}. The right adjacent character set of a text fragment is the set of characters that appear adjacent to its right; for the same fragment (a_{i+1} a_{i+2}) in that text set, the right adjacent character set is {a_{i+3}}.
In the obtained text fragment set, when a text fragment can appear in various different text sets and has a left adjacent character set and a right adjacent character set, the degree of freedom H of the fragment is obtained by calculating the information entropies of the left and right adjacent character sets, that is, H = min{s′, s″}, where H denotes the degree of freedom of the candidate word, s′ its right entropy and s″ its left entropy; the smaller of the two entropies is taken as the degree of freedom.
That is, the degree of freedom of a candidate word is obtained by calculating the information entropies of its left and right adjacent character sets and selecting the smaller of the two.
H=min{s′,s″}
H represents the degree of freedom of the candidate word, and s' represents the right entropy of the candidate word;
s′ = −Σ_{i=1}^{k} p(b_i) log p(b_i), where p(b_i) = n_{b_i} / Σ_{j=1}^{k} n_{b_j}

wherein b_i belongs to the right adjacent character set of the candidate word, n_{b_i} denotes the frequency with which b_i appears on the right of the candidate word, k denotes the number of distinct characters in the right adjacent character set, and s″ is the left entropy of the candidate word;

s″ = −Σ_{i=1}^{M} p(m_i) log p(m_i), where p(m_i) = n_{m_i} / Σ_{j=1}^{M} n_{m_j}

wherein m_i belongs to the left adjacent character set of the candidate word, n_{m_i} denotes the frequency with which m_i appears on the left of the candidate word, and M denotes the number of distinct characters in the left adjacent character set.
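A minimal sketch of this degree-of-freedom computation, collecting left- and right-adjacent characters over a list of texts (function names are illustrative, not from the patent):

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    """Shannon entropy of a list of adjacent characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def degree_of_freedom(word, corpus):
    """H = min{s', s''}: the smaller of the right- and left-neighbor
    entropies of `word` over all its occurrences in `corpus`."""
    left, right = [], []
    for text in corpus:
        start = text.find(word)
        while start != -1:
            if start > 0:
                left.append(text[start - 1])
            end = start + len(word)
            if end < len(text):
                right.append(text[end])
            start = text.find(word, start + 1)
    if not left or not right:
        return 0.0
    return min(neighbor_entropy(left), neighbor_entropy(right))
```

A candidate flanked by many different characters on both sides gets a high degree of freedom; one that always occurs inside the same larger string gets entropy near zero.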
The degree of condensation is based on the fact that the probability of a new word appearing as a whole in a text is higher than the product of the probabilities of its component parts, that is, P(AB) > P(A)P(B). Hence, for every split of the candidate into parts A and B, the ratio

M = P(AB) / (P(A) · P(B))

is computed, and the minimum M over all splits is taken as the degree of condensation, where AB denotes the new word, P(AB) the probability of the new word appearing in the text, A and B the component parts of a split, and P(A) and P(B) the probabilities of those parts appearing in the text.
The degree of condensation of a candidate word is obtained by calculating the ratio of the joint probability of the candidate word in the corpus to the product of the independent probabilities of its parts, with the following specific steps:
(1) the degree of condensation M_2 of a binary candidate word is obtained from the ratio of the probability of the candidate word to the product of the probabilities of its characters:

s_i = N_i / N

s_{i+1} = N_{i+1} / N

p(i, i+1) = N_{i,i+1} / N

M_2 = p(i, i+1) / (s_i · s_{i+1})

wherein M_2 denotes the degree of condensation of the candidate word, s_i the probability of the first character of the binary candidate appearing in the corpus, N_i the number of occurrences of the first character in the corpus, N_{i+1} the number of occurrences of the second character, N_{i,i+1} the number of occurrences of the binary candidate, N the total number of characters in the corpus, s_{i+1} the probability of the second character appearing in the corpus, and p(i, i+1) the probability of the binary candidate appearing in the corpus;
(2) the degree of condensation M_3 of a ternary candidate word is obtained from the ratio of the probability of the candidate word to the product of the probabilities of its parts:

s_{i,i+1} = N_{i,i+1} / N

s_{i+1,i+2} = N_{i+1,i+2} / N

s_{i+2} = N_{i+2} / N

P(i,i+1,i+2) = N_{i,i+1,i+2} / N

M_3 = min{ P(i,i+1,i+2) / (s_i · s_{i+1,i+2}), P(i,i+1,i+2) / (s_{i,i+1} · s_{i+2}) }

wherein M_3 denotes the degree of condensation of the candidate word, s_i the probability of the first character of the ternary candidate appearing in the corpus, s_{i+1,i+2} the probability of its last two characters appearing together in the corpus, s_{i,i+1} the probability of its first two characters appearing together, s_{i+2} the probability of its last character appearing in the corpus, N_i the number of occurrences of the first character, N_{i+2} the number of occurrences of the third character, N_{i+1,i+2} the number of occurrences of the last two characters, N_{i,i+1} the number of occurrences of the first two characters, N_{i,i+1,i+2} the number of occurrences of the ternary candidate, and P(i,i+1,i+2) the probability of the ternary candidate appearing in the corpus;
(3) obtaining the condensation degree M_4 of a quaternary candidate word from the ratio of the probability of the candidate word to its joint probability:

S_i = N_i / N,  S_{i+3} = N_{i+3} / N

S_{i,i+1} = N_{i,i+1} / N,  S_{i+2,i+3} = N_{i+2,i+3} / N

S_{i,i+1,i+2} = N_{i,i+1,i+2} / N,  S_{i+1,i+2,i+3} = N_{i+1,i+2,i+3} / N

P(i,i+1,i+2,i+3) = N_{i,i+1,i+2,i+3} / N

M_4 = min{ P(i,i+1,i+2,i+3) / (S_i · S_{i+1,i+2,i+3}),  P(i,i+1,i+2,i+3) / (S_{i,i+1} · S_{i+2,i+3}),  P(i,i+1,i+2,i+3) / (S_{i,i+1,i+2} · S_{i+3}) }

Where M_4 denotes the condensation degree of the candidate word, S_i the probability that the first character of the quaternary candidate word appears in the corpus, S_{i+1,i+2,i+3} the probability that its last three characters appear together in the corpus, S_{i,i+1,i+2} the probability that its first three characters appear together, S_{i,i+1} the probability that its first two characters appear together, S_{i+2,i+3} the probability that its last two characters appear together, N_i the number of occurrences of the first character of the quaternary candidate word in the corpus, N_{i+3} the number of occurrences of its fourth character, N_{i,i+1} the number of occurrences of its first two characters, N_{i+2,i+3} the number of occurrences of its last two characters, N_{i+1,i+2,i+3} the number of occurrences of its last three characters, N_{i,i+1,i+2} the number of occurrences of its first three characters, N the total number of characters in the corpus, and P(i,i+1,i+2,i+3) the probability that the quaternary candidate word appears in the corpus.
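A minimal Python sketch of the condensation computation (not part of the patent; the function name is hypothetical, and the min-over-split-points reading follows the minimum rule stated later in claim 5):

```python
from collections import Counter

def cohesion(ngram: str, counts: Counter, total: int) -> float:
    """Condensation degree of a candidate word: the ratio of the
    candidate's probability to the joint probability of each binary
    split, taking the minimum ratio over all split points."""
    p_ngram = counts[ngram] / total
    ratios = []
    for k in range(1, len(ngram)):
        left, right = ngram[:k], ngram[k:]
        p_left, p_right = counts[left] / total, counts[right] / total
        if p_left > 0 and p_right > 0:
            ratios.append(p_ngram / (p_left * p_right))
    return min(ratios) if ratios else 0.0
```

For a binary candidate there is a single split point, so this reduces to P(i,i+1)/(S_i · S_{i+1}); for ternary and quaternary candidates the minimum over two and three splits respectively is returned.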
The filtering method for ternary candidate words is as follows: for a ternary candidate word, if its last two characters exist in the binary candidate word library, judge whether its first character forms a word with the left-adjacent character; if its first two characters exist in the binary candidate word library, judge whether its last character forms a word with the right-adjacent character; this determines whether the ternary candidate word is retained as a candidate word.

If

[(A_i, A_{i+1}) ∈ W_2 and (A_{i-2}, A_{i-1}) ∉ W_2]  or  [(A_{i-1}, A_i) ∈ W_2 and (A_{i+1}, A_{i+2}) ∉ W_2]

then (A_{i-1}A_iA_{i+1}) belongs to the ternary words;

where (A_{i-1}A_iA_{i+1}) is the ternary candidate word, A_{i-2} is the left-adjacent character of the ternary candidate word, A_{i+2} is the right-adjacent character of the ternary candidate word, {A_0, ..., A_i, ..., A_N} is the character set of the corpus, and W_2 = {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{i-2},A_{i-1})} is the binary candidate word set.
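A sketch of the ternary filter in Python (hypothetical helper, not from the patent; the reading that a candidate is kept when either boundary condition holds is an assumption drawn from the prose above):

```python
def keep_ternary(cand: str, left: str, right: str, bigrams: set) -> bool:
    """cand is the 3-character candidate; left/right are its adjacent
    characters in the text (empty string at a sentence boundary).
    Keep the candidate if its last two characters form a known bigram
    while its first character does not bind to the left neighbour, or
    symmetrically for the first two characters and the right side."""
    a, b, c = cand
    tail_word = (b + c) in bigrams and (left + a) not in bigrams
    head_word = (a + b) in bigrams and (c + right) not in bigrams
    return tail_word or head_word
```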
The filtering method for quaternary candidate words is as follows: a quaternary candidate word that may become a word is first split into two segmentation fragments, the first two characters and the last two characters; each fragment is matched against the segmented binary word library, and candidates whose two fragments both match are taken as pre-selected words. Then the middle two characters of the quaternary word are segmented and matched against the segmented binary word library; candidates whose middle fragment does not match are kept as pre-selected words. A candidate satisfying both conditions is taken as a word segmentation result:

{(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} ⊆ {(A_0A_1) ... (A_iA_{i+1}) ... (A_NA_{N+1})}

and

(A_{i-1}A_i) ∉ {(A_0A_1) ... (A_iA_{i+1}) ... (A_NA_{N+1})}

where (A_{i-2}A_{i-1}A_iA_{i+1}) denotes the quaternary candidate word, {(A_{i-2},A_{i-1}), (A_i,A_{i+1})} denotes the front and rear fragments obtained by splitting the quaternary candidate word, {(A_0A_1) ... (A_NA_{N+1})} denotes the binary word library, and (A_{i-1}A_i) denotes the middle fragment of the quaternary candidate word.
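The two conditions above can be sketched in a few lines of Python (hypothetical helper, not from the patent):

```python
def keep_quaternary(cand: str, bigrams: set) -> bool:
    """Keep a 4-character candidate when both of its halves match the
    segmented binary word library (condition 1) while its middle two
    characters do not (condition 2)."""
    head, tail, mid = cand[:2], cand[2:], cand[1:3]
    return head in bigrams and tail in bigrams and mid not in bigrams
```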
Example 2: as shown in fig. 1, a text to be processed is selected from a text library, the text library is preprocessed, including removing symbols and forming sentences, and a corpus is constructed from the sentences after symbol removal.
The corpus constructed in the preceding step is segmented using the front-and-back word splicing method to form segmentation fragments.
And a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank are formed by adopting a binary segmentation front-and-back word splicing method, a ternary segmentation front-and-back word splicing method and a quaternary segmentation front-and-back word splicing method.
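The binary, ternary, and quaternary splicing steps above all take every run of n consecutive characters as one fragment; a minimal sketch (function name is an assumption):

```python
def ngram_fragments(sentence: str, n: int) -> list:
    """Front-and-back word splicing: every run of n consecutive
    characters of the sentence becomes one segmentation fragment."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]
```

Running it with n = 2, 3, and 4 over each sentence yields the binary, ternary, and quaternary candidate word banks respectively.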
A word frequency threshold is set for the counted candidate words; each candidate is judged against it, and those that satisfy the threshold are retained to form a new corpus.
Counting the degree of condensation of the candidate words and the degree of freedom of the candidate words; in this embodiment, the statistical aggregation degree of the candidate words may be obtained by calculating an independent probability and a joint probability ratio of the candidate words in the corpus; the degree of freedom of the candidate word can be calculated and the minimum entropy of the left and right adjacent word sets of the candidate word is selected as the degree of freedom of the candidate word.
And comparing the degree of condensation of the candidate words and the degree of freedom of the candidate words with a set threshold value.
And extracting candidate words larger than a threshold value to serve as a candidate word library.
In this example, "Journey to the West", one of the four well-known classical Chinese novels, was collected as the text library. In the counted word stock, a sufficiently distributed word-forming text fragment has a higher condensation degree and a larger degree of freedom than non-word-forming fragments. If the left- and right-adjacent characters of a word are regarded as random variables, the information entropies of the left- and right-adjacent character sets reflect the randomness of those neighbours: the larger the entropy value, the richer the left-adjacent or right-adjacent character set, and the smaller of the two entropies is taken as the degree of freedom.
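A minimal sketch of the degree-of-freedom computation (not part of the patent; the natural logarithm base is an assumption, as the text does not specify one):

```python
import math
from collections import Counter

def degree_of_freedom(left_neighbors: Counter, right_neighbors: Counter) -> float:
    """Degree of freedom of a candidate word: the smaller of the
    information entropies of its left-adjacent and right-adjacent
    character distributions."""
    def entropy(c: Counter) -> float:
        total = sum(c.values())
        if total == 0:
            return 0.0
        return -sum((n / total) * math.log(n / total) for n in c.values())
    return min(entropy(left_neighbors), entropy(right_neighbors))
```

A candidate that always appears after the same character (a single-element neighbour set) gets entropy 0 and is rejected by the freedom threshold, exactly as the text describes.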
In this embodiment, a word-forming fragment has a higher condensation degree, indicating a stronger association between its characters; when calculating the condensation degree over several split points, the smaller value is taken as the final condensation degree.
The word bank screened by the degree of freedom and the condensation degree serves as the candidate word bank; the ternary and quaternary candidate word banks are then processed by the ternary and quaternary word segmentation filtering methods to obtain the final word bank. These filtering methods remove candidates that are statistically plausible but are not actual words, thereby improving the effectiveness of the ternary and quaternary word banks.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (7)

1. A Chinese word segmentation method based on word association characteristics is characterized in that:
a. selecting a text to be processed from a text library, preprocessing the text library, including removing symbols and forming sentences, and constructing a corpus by using the sentences after symbol removal;
b. segmenting the corpus of step a by the front-and-back word splicing method to form segmentation fragments;
c. generating, from the segmentation fragments, a binary candidate word library by the binary front-and-back splicing method, a ternary candidate word library by the ternary front-and-back splicing method, and a quaternary candidate word library by the quaternary front-and-back splicing method;
d. respectively carrying out word frequency statistics on binary candidate words, ternary candidate words and quaternary candidate words in the binary candidate word library, the ternary candidate word library and the quaternary candidate word library;
e. setting a word frequency threshold for the candidate words with the counted word frequency, judging, reserving the candidate words meeting the word frequency threshold, forming a new corpus, and deleting the candidate words not meeting the word frequency threshold;
f. e, calculating the degree of freedom and the degree of condensation of the candidate words in the corpus processed in the step e, giving a uniform degree of freedom threshold and degree of condensation threshold of all the candidate words, judging, reserving the candidate words meeting the degree of freedom threshold and the degree of condensation threshold, and deleting the candidate words not meeting the degree of freedom threshold and the degree of condensation threshold;
g. further filtering the screened ternary candidate words and quaternary candidate words by adopting a word segmentation filtering method to form a new word bank;
the method for calculating the degree of cohesion of the candidate words is obtained by calculating the ratio of the probability of the candidate words in the corpus to the combined probability, and comprises the following specific steps:
(1) obtaining the binary candidate word condensation degree M according to the ratio of the probability of the binary candidate words to the combination probability of the binary candidate words2
Figure FDA0003122130300000011
Wherein S is2iProbability of occurrence of the first word of a binary candidate word in a corpus, S2i+1Representing the probability of the second word of the binary candidate word appearing in the corpus, and p (2i,2i +1) representing the probability of the binary candidate word appearing in the corpus;
(2) obtaining the ternary candidate word condensation degree M_3 from the ratio of the probability of the ternary candidate word to its joint probability:

M_3 = min{ P(3i,3i+1,3i+2) / (S_{3i} · S_{3i+1,3i+2}),  P(3i,3i+1,3i+2) / (S_{3i,3i+1} · S_{3i+2}) }

where S_{3i} denotes the probability that the first character of the ternary candidate word appears in the corpus, S_{3i+1,3i+2} the probability that its last two characters appear together in the corpus, S_{3i,3i+1} the probability that its first two characters appear together, S_{3i+2} the probability that its last character appears, and P(3i,3i+1,3i+2) the probability that the ternary candidate word appears in the corpus;
(3) obtaining the quaternary candidate word condensation degree M_4 from the ratio of the probability of the quaternary candidate word to its joint probability:

M_4 = min{ P(4i,...,4i+3) / (S_{4i} · S_{4i+1,4i+2,4i+3}),  P(4i,...,4i+3) / (S_{4i,4i+1} · S_{4i+2,4i+3}),  P(4i,...,4i+3) / (S_{4i,4i+1,4i+2} · S_{4i+3}) }

where S_{4i} denotes the probability that the first character of the quaternary candidate word appears in the corpus, S_{4i+1,4i+2,4i+3} the probability that its last three characters appear together in the corpus, S_{4i,4i+1,4i+2} the probability that its first three characters appear together, S_{4i,4i+1} the probability that its first two characters appear together, S_{4i+2,4i+3} the probability that its last two characters appear together, S_{4i+3} the probability that its last character appears, and P(4i,4i+1,4i+2,4i+3) the probability that the quaternary candidate word appears in the corpus.
2. The method of claim 1, wherein the method comprises: the front-and-back word splicing segmentation method performs continuous segmentation of a Chinese sentence starting from its first character and extracts all of its possible word-forming fragments, specifically:
for the text content contained in a Chinese text, assume:

text set {a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes one character in the text and n ∈ N;

applying the binary front-and-back splicing method to the text set yields the binary text fragment set: {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), (a_{i+4} a_{i+5}), ..., (a_{i-1+n} a_{i+n})};

applying the ternary front-and-back splicing method yields the ternary text fragment set: {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), (a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};

applying the quaternary front-and-back splicing method yields the quaternary text fragment set: {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
3. The method of claim 1, wherein the method comprises: the degree of freedom is defined as follows: when a text fragment appears in various different text sets, it has a left-adjacent character set and a right-adjacent character set, where the left-adjacent character set is the set of characters adjacent to the left side of the text fragment and the right-adjacent character set is the set of characters adjacent to its right side; the information entropy of the text fragment is obtained by calculating the information entropies of the left-adjacent and right-adjacent character sets, and the minimum of the two is taken as the degree of freedom.
4. The method of claim 3, wherein the method comprises: the information entropy H of a text fragment is obtained by calculating the information entropies of its left-adjacent and right-adjacent character sets, i.e. H = min{s', s''},

s' = -Σ_{a∈L} p_L(a) log p_L(a),  s'' = -Σ_{a∈R} p_R(a) log p_R(a)

where L and R denote the left-adjacent and right-adjacent character sets of the candidate word, p_L(a) and p_R(a) the occurrence probabilities of character a within them, H the degree of freedom of the candidate word, s' the left entropy of the candidate word, and s'' the right entropy; the minimum information entropy of the left-adjacent and right-adjacent character sets is taken as the degree of freedom.
5. The method of claim 1, wherein the method comprises: the condensation degree is calculated as follows: let

M = P(AB) / (P(A) · P(B))

and take the minimum M over all such splits as the condensation degree, where AB denotes a new word, P(AB) the probability of the new word appearing in the text, A and B the two component fragments of a split, P(A) the probability of fragment A appearing in the text, and P(B) the probability of fragment B appearing in the text.
6. The method of claim 1, wherein the method comprises: the filtering method for ternary candidate words is as follows: for a ternary candidate word, if its last two characters exist in the binary candidate word library, judge whether its first character forms a word with the left-adjacent character; if its first two characters exist in the binary candidate word library, judge whether its last character forms a word with the right-adjacent character; this determines whether the ternary candidate word is retained as a candidate word;

if

[(A_i, A_{i+1}) ∈ W_2 and (A_{i-2}, A_{i-1}) ∉ W_2]  or  [(A_{i-1}, A_i) ∈ W_2 and (A_{i+1}, A_{i+2}) ∉ W_2]

then (A_{i-1}A_iA_{i+1}) belongs to the ternary words;

where (A_{i-1}A_iA_{i+1}) is the ternary candidate word, A_{i-2} is the left-adjacent character of the ternary candidate word, A_{i+2} is the right-adjacent character of the ternary candidate word, {A_0, ..., A_i, ..., A_N} is the character set of the corpus, and W_2 = {(A_0,A_1), ..., (A_i,A_{i+1}), ..., (A_{i-2},A_{i-1})} is the binary candidate word set.
7. The method of claim 1, wherein the method comprises: the filtering method for quaternary candidate words requires a candidate to satisfy two conditions. Condition 1: the candidate is first split so that its first two characters form one segmentation fragment and its last two characters form another; both fragments are matched against the segmented binary word library, and a candidate whose two fragments both match is taken as a pre-selected word. Condition 2: the middle two characters of the quaternary word are then segmented and matched against the segmented binary word library, and the candidate is kept only if the middle fragment does not match. A pre-selected quaternary candidate word satisfying both conditions is taken as a word segmentation result.
CN201711293044.8A 2017-12-08 2017-12-08 Chinese word segmentation method based on word association characteristics Active CN108845982B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711293044.8A CN108845982B (en) 2017-12-08 2017-12-08 Chinese word segmentation method based on word association characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711293044.8A CN108845982B (en) 2017-12-08 2017-12-08 Chinese word segmentation method based on word association characteristics

Publications (2)

Publication Number Publication Date
CN108845982A CN108845982A (en) 2018-11-20
CN108845982B true CN108845982B (en) 2021-08-20

Family

ID=64211732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711293044.8A Active CN108845982B (en) 2017-12-08 2017-12-08 Chinese word segmentation method based on word association characteristics

Country Status (1)

Country Link
CN (1) CN108845982B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858011B (en) * 2018-11-30 2022-08-19 平安科技(深圳)有限公司 Standard word bank word segmentation method, device, equipment and computer readable storage medium
CN110334345A (en) * 2019-06-17 2019-10-15 首都师范大学 New word discovery method
CN110287488A (en) * 2019-06-18 2019-09-27 上海晏鼠计算机技术股份有限公司 A kind of Chinese text segmenting method based on big data and Chinese feature
CN110287493B (en) * 2019-06-28 2023-04-18 中国科学技术信息研究所 Risk phrase identification method and device, electronic equipment and storage medium
CN110442861B (en) * 2019-07-08 2023-04-07 万达信息股份有限公司 Chinese professional term and new word discovery method based on real world statistics
CN111125329B (en) * 2019-12-18 2023-07-21 东软集团股份有限公司 Text information screening method, device and equipment
CN112711944B (en) * 2021-01-13 2023-03-10 深圳前瞻资讯股份有限公司 Word segmentation method and system, and word segmentation device generation method and system
CN116431930A (en) * 2023-06-13 2023-07-14 天津联创科技发展有限公司 Technological achievement conversion data query method, system, terminal and storage medium
CN116541527B (en) * 2023-07-05 2023-09-29 国网北京市电力公司 Document classification method based on model integration and data expansion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622341A (en) * 2012-04-20 2012-08-01 北京邮电大学 Domain ontology concept automatic-acquisition method based on Bootstrapping technology
CN105488098A (en) * 2015-10-28 2016-04-13 北京理工大学 Field difference based new word extraction method
CN105955950A (en) * 2016-04-29 2016-09-21 乐视控股(北京)有限公司 New word discovery method and device
CN106126495A (en) * 2016-06-16 2016-11-16 北京捷通华声科技股份有限公司 A kind of based on large-scale corpus prompter method and apparatus
CN107180025A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of recognition methods of neologisms and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622341A (en) * 2012-04-20 2012-08-01 北京邮电大学 Domain ontology concept automatic-acquisition method based on Bootstrapping technology
CN105488098A (en) * 2015-10-28 2016-04-13 北京理工大学 Field difference based new word extraction method
CN105955950A (en) * 2016-04-29 2016-09-21 乐视控股(北京)有限公司 New word discovery method and device
CN106126495A (en) * 2016-06-16 2016-11-16 北京捷通华声科技股份有限公司 A kind of based on large-scale corpus prompter method and apparatus
CN107180025A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of recognition methods of neologisms and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on an Improved Forward Maximum Matching Chinese Word Segmentation Algorithm; Wang Huixian et al.; Journal of Guizhou University (Natural Science Edition); 31 Oct 2011; Vol. 28, No. 5; pp. 112-119 *

Also Published As

Publication number Publication date
CN108845982A (en) 2018-11-20

Similar Documents

Publication Publication Date Title
CN108845982B (en) Chinese word segmentation method based on word association characteristics
US10949709B2 (en) Method for determining sentence similarity
CN105786991B (en) In conjunction with the Chinese emotion new word identification method and system of user feeling expression way
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN108710611B (en) Short text topic model generation method based on word network and word vector
CN112818694A (en) Named entity recognition method based on rules and improved pre-training model
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN109947951B (en) Automatically-updated emotion dictionary construction method for financial text analysis
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN109829151B (en) Text segmentation method based on hierarchical dirichlet model
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN110674296B (en) Information abstract extraction method and system based on key words
CN106598941A (en) Algorithm for globally optimizing quality of text keywords
CN104598530B (en) A kind of method that field term extracts
CN110929022A (en) Text abstract generation method and system
CN111177375A (en) Electronic document classification method and device
CN111061873B (en) Multi-channel text classification method based on Attention mechanism
CN106610949A (en) Text feature extraction method based on semantic analysis
CN109299463B (en) Emotion score calculation method and related equipment
CN107038155A (en) The extracting method of text feature is realized based on improved small-world network model
CN111858900B (en) Method, device, equipment and storage medium for generating question semantic parsing rule template
CN112948527A (en) Improved TextRank keyword extraction method and device
CN116453013A (en) Video data processing method and device
CN107807918A (en) The method and device of Thai words recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant