CN103473221A - Chinese lexical analysis method - Google Patents

Publication number
CN103473221A
Authority
CN
China
Legal status
Granted
Application number
CN2013104215385A
Other languages
Chinese (zh)
Other versions
CN103473221B (en)
Inventor
于江德
刘运通
王希杰
胡顺义
郑霞
葛彦强
王继鹏
Current Assignee
Anyang Normal University
Original Assignee
Individual
Application filed by Individual
Priority to CN201310421538.5A
Publication of CN103473221A
Application granted
Publication of CN103473221B
Status: Active

Abstract

The invention discloses a Chinese lexical analysis method comprising the following steps: (1) obtaining feature functions and their weights from a given training corpus; (2) segmenting the input Chinese text into multiple sentences, each sentence being a character sequence; (3) calculating the conditional probability of every possible lexical information tag sequence of each character sequence of the input Chinese text; (4) determining the final lexical information tag sequence of each character sequence; (5) performing Chinese word segmentation, Chinese part-of-speech (POS) tagging and Chinese named entity recognition, thereby obtaining the final Chinese lexical analysis result. The method unifies the three subtasks of Chinese lexical analysis in a single character-sequence tagging framework. This overcomes the upward propagation, amplification and accumulation of errors and the difficulty of integrating multiple kinds of information. The calculation is simple and the computational cost is low; no dictionary is needed, and unknown words are also segmented and tagged well.

Description

Chinese lexical analysis method
Technical field
The present invention relates to a Chinese lexical analysis method.
Background art
In the field of Chinese information processing, Chinese lexical analysis is an important basic research problem. It is the foundation of deeper Chinese information processing such as syntactic analysis, semantic analysis and text understanding, and a key link in applications such as machine translation, question answering, information retrieval and information extraction. Chinese lexical analysis mainly comprises three subtasks: Chinese word segmentation, part-of-speech tagging and named entity recognition. In related evaluations at home and abroad, these are usually evaluated as three independent tasks. In existing research, most scholars are likewise accustomed to treating the three subtasks independently, and in particular to processing word segmentation and part-of-speech tagging in succession, so that part-of-speech tagging is addressed on the word sequence only after segmentation. Handling the three subtasks independently in this way lets errors propagate upward, amplify and accumulate, and makes it difficult to integrate and exploit multiple kinds of information.
To address this problem, some scholars have explored integrating the three tasks of word segmentation, part-of-speech tagging and named entity recognition. Document [1] (Liu Qun, Zhang Huaping, Yu Hongkui, et al. Chinese lexical analysis based on the cascaded hidden Markov model. Journal of Computer Research and Development, 2004, 41(8): 1421-1429) discloses a Chinese lexical analysis method based on a cascaded hidden Markov model, which integrates Chinese word segmentation, part-of-speech tagging and unknown word recognition into one complete theoretical framework. However, that method still requires the support of a dictionary, and its part-of-speech tagging is still carried out on the word sequence. Patent document [2] (Chinese patent application of the Institute of Computing Technology, Chinese Academy of Sciences, filed on June 13, 2008 and published on October 29, 2008 under publication number CN101295295A, entitled "Chinese lexical analysis method based on linear model") discloses a Chinese lexical analysis method based on a linear model. That method uses a perceptron model to analyze a sentence character by character and derives the segmentation tag and part-of-speech tag of the current character. It is computationally complex and costly, and it does not cover named entity recognition.
In view of this, the present invention is proposed.
Summary of the invention
The object of the present invention is to provide a Chinese lexical analysis method that unifies the three tasks of Chinese lexical analysis in a character-sequence tagging framework, dispenses with dictionaries entirely, and includes named entity recognition.
To solve the above technical problem, the basic concept of the technical solution adopted by the present invention is as follows:
A Chinese lexical analysis method comprises the following steps:
1) Obtaining feature functions and weights from a given training corpus:
A sample window size is set and a feature template set is selected. According to the set sample window size, contextual features are expanded from the given training corpus by means of the feature template set. Each feature corresponds to a group of feature functions, so multiple groups of contextual features correspond to multiple groups of feature functions. The weights of these feature functions are then estimated, and the weights together form a weight vector;
2) Segmenting the input Chinese text: the input Chinese text is segmented into multiple sentences, each sentence being a character sequence;
3) Calculating the conditional probability of every possible lexical information tag sequence of the character sequences corresponding to the input Chinese text:
All possible lexical information tag sequences are obtained for each character sequence corresponding to the input Chinese text, and the conditional probability of each lexical information tag sequence is calculated. A lexical information tag sequence is the sequence formed by the lexical information tags of all characters in a character sequence, and a lexical information tag carries three kinds of information: word-position information, part-of-speech information and named entity information;
4) Determining the final lexical information tag sequence of the character sequences corresponding to the input Chinese text:
The lexical information tag sequence with the highest conditional probability is taken as the final lexical information tag sequence of the character sequence corresponding to the input Chinese text;
5) Performing Chinese word segmentation, part-of-speech tagging and Chinese named entity recognition, thereby obtaining the final Chinese lexical analysis result:
Chinese word segmentation is performed according to the "word-position information" in the final lexical information tag sequence, giving the segmentation result;
Part-of-speech tagging is performed according to the "part-of-speech information" in the final lexical information tag sequence, giving the part-of-speech tagging result; or Chinese named entity recognition is performed according to the "named entity information" in the final lexical information tag sequence, giving the named entity recognition result;
For a multi-character word, the part-of-speech information or named entity information in the lexical information tag of the word-final character is taken as the part of speech or named entity of the whole word.
Further, the conditional probability of each lexical information tag sequence in step 3) is calculated as follows:
Let the character sequence corresponding to the input Chinese text be $O = \{o_1, o_2, \ldots, o_T\}$ and a lexical information tag sequence of that character sequence be $S = \{s_1, s_2, \ldots, s_T\}$, where $s_1, s_2, \ldots, s_T$ correspond one-to-one with $o_1, o_2, \ldots, o_T$, and let the weight vector be $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_K\}$. The conditional probability of the lexical information tag sequence is:
$$P_\Lambda(S \mid O) = \frac{1}{Z_O} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(s_{t-1}, s_t, o, t) \right) \tag{1}$$
where $Z_O$ is the normalization factor, $f_k(s_{t-1}, s_t, o, t)$ is a feature function, which is a binary indicator function, and $\lambda_k$ is the weight of the feature function $f_k(s_{t-1}, s_t, o, t)$.
Further, in step 1) the weights of the feature functions are obtained by maximum log-likelihood estimation, specifically as follows:
The training corpus is expressed as $D = \{(O^{(i)}, S^{(i)})\}_{i=1}^{N}$, where the character sequence $O^{(i)}$ corresponding to the Chinese text is the input data sequence and the lexical information tag sequence $S^{(i)}$ of that character sequence is the corresponding output data sequence. Under the training corpus $D$, the log-likelihood is:
$$L_\Lambda = \sum_{i=1}^{N} \log P_\Lambda(S^{(i)} \mid O^{(i)}) \tag{4}$$
Substituting formula (1) into formula (4) gives:
$$L_\Lambda = \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \log Z_{O^{(i)}} \right) \tag{5}$$
A Gaussian prior is used to regularize the weights, after which formula (5) becomes:
$$L_\Lambda = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \log Z_{O^{(i)}} - \sum_{k} \frac{\lambda_k^2}{2\sigma^2} \tag{6}$$
where $\sum_{k} \lambda_k^2 / (2\sigma^2)$ is the Gaussian prior term on the weights and $\sigma^2$ is the prior variance. Taking the first derivative of formula (6) with respect to $\lambda_k$ gives formula (7):
$$\frac{\partial L_\Lambda}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s, s'} f_k(s, s', o^{(i)}, t)\, p(s, s' \mid o^{(i)}) - \frac{\lambda_k}{\sigma^2} \tag{7}$$
where $\sum_{i=1}^{N} \sum_{t=1}^{T} f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t)$ is the expected value of the feature function $f_k$ under the empirical distribution, and $\sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s, s'} f_k(s, s', o^{(i)}, t)\, p(s, s' \mid o^{(i)})$ is the expected value of $f_k$ under the weight vector $\Lambda$. The optimal value of the weight $\lambda_k$ is calculated from formula (7) and used as the weight of the feature function.
Preferably, the sample window in step 1) is set to the "5-character sample window" and the feature template set is selected as "TMPT-10".
The word-position information comprises B (word-initial), M (word-medial), E (word-final) and S (single-character word); the named entity information is a person name, place name or organization name.
The beneficial effects of the present invention are as follows. 1. The three subtasks of Chinese lexical analysis (Chinese word segmentation, part-of-speech tagging and named entity recognition) are unified and realized in a character-sequence tagging framework. The tag of each character carries three kinds of lexical information: word position, part of speech and named entity. Chinese lexical analysis based on this three-in-one lexical information tagging therefore overcomes the upward propagation, amplification and accumulation of errors and the difficulty of integrating multiple kinds of information. The calculation is simple, the computational cost is low, and the precision of Chinese word segmentation, part-of-speech tagging and named entity recognition can be significantly improved;
2. Dictionaries are dispensed with entirely, achieving truly dictionary-free Chinese lexical analysis;
3. Out-of-vocabulary words are also segmented and tagged well, especially the three classes of named entities: person names, place names and organization names.
Brief description of the drawings
Fig. 1 is a flowchart of the present invention;
Fig. 2 is a schematic diagram of the "5-character sample window" in a specific embodiment of the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.
With reference to Fig. 1, the present invention is a Chinese lexical analysis method comprising the following steps:
Z1. Obtaining feature functions and weights from a given training corpus:
A sample window size is set and a feature template set is selected. According to the set sample window size, contextual features are expanded from the given training corpus by means of the feature template set. Each feature corresponds to a group of feature functions, so multiple groups of contextual features correspond to multiple groups of feature functions. The weights of these feature functions are then estimated, and the weights together form a weight vector;
The key to feature selection is choosing contextual features suited to the specific task. This involves choosing the context and setting the feature template set, that is, setting the sample window size and selecting the feature template set;
In general, the context is chosen within a certain range to the left and right of the current character, and this fixed range is called a "window". The context within a window is essentially one specific sample, so the window is called a "sample window". The sample window may be limited to a "5-character window", which uses the two characters before and the two characters after the current character as context, or to a "3-character window", which uses one character before and one character after the current character. The present invention adopts the "5-character window" as the sample window; see Fig. 2.
A feature template set is a set of feature templates. The main function of a feature template is to define the association between the linguistic elements or information at certain specific positions in the context and a class of event to be predicted. Because the present invention determines the lexical information tag of a character from the current character in a character sequence and its context, the contextual features are determined by the characters appearing before and after that character, combinations of characters, word-position information, part-of-speech information, named entity information, and the positions at which these occur. Conventionally, a feature template can be regarded as an abstraction of a group of contextual features according to a common attribute. Each feature corresponds to a group of feature functions.
Under the "5-character sample window" shown in Fig. 2, contextual features can be abstracted according to the distance between the characters appearing in a feature template and the current character. With the sample window limited to the "5-character window", the contextual features of this task are the features formed by the current character itself, the two characters on each side of it, and their lexical information tags. Under the "5-character sample window", the common contextual features are abstracted into 10 classes of character templates, $C_{-2}$, $C_{-1}$, $C_0$, $C_1$, $C_2$, $C_{-2}C_{-1}$, $C_{-1}C_0$, $C_0C_1$, $C_1C_2$, $C_{-1}C_1$, together with the tag template $T_{-1}T_0$; the feature template set formed by these templates is denoted TMPT-10. In a template, $C_n$ denotes the current character or a character at a given distance from it. For example, $C_0$ denotes the current character, $C_1$ the character after it, $C_{-1}$ the character before it, and so on. The last template, $T_{-1}T_0$, characterizes the state-transition feature $T_{i-1} \rightarrow T_i$ between two adjacent lexical information tags in the context.
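The expansion of the ten character templates at one position can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `name=value` string encoding and the padding symbols `<s>` and `</s>` are hypothetical and do not come from the source. The tag template $T_{-1}T_0$ is not instantiated here, because it pairs adjacent tags rather than characters.

```python
from typing import List

# Padding symbols for positions outside the sentence (an assumption; not given in the source).
PAD_LEFT, PAD_RIGHT = "<s>", "</s>"

# The ten character templates, written as offsets relative to the current character.
CHAR_TEMPLATES = [
    (-2,), (-1,), (0,), (1,), (2,),     # C-2, C-1, C0, C1, C2
    (-2, -1), (-1, 0), (0, 1), (1, 2),  # adjacent character pairs
    (-1, 1),                            # C-1C1, skipping the current character
]


def char_at(chars: List[str], i: int) -> str:
    """Return the character at index i, or a padding symbol outside the sentence."""
    if i < 0:
        return PAD_LEFT
    if i >= len(chars):
        return PAD_RIGHT
    return chars[i]


def expand_context_features(chars: List[str], t: int) -> List[str]:
    """Instantiate the ten character templates at position t of a character sequence."""
    features = []
    for offsets in CHAR_TEMPLATES:
        name = "".join(f"C{o}" for o in offsets)
        value = "|".join(char_at(chars, t + o) for o in offsets)
        features.append(f"{name}={value}")
    return features


# Features of the first character of a four-character sequence.
first = expand_context_features(list("中国政府"), 0)
```

Each returned string, combined with a candidate tag, would identify one binary feature function; how the source indexes its feature functions is not stated, so that pairing is left out of the sketch.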
Z2. Segmenting the input Chinese text: the input Chinese text is segmented into multiple sentences, each sentence being a character sequence $C_1 C_2 \ldots C_n$;
Each character sequence can be tagged with multiple lexical information tag sequences. That is, in theory a sentence generally admits many lexical information tag sequences, but the one finally determined as the lexical information tag sequence of the sentence is the most probable one, which is why step Z3 is needed;
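Step Z2 can be sketched as below. The source does not say which punctuation marks end a sentence, so the delimiter set here is an assumption (the comma is included only because the worked example sentence ends with one), and `split_sentences` is a hypothetical name.

```python
import re
from typing import List

# Assumed sentence-ending punctuation; the source does not list the delimiters.
DELIMITERS = "。！？；，"

# A sentence is a run of non-delimiters closed by one delimiter,
# or a trailing run of non-delimiters at the end of the text.
_SENTENCE_RE = re.compile(rf"[^{DELIMITERS}]*[{DELIMITERS}]|[^{DELIMITERS}]+")


def split_sentences(text: str) -> List[List[str]]:
    """Split text into sentences, each returned as a list of characters."""
    sentences = []
    for match in _SENTENCE_RE.finditer(text):
        sentence = match.group().strip()
        if sentence:
            sentences.append(list(sentence))
    return sentences
```

The delimiter is kept as the last character of each sentence, so it receives a tag of its own in the later steps.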
Z3. Calculating the conditional probability of every possible lexical information tag sequence of the character sequences corresponding to the input Chinese text:
All possible lexical information tag sequences are obtained for each character sequence corresponding to the input Chinese text, and the conditional probability of each lexical information tag sequence is calculated. A lexical information tag sequence is the sequence formed by the lexical information tags of all characters in a character sequence. A lexical information tag carries three kinds of information (word-position information, part-of-speech information and named entity information) and takes the form "word position_part of speech or named entity", for example B_v or B_ORG;
Z4. Determining the final lexical information tag sequence of the character sequences corresponding to the input Chinese text:
The lexical information tag sequence with the highest conditional probability is taken as the final lexical information tag sequence of the character sequence corresponding to the input Chinese text. The final lexical information tag sequence can be expressed by formula (3):
$$S^* = \arg\max_{S} P_\Lambda(S \mid O) \tag{3}$$
The meaning of $P_\Lambda(S \mid O)$ is explained below.
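Formula (3) asks for the tag sequence with the highest probability. Enumerating every sequence, as the step is worded, grows exponentially with sentence length. When each feature function looks only at the adjacent tags $s_{t-1}$ and $s_t$, the same maximizer can be found with the Viterbi algorithm, because the normalization factor does not depend on $S$ and the exponential is monotonic. The sketch below shows that swapped-in technique, which the source does not specify. The names `viterbi` and `brute_force` and the `score` callback, which stands for $\sum_k \lambda_k f_k(s_{t-1}, s_t, o, t)$ at position `t`, are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence

# score(prev_tag, cur_tag, t): weighted feature sum at position t (prev_tag is None at t = 0).
Score = Callable[[Optional[str], str, int], float]


def viterbi(T: int, tags: Sequence[str], score: Score) -> List[str]:
    """Return the tag sequence of length T that maximizes the summed position scores."""
    if T == 0:
        return []
    best: List[Dict[str, float]] = [{s: score(None, s, 0) for s in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, T):
        cur, ptr = {}, {}
        for s in tags:
            # Best previous tag for ending at tag s in position t.
            prev = max(tags, key=lambda p: best[-1][p] + score(p, s, t))
            cur[s] = best[-1][prev] + score(prev, s, t)
            ptr[s] = prev
        best.append(cur)
        back.append(ptr)
    # Trace the back-pointers from the best final tag.
    path = [max(tags, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def brute_force(T: int, tags: Sequence[str], score: Score) -> List[str]:
    """Reference implementation: enumerate all len(tags) ** T sequences."""
    def total(seq):
        return sum(score(seq[t - 1] if t else None, seq[t], t) for t in range(T))
    return list(max(product(tags, repeat=T), key=total))
```

`brute_force` mirrors the enumeration described in steps Z3 and Z4 and is practical only for very short sentences; `viterbi` visits each position once and considers every pair of adjacent tags there.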
Z5. Performing Chinese word segmentation, part-of-speech tagging and Chinese named entity recognition, thereby obtaining the final Chinese lexical analysis result:
Chinese word segmentation is performed according to the "word-position information" in the final lexical information tag sequence, giving the segmentation result;
Part-of-speech tagging is performed according to the "part-of-speech information" in the final lexical information tag sequence, giving the part-of-speech tagging result; or Chinese named entity recognition is performed according to the "named entity information" in the final lexical information tag sequence, giving the named entity recognition result;
For a multi-character word, the part-of-speech information or named entity information in the lexical information tag of the word-final character is taken as the part of speech or named entity of the whole word.
Further, in step Z3 the conditional probability of each lexical information tag sequence is calculated as follows:
Let the character sequence corresponding to the input Chinese text be $O = \{o_1, o_2, \ldots, o_T\}$ and a lexical information tag sequence of that character sequence be $S = \{s_1, s_2, \ldots, s_T\}$, where $s_1, s_2, \ldots, s_T$ correspond one-to-one with $o_1, o_2, \ldots, o_T$, and let the weight vector be $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_K\}$. The conditional probability of the lexical information tag sequence is:
$$P_\Lambda(S \mid O) = \frac{1}{Z_O} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(s_{t-1}, s_t, o, t) \right) \tag{1}$$
where $Z_O$ is the normalization factor; $f_k(s_{t-1}, s_t, o, t)$ is a feature function, which is a binary indicator function; and $\lambda_k$ is the weight of the feature function $f_k(s_{t-1}, s_t, o, t)$, whose value may range from $-\infty$ to $+\infty$. The feature function $f_k(s_{t-1}, s_t, o, t)$ can incorporate any contextual lexical information feature, including the state-transition feature $s_{t-1} \rightarrow s_t$ and all lexical information features of the character sequence $O$ (here the observation sequence) at time $t$ (the position of the current character).
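Formula (1) can be made concrete with a small sketch that enumerates every tag sequence of a two-character input, exactly as steps Z3 and Z4 are worded. The three feature functions and their weights are hypothetical values chosen for illustration, and the tag set is reduced to the four word-position tags; none of this comes from the source.

```python
import math
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# A binary feature function f_k(s_prev, s_cur, o, t); s_prev is None at t = 0.
Feature = Callable[[Optional[str], str, Sequence[str], int], int]


def sequence_score(S: Sequence[str], O: Sequence[str],
                   features: List[Feature], weights: List[float]) -> float:
    """Sum over t and k of lambda_k * f_k(s_{t-1}, s_t, o, t), the exponent in formula (1)."""
    total = 0.0
    for t in range(len(O)):
        prev = S[t - 1] if t > 0 else None
        for f, lam in zip(features, weights):
            total += lam * f(prev, S[t], O, t)
    return total


def conditional_probabilities(O: Sequence[str], tags: Sequence[str],
                              features: List[Feature],
                              weights: List[float]) -> Dict[Tuple[str, ...], float]:
    """Return P(S | O) for every tag sequence S, by explicit enumeration."""
    scores = {S: sequence_score(S, O, features, weights)
              for S in product(tags, repeat=len(O))}
    Z = sum(math.exp(v) for v in scores.values())  # normalization factor Z_O
    return {S: math.exp(v) / Z for S, v in scores.items()}


# Illustrative features and weights (hypothetical, not from the source).
FEATURES: List[Feature] = [
    lambda p, s, o, t: int(o[t] == "中" and s == "B"),  # current character is 中, tagged B
    lambda p, s, o, t: int(p == "B" and s == "E"),      # transition B -> E
    lambda p, s, o, t: int(o[t] == "国" and s == "E"),  # current character is 国, tagged E
]
WEIGHTS = [1.0, 2.0, 0.5]

probs = conditional_probabilities(["中", "国"], ("B", "M", "E", "S"), FEATURES, WEIGHTS)
best = max(probs, key=probs.get)  # formula (3); here best == ("B", "E")
```

With four tags and two characters there are 16 candidate sequences, and their probabilities sum to 1. The number of candidates grows as the tag count raised to the sentence length, so this direct enumeration is only suitable for illustration.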
Further, in step Z1 the weights of the feature functions are obtained by maximum log-likelihood estimation, specifically as follows:
The training corpus is expressed as $D = \{(O^{(i)}, S^{(i)})\}_{i=1}^{N}$, where the character sequence $O^{(i)}$ corresponding to the Chinese text is the input data sequence and the lexical information tag sequence $S^{(i)}$ of that character sequence is the corresponding output data sequence. Under the training corpus $D$, the log-likelihood is:
$$L_\Lambda = \sum_{i=1}^{N} \log P_\Lambda(S^{(i)} \mid O^{(i)}) \tag{4}$$
Substituting formula (1) into formula (4) gives:
$$L_\Lambda = \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \log Z_{O^{(i)}} \right) \tag{5}$$
A Gaussian prior is used to regularize the weights, after which formula (5) becomes:
$$L_\Lambda = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \log Z_{O^{(i)}} - \sum_{k} \frac{\lambda_k^2}{2\sigma^2} \tag{6}$$
where $\sum_{k} \lambda_k^2 / (2\sigma^2)$ is the Gaussian prior term on the weights and $\sigma^2$ is the prior variance. Taking the first derivative of formula (6) with respect to $\lambda_k$ gives formula (7):
$$\frac{\partial L_\Lambda}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s, s'} f_k(s, s', o^{(i)}, t)\, p(s, s' \mid o^{(i)}) - \frac{\lambda_k}{\sigma^2} \tag{7}$$
where $\sum_{i=1}^{N} \sum_{t=1}^{T} f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t)$ is the expected value of the feature function $f_k$ under the empirical distribution, and $\sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s, s'} f_k(s, s', o^{(i)}, t)\, p(s, s' \mid o^{(i)})$ is the expected value of $f_k$ under the weight vector $\Lambda$. The optimal value of the weight $\lambda_k$ is calculated from formula (7) and used as the weight of the feature function.
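Formulas (6) and (7) can be checked numerically on a toy corpus. The sketch below computes the regularized log-likelihood and its gradient by enumerating every tag sequence, so the model expectation is obtained as $\sum_S P_\Lambda(S \mid O) \sum_t f_k$, which equals the pairwise form in formula (7). The function names are hypothetical, and enumeration stands in for the forward-backward computation a practical trainer would use; the source states neither.

```python
import math
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple

Feature = Callable[[Optional[str], str, Sequence[str], int], int]
Example = Tuple[Sequence[str], Sequence[str]]  # (character sequence O, gold tag sequence S)


def feature_counts(S: Sequence[str], O: Sequence[str], features: List[Feature]) -> List[float]:
    """For each k, the sum over t of f_k(s_{t-1}, s_t, o, t)."""
    counts = [0.0] * len(features)
    for t in range(len(O)):
        prev = S[t - 1] if t > 0 else None
        for k, f in enumerate(features):
            counts[k] += f(prev, S[t], O, t)
    return counts


def objective_and_gradient(data: List[Example], tags: Sequence[str], features: List[Feature],
                           weights: List[float], sigma2: float) -> Tuple[float, List[float]]:
    """Value of formula (6) and the gradient of formula (7) for every weight."""
    K = len(features)
    value = -sum(w * w for w in weights) / (2.0 * sigma2)  # Gaussian prior term
    grad = [-w / sigma2 for w in weights]                  # its derivative
    for O, S_gold in data:
        gold = feature_counts(S_gold, O, features)
        cands = [feature_counts(S, O, features) for S in product(tags, repeat=len(O))]
        scores = [sum(w * c for w, c in zip(weights, cnt)) for cnt in cands]
        log_Z = math.log(sum(math.exp(v) for v in scores))
        value += sum(w * c for w, c in zip(weights, gold)) - log_Z
        for k in range(K):
            grad[k] += gold[k]                             # empirical expectation
        for cnt, v in zip(cands, scores):
            p = math.exp(v - log_Z)                        # P(S | O) under the current weights
            for k in range(K):
                grad[k] -= p * cnt[k]                      # model expectation
    return value, grad
```

Setting formula (7) to zero has no closed-form solution in general, so the weights are usually found iteratively, for example by repeatedly moving `weights` a small step along `grad` or by passing both return values to a quasi-Newton optimizer.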
The word-position information comprises B (word-initial), M (word-medial), E (word-final) and S (single-character word). The named entity information is a person name, place name or organization name, denoted by PER, LOC and ORG respectively. The part-of-speech information is noun, verb, adjective, preposition, and so on.
The present invention is illustrated below by an example.
For the Chinese sentence S1 "中国政府顺利恢复对香港行使主权，" ("The Chinese Government smoothly resumes the exercise of sovereignty over Hong Kong,"), Chinese lexical analysis is performed with the method of the present invention based on three-in-one lexical information tagging. Sentence S1 can be regarded as the character sequence $C_1$(中) $C_2$(国) ... $C_n$(，), and this character sequence may have many corresponding lexical information tag sequences. For example, one possible tag sequence S11 is "中/B_ORG 国/M_ORG 政/M_ORG 府/E_ORG 顺/B_ad 利/E_ad 恢/B_v 复/E_v 对/S_p 香/B_LOC 港/E_LOC 行/B_v 使/E_v 主/B_n 权/E_n ，/S_wd". Another possible tag sequence S12 is "中/B_LOC 国/E_LOC 政/B_n 府/E_n 顺/B_ad 利/E_ad 恢/B_v 复/E_v 对/S_p 香/B_LOC 港/E_LOC 行/B_v 使/E_v 主/B_n 权/E_n ，/S_wd".
In the lexical information tag sequence S11, the characters of the sentence "中国政府顺利恢复对香港行使主权，" correspond respectively to the tags "B_ORG M_ORG M_ORG E_ORG B_ad E_ad B_v E_v S_p B_LOC E_LOC B_v E_v B_n E_n S_wd". In this tag sequence, "中国政府" (Chinese Government) is tagged as one named entity (organization name). In the tag sequence S12, "中国政府" is tagged "中/B_LOC 国/E_LOC 政/B_n 府/E_n", that is, as a named entity (place name) followed by a noun.
In the present invention, for each sentence to be processed, the conditional probabilities of the candidate lexical information tag sequences, here S11 and S12, are obtained according to formula (1).
After the conditional probabilities of all possible lexical information tag sequences have been calculated, the tag sequence with the highest conditional probability is taken as the final lexical information tag sequence of the processed sentence. For the sentence S1 above, the determined tag sequence is S11 "中/B_ORG 国/M_ORG 政/M_ORG 府/E_ORG 顺/B_ad 利/E_ad 恢/B_v 复/E_v 对/S_p 香/B_LOC 港/E_LOC 行/B_v 使/E_v 主/B_n 权/E_n ，/S_wd". Chinese word segmentation is then performed according to the word-position information to obtain the segmentation result, and the part-of-speech tagging and named entity recognition results are obtained from the part-of-speech information or named entity information. For a multi-character word, the part of speech or named entity in the lexical information tag of the word-final character is taken as the part of speech or named entity of the whole word. Combining these results gives the corresponding lexical analysis result. The Chinese lexical analysis result of sentence S1 is "中国政府/ORG 顺利/ad 恢复/v 对/p 香港/LOC 行使/v 主权/n ，/wd".
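The conversion from a final tag sequence to the lexical analysis result (step Z5) can be sketched as follows, using tag sequence S11 of the example sentence 中国政府顺利恢复对香港行使主权， as input. The function names are hypothetical, and the handling of an ill-formed sequence that ends on a B or M tag is an assumption, since the source does not discuss that case.

```python
from typing import List, Tuple


def tags_to_words(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge per-character 'position_label' tags into (word, label) pairs.

    A word is closed at an E or S tag, and the label of that word-final
    character becomes the label of the whole word.
    """
    words: List[Tuple[str, str]] = []
    buffer: List[str] = []
    label = ""
    for ch, tag in zip(chars, tags):
        position, _, label = tag.partition("_")
        buffer.append(ch)
        if position in ("E", "S"):
            words.append(("".join(buffer), label))
            buffer = []
    if buffer:
        # Assumed fallback: a sequence ending on B or M is emitted with the last label seen.
        words.append(("".join(buffer), label))
    return words


def format_result(words: List[Tuple[str, str]]) -> str:
    """Render (word, label) pairs in the 'word/label' form used in the example."""
    return " ".join(f"{word}/{label}" for word, label in words)


chars = list("中国政府顺利恢复对香港行使主权，")
tags = ("B_ORG M_ORG M_ORG E_ORG B_ad E_ad B_v E_v S_p "
        "B_LOC E_LOC B_v E_v B_n E_n S_wd").split()
print(format_result(tags_to_words(chars, tags)))
# 中国政府/ORG 顺利/ad 恢复/v 对/p 香港/LOC 行使/v 主权/n ，/wd
```

Because the label is read only from the word-final character, the labels on the B and M characters of a word do not affect the output; this matches the rule stated for multi-character words.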
When the Chinese text to be processed consists of multiple Chinese sentences, each sentence is processed in turn as above to obtain its lexical analysis result, which yields the Chinese lexical analysis result of the whole text.
The present invention has the following advantages:
1. The three subtasks of Chinese lexical analysis (Chinese word segmentation, part-of-speech tagging and named entity recognition) are unified and realized in a character-sequence tagging framework. The tag of each character carries three kinds of lexical information: word position, part of speech and named entity. Chinese lexical analysis based on this three-in-one lexical information tagging therefore overcomes the upward propagation, amplification and accumulation of errors and the difficulty of integrating multiple kinds of information. The calculation is simple, the computational cost is low, and the precision of Chinese word segmentation, part-of-speech tagging and named entity recognition can be significantly improved;
2. Dictionaries are dispensed with entirely, achieving truly dictionary-free Chinese lexical analysis.
3. Out-of-vocabulary words are also segmented and tagged well, especially the three classes of named entities: person names, place names and organization names.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims (6)

1. A Chinese lexical analysis method, characterized by comprising the following steps:
1) Obtaining feature functions and weights from a given training corpus:
A sample window size is set and a feature template set is selected. According to the set sample window size, contextual features are expanded from the given training corpus by means of the feature template set. Each feature corresponds to a group of feature functions, so multiple groups of contextual features correspond to multiple groups of feature functions. The weights of these feature functions are then estimated, and the weights together form a weight vector;
2) Segmenting the input Chinese text: the input Chinese text is segmented into multiple sentences, each sentence being a character sequence;
3) Calculating the conditional probability of every possible lexical information tag sequence of the character sequences corresponding to the input Chinese text:
All possible lexical information tag sequences are obtained for each character sequence corresponding to the input Chinese text, and the conditional probability of each lexical information tag sequence is calculated. A lexical information tag sequence is the sequence formed by the lexical information tags of all characters in a character sequence, and a lexical information tag carries three kinds of information: word-position information, part-of-speech information and named entity information;
4) Determining the final lexical information tag sequence of the character sequences corresponding to the input Chinese text:
The lexical information tag sequence with the highest conditional probability is taken as the final lexical information tag sequence of the character sequence corresponding to the input Chinese text;
5) Performing Chinese word segmentation, part-of-speech tagging and Chinese named entity recognition, thereby obtaining the final Chinese lexical analysis result:
Chinese word segmentation is performed according to the "word-position information" in the final lexical information tag sequence, giving the segmentation result;
Part-of-speech tagging is performed according to the "part-of-speech information" in the final lexical information tag sequence, giving the part-of-speech tagging result; or Chinese named entity recognition is performed according to the "named entity information" in the final lexical information tag sequence, giving the named entity recognition result;
For a multi-character word, the part-of-speech information or named entity information in the lexical information tag of the word-final character is taken as the part of speech or named entity of the whole word.
2. The Chinese lexical analysis method according to claim 1, characterized in that the conditional probability of each lexical information tag sequence in step 3) is calculated as follows:
Let the character sequence corresponding to the input Chinese text be $O = \{o_1, o_2, \ldots, o_T\}$ and a lexical information tag sequence of that character sequence be $S = \{s_1, s_2, \ldots, s_T\}$, where $s_1, s_2, \ldots, s_T$ correspond one-to-one with $o_1, o_2, \ldots, o_T$, and let the weight vector be $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_K\}$. The conditional probability of the lexical information tag sequence is:
$$P_\Lambda(S \mid O) = \frac{1}{Z_O} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(s_{t-1}, s_t, o, t) \right) \tag{1}$$
where $Z_O$ is the normalization factor, $f_k(s_{t-1}, s_t, o, t)$ is a feature function, which is a binary indicator function, and $\lambda_k$ is the weight of the feature function $f_k(s_{t-1}, s_t, o, t)$.
3. Chinese lexical analysis method according to claim 2, is characterized in that, described step 1) in by the logarithm maximum likelihood estimate to ask for described many stack features function weight, concrete steps are:
Said training corpus is expressed as $D = \{(O^{(i)}, S^{(i)})\}_{i=1}^{N}$, where the word sequence $O^{(i)}$ corresponding to the input Chinese text serves as the input data sequence and the morphological information flag sequence $S^{(i)}$ of said word sequence is the corresponding output data sequence. Under said training corpus $D$, the log-likelihood is:
$$L_\Lambda = \sum_{i=1}^{N} \log P_\Lambda(S^{(i)} \mid O^{(i)}) \tag{4}$$
Substituting formula (1) into formula (4) gives:
$$L_\Lambda = \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \log Z_{O^{(i)}} \right) \tag{5}$$
A Gaussian prior is applied to adjust the weights; after the adjustment, formula (5) becomes:
$$L_\Lambda = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \log Z_{O^{(i)}} - \sum_{k} \frac{\lambda_k^2}{2\sigma^2} \tag{6}$$
where $\sum_{k} \frac{\lambda_k^2}{2\sigma^2}$ is the Gaussian prior term used to adjust the weights and $\sigma^2$ denotes the prior variance. Taking the first derivative of formula (6) gives formula (7):
$$\frac{\partial L_\Lambda}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s, s'} f_k(s, s', o^{(i)}, t)\, p(s, s' \mid o^{(i)}) - \frac{\lambda_k}{\sigma^2} \tag{7}$$
where $\sum_{i=1}^{N} \sum_{t=1}^{T} f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t)$ is the expected value of feature function $f_k$ under the empirical distribution, and $\sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s, s'} f_k(s, s', o^{(i)}, t)\, p(s, s' \mid o^{(i)})$ is the expected value of feature function $f_k$ under the weight vector $\Lambda$. The optimal value of weight $\lambda_k$ is calculated according to formula (7) and used as the weight of the feature function.
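The training procedure of formulas (6) and (7) can be sketched in Python as follows. This is a toy illustration under stated assumptions: the two feature functions, the tag set, the start symbol `<start>`, the two-sentence corpus, the prior variance and the plain gradient-ascent loop with step size 0.5 are all hypothetical. The expected feature count is obtained by enumerating whole tag sequences, which by linearity equals the sum over positions and tag pairs written in formula (7); the claim itself does not name a particular optimizer.

```python
import itertools
import math

TAGS = ["B", "E", "S"]  # toy tag set


def features(prev_tag, tag, obs, t):
    """Hypothetical binary feature functions f_k(s_{t-1}, s_t, o, t)."""
    return [
        1.0 if (prev_tag == "B" and tag == "E") else 0.0,
        1.0 if (tag == "S" and obs[t] == "我") else 0.0,
    ]


def feature_counts(tags, obs):
    """Sum over t of f_k(s_{t-1}, s_t, o, t), for every k."""
    counts = [0.0, 0.0]
    prev = "<start>"  # assumed start symbol
    for t, tag in enumerate(tags):
        for k, f in enumerate(features(prev, tag, obs, t)):
            counts[k] += f
        prev = tag
    return counts


def all_sequences(obs, weights):
    """Yield (feature counts, unnormalized score) for every tag sequence."""
    for cand in itertools.product(TAGS, repeat=len(obs)):
        counts = feature_counts(cand, obs)
        yield counts, math.exp(sum(w * c for w, c in zip(weights, counts)))


def objective(corpus, weights, sigma2):
    """Formula (6): log-likelihood minus the Gaussian prior term."""
    total = 0.0
    for obs, tags in corpus:
        counts = feature_counts(tags, obs)
        z = sum(s for _, s in all_sequences(obs, weights))
        total += sum(w * c for w, c in zip(weights, counts)) - math.log(z)
    return total - sum(w * w for w in weights) / (2.0 * sigma2)


def gradient(corpus, weights, sigma2):
    """Formula (7): empirical count - expected count - lambda_k / sigma^2."""
    grad = [-w / sigma2 for w in weights]
    for obs, tags in corpus:
        scored = list(all_sequences(obs, weights))
        z = sum(s for _, s in scored)
        empirical = feature_counts(tags, obs)
        for k in range(len(weights)):
            expected = sum(c[k] * s for c, s in scored) / z
            grad[k] += empirical[k] - expected
    return grad


corpus = [(list("我北京"), ("S", "B", "E")), (list("北京"), ("B", "E"))]
sigma2 = 10.0
weights = [0.0, 0.0]
for _ in range(300):  # plain gradient ascent on the concave objective
    g = gradient(corpus, weights, sigma2)
    weights = [w + 0.5 * gk for w, gk in zip(weights, g)]
print(weights, gradient(corpus, weights, sigma2))
```

At the optimum the gradient vanishes, so for each feature the empirical count equals the expected count plus the prior term $\lambda_k / \sigma^2$.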
4. The Chinese lexical analysis method according to any one of claims 1-3, characterized in that in said step 1) the sample window is set to a "5-character window" and the feature template set is chosen as "TMPT-10".
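As a rough sketch of what feature extraction over a five-character window could look like, the Python snippet below expands ten templates at one position of a sentence. The ten templates listed, the padding symbol `<pad>` and the function name `window_features` are assumptions for illustration only; this claim does not restate the contents of "TMPT-10", so the list should not be read as the actual template set.

```python
PAD = "<pad>"  # assumed padding symbol for positions outside the sentence

# Ten illustrative templates over relative offsets inside a 5-character window.
TEMPLATES = [
    (-2,), (-1,), (0,), (1,), (2,),      # single characters
    (-2, -1), (-1, 0), (0, 1), (1, 2),   # adjacent character pairs
    (-1, 1),                             # pair skipping the current character
]


def window_features(chars, t):
    """Return one feature string per template for position t."""
    def at(offset):
        i = t + offset
        return chars[i] if 0 <= i < len(chars) else PAD

    feats = []
    for template in TEMPLATES:
        name = "/".join(f"C{o:+d}" for o in template)
        value = "/".join(at(o) for o in template)
        feats.append(f"{name}={value}")
    return feats


feats = window_features(list("我爱北京"), 2)
for line in feats:
    print(line)
```

Each returned string would typically be paired with candidate tags to form the binary feature functions used in training.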
5. The Chinese lexical analysis method according to any one of claims 1-3, characterized in that said lexeme information comprises B (prefix), M (word middle), E (suffix) and S (single-character word); said named entity information is a person name, place name or organization name.
6. The Chinese lexical analysis method according to claim 4, characterized in that said lexeme information comprises B (prefix), M (word middle), E (suffix) and S (single-character word); said named entity information is a person name, place name or organization name.
CN201310421538.5A 2013-09-16 Chinese lexical analysis method Active CN103473221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310421538.5A CN103473221B (en) 2013-09-16 Chinese lexical analysis method

Publications (2)

Publication Number Publication Date
CN103473221A true CN103473221A (en) 2013-12-25
CN103473221B CN103473221B (en) 2016-11-30

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295295A (en) * 2008-06-13 2008-10-29 中国科学院计算技术研究所 Chinese language lexical analysis method based on linear model
CN101950284A (en) * 2010-09-27 2011-01-19 北京新媒传信科技有限公司 Chinese word segmentation method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
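Yu Jiangde, Wang Xijie, Fan Xiaozhong: "Which matters more, the preceding or the following context, in character-tagging Chinese lexical analysis", "Computer Science" *
Yu Jiangde, Sui Dan, Fan Xiaozhong: "Character-based word-position tagging for Chinese word segmentation", "Journal of Shandong University (Engineering Science)" *
Shi Shumin et al.: "Chinese named entity recognition based on conditional random fields", "The 3rd Student Workshop on Computational Linguistics" *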

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038163A (en) * 2016-02-03 2017-08-11 常州普适信息科技有限公司 A kind of text semantic modeling method towards magnanimity internet information
CN105930432A (en) * 2016-04-19 2016-09-07 北京百度网讯科技有限公司 Training method and apparatus for sequence labeling tool
CN105930432B (en) * 2016-04-19 2020-01-07 北京百度网讯科技有限公司 Training method and device for sequence labeling tool
CN108763225A (en) * 2016-06-28 2018-11-06 大连民族大学 The interpretation method of the multi-lingual machine translation subsystem of attribute information
CN107608973A (en) * 2016-07-12 2018-01-19 华为技术有限公司 A kind of interpretation method and device based on neutral net
CN107329974A (en) * 2017-05-26 2017-11-07 福建师范大学 Data extraction method and its system for HLS optimizations
CN107832781A (en) * 2017-10-18 2018-03-23 扬州大学 A kind of software defect towards multi-source data represents learning method
CN107832781B (en) * 2017-10-18 2021-09-14 扬州大学 Multi-source data-oriented software defect representation learning method
CN108595431B (en) * 2018-04-28 2020-09-25 海信集团有限公司 Voice interaction text error correction method, device, terminal and storage medium
CN108595431A (en) * 2018-04-28 2018-09-28 海信集团有限公司 Interactive voice text error correction method, device, terminal and storage medium
CN108959242A (en) * 2018-05-08 2018-12-07 中国科学院信息工程研究所 A kind of target entity recognition methods and device based on Chinese character part of speech feature
CN110597994A (en) * 2019-09-17 2019-12-20 北京百度网讯科技有限公司 Event element identification method and device
CN110807316A (en) * 2019-10-30 2020-02-18 安阳师范学院 Chinese word selecting and blank filling method
CN110807316B (en) * 2019-10-30 2023-08-15 安阳师范学院 Chinese word selecting and filling method
CN112364623A (en) * 2020-11-02 2021-02-12 安阳师范学院 Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN113095082A (en) * 2021-04-15 2021-07-09 湖南四方天箭信息科技有限公司 Method, device, computer device and computer readable storage medium for text processing based on multitask model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20170207

Address after: School of computer and Information Engineering Anyang Normal University No. 436 Anyang City, Henan province the 455000 Avenue

Patentee after: Anyang Normal University

Address before: School of computer and Information Engineering Anyang Normal University No. 436 Anyang City, Henan province the 455000 Avenue

Patentee before: Yu Jiangde

Patentee before: Liu Yuntong

Patentee before: Wang Xijie