CN103473221A - Chinese lexical analysis method

Info

Publication number: CN103473221A
Application number: CN2013104215385A
Authority: CN
Inventors: 于江德; 刘运通; 王希杰; 胡顺义; 郑霞; 葛彦强; 王继鹏
Original assignee: Individual
Current assignee: Anyang Normal University
Priority date: 2013-09-16
Filing date: 2013-09-16
Publication date: 2013-12-25
Anticipated expiration: 2033-09-16

Abstract

The invention discloses a Chinese lexical analysis method comprising the following steps: (1) obtaining feature functions and weights from a given training corpus; (2) segmenting the input Chinese text into a plurality of sentences, each sentence being one character sequence; (3) calculating the conditional probability of every possible lexical information tag sequence of the character sequence corresponding to the input Chinese text; (4) determining the final lexical information tag sequence of that character sequence; (5) performing Chinese word segmentation, Chinese part-of-speech (POS) tagging and Chinese named entity recognition to obtain the final Chinese lexical analysis result. The method unifies the three subtasks of Chinese lexical analysis in a single character sequence tagging framework, which overcomes the drawbacks that errors are propagated upward, amplified and accumulated and that multiple classes of information are difficult to integrate and use. The computation is simple and the amount of calculation is small; no dictionary is needed at all, and out-of-vocabulary words can also be segmented and tagged well.

Description

Chinese lexical analysis method

Technical field

The present invention relates to a Chinese lexical analysis method.

Background art

In the field of Chinese information processing, Chinese lexical analysis is an important basic research problem. It is not only the basis of deeper Chinese information processing such as syntactic analysis, semantic analysis and text understanding, but also a key link in applications such as machine translation, question answering, information retrieval and information extraction. Chinese lexical analysis mainly comprises three subtasks: Chinese word segmentation, part-of-speech tagging and named entity recognition. In related evaluations at home and abroad, they are usually evaluated as three independent tasks. In existing research, most scholars are also accustomed to treating the three subtasks independently, and in particular to processing word segmentation and part-of-speech tagging in succession, considering part-of-speech tagging only on the basis of the word sequence obtained after segmentation. Processing the three subtasks independently in this way easily causes errors to be propagated upward, amplified and accumulated, and makes multiple classes of information difficult to integrate and use.

To address this problem, some scholars have explored integrating the three tasks of word segmentation, part-of-speech tagging and named entity recognition. Document [1] (Liu Qun, Zhang Huaping, Yu Hongkui, et al. Chinese lexical analysis based on cascaded hidden Markov model. Journal of Computer Research and Development, 2004, 41(8): 1421-1429) discloses a Chinese lexical analysis method based on a cascaded hidden Markov model. That method integrates Chinese word segmentation, part-of-speech tagging and out-of-vocabulary word recognition into one complete theoretical framework, but it still requires the support of a dictionary, and its part-of-speech tagging is still carried out on the basis of a word sequence. Patent document [2] (Chinese patent application publication No. CN101295295A, entitled "Chinese lexical analysis method based on linear model", filed by the Institute of Computing Technology, Chinese Academy of Sciences on June 13, 2008 and published on October 29, 2008) discloses a Chinese lexical analysis method based on a linear model. That method uses a perceptron model to analyze a sentence character by character and derive the segmentation tag and part-of-speech tag of the current character, thereby performing lexical analysis of Chinese sentences. It is computationally complex, requires a large amount of calculation, and does not cover named entity recognition.

In view of this, the present invention is proposed.

Summary of the invention

The object of the present invention is to provide a Chinese lexical analysis method that unifies the three tasks of Chinese lexical analysis in a character sequence tagging framework, dispenses with dictionaries entirely, and includes named entity recognition.

To solve the above technical problem, the basic concept of the technical solution adopted by the present invention is as follows:

A Chinese lexical analysis method comprises the following steps:

1) Obtain feature functions and weights from a given training corpus:

Set the sample window size and select a feature template set. According to the set sample window size, expand context features from the given training corpus using the feature template set. Each feature corresponds to one group of feature functions, so multiple groups of context features correspond to multiple groups of feature functions. Obtain the weights of these groups of feature functions; the weights together form a weight vector;

2) Segment the input Chinese text: segment the input Chinese text into a plurality of sentences, each sentence being one character sequence;

3) Calculate the conditional probability of every possible lexical information tag sequence of the character sequence corresponding to the input Chinese text:

Obtain all possible lexical information tag sequences of each character sequence corresponding to the input Chinese text, and calculate the conditional probability of each lexical information tag sequence. A lexical information tag sequence is the sequence formed by the lexical information tags of all characters in one character sequence; a lexical information tag comprises three classes of information: word-position information, part-of-speech information and named entity information;

4) Determine the final lexical information tag sequence of the character sequence corresponding to the input Chinese text:

The lexical information tag sequence with the highest conditional probability is taken as the final lexical information tag sequence of the character sequence corresponding to the input Chinese text;

5) Perform Chinese word segmentation, Chinese part-of-speech tagging and Chinese named entity recognition to obtain the final Chinese lexical analysis result:

Perform Chinese word segmentation according to the word-position information in the final lexical information tag sequence to obtain the word segmentation result;

Perform Chinese part-of-speech tagging according to the part-of-speech information in the final lexical information tag sequence to obtain the part-of-speech tagging result; or perform Chinese named entity recognition according to the named entity information in the final lexical information tag sequence to obtain the named entity recognition result;

For a multi-character word, the part-of-speech information or named entity information in the lexical information tag of the word-final character is taken as the part of speech or named entity of the whole word.

Further, in step 3), the conditional probability of each lexical information tag sequence is calculated as follows:

Let the character sequence corresponding to the input Chinese text be O = {o_{1}, o_{2}, ..., o_{T}}, and let a lexical information tag sequence of that character sequence be S = {s_{1}, s_{2}, ..., s_{T}}, where s_{1}, s_{2}, ..., s_{T} correspond one-to-one to o_{1}, o_{2}, ..., o_{T}. With the weight vector Λ = {λ_{1}, λ_{2}, ..., λ_{K}}, the conditional probability of the lexical information tag sequence is:

P_{\Lambda}(S \mid O) = \frac{1}{Z_{O}} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_{k} f_{k}(s_{t-1}, s_{t}, o, t) \right) \qquad (1)

Here Z_{O} is the normalization factor; f_{k}(s_{t-1}, s_{t}, o, t) is a feature function, which is a binary-valued indicator function; and λ_{k} is the weight of the feature function f_{k}(s_{t-1}, s_{t}, o, t).
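As an illustration of formula (1), the following minimal Python sketch computes the conditional probability of a tag sequence for a very short character sequence by enumerating every candidate tag sequence to obtain Z_{O}. The tag set, the three feature functions and their weights are hypothetical values chosen for illustration; they are not taken from the invention, and exhaustive enumeration is only practical for toy inputs.

```python
import itertools
import math

# Hypothetical toy tag set; the method itself uses composite tags such as B_ORG or S_p.
TAGS = ["B_n", "E_n", "S_p"]

def char_tag_feature(ch, tag):
    """Binary feature: fires when character o[t] is `ch` and the current tag is `tag`."""
    return lambda s_prev, s_cur, o, t: 1.0 if o[t] == ch and s_cur == tag else 0.0

def transition_feature(prev_tag, cur_tag):
    """Binary feature: fires on the tag transition prev_tag -> cur_tag."""
    return lambda s_prev, s_cur, o, t: 1.0 if (s_prev, s_cur) == (prev_tag, cur_tag) else 0.0

# Illustrative feature functions f_k and weights lambda_k (not learned from any corpus).
FEATURES = [char_tag_feature("主", "B_n"), char_tag_feature("权", "E_n"),
            transition_feature("B_n", "E_n")]
WEIGHTS = [1.5, 1.0, 2.0]

def sequence_score(tags, chars):
    """Inner double sum of formula (1): sum over t and k of lambda_k * f_k."""
    total = 0.0
    for t in range(len(chars)):
        s_prev = tags[t - 1] if t > 0 else None  # no previous tag at the first position
        for lam, f in zip(WEIGHTS, FEATURES):
            total += lam * f(s_prev, tags[t], chars, t)
    return total

def conditional_probability(tags, chars):
    """P(S | O), with Z_O computed by enumerating every candidate tag sequence."""
    z = sum(math.exp(sequence_score(cand, chars))
            for cand in itertools.product(TAGS, repeat=len(chars)))
    return math.exp(sequence_score(tags, chars)) / z

chars = list("主权")
probs = {cand: conditional_probability(cand, chars)
         for cand in itertools.product(TAGS, repeat=len(chars))}
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Because every candidate shares the same Z_{O}, the probabilities of all candidates sum to one, and the candidate with the highest score is also the one with the highest conditional probability.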

Further, in step 1), the weights of the groups of feature functions are obtained by log maximum likelihood estimation. The specific steps are:

The training corpus is expressed as D = \{ (O^{(i)}, S^{(i)}) \}_{i=1}^{N}, where the character sequence O^{(i)} is the input data sequence and the lexical information tag sequence S^{(i)} of that character sequence is the corresponding output data sequence. Under the training corpus D, the log-likelihood is:

L_{\Lambda} = \sum_{i=1}^{N} \log P_{\Lambda}(S^{(i)} \mid O^{(i)}) \qquad (4)

Substituting formula (1) into formula (4) gives:

L_{\Lambda} = \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_{k} f_{k}(s_{t-1}^{(i)}, s_{t}^{(i)}, o^{(i)}, t) - \log Z_{O^{(i)}} \right) \qquad (5)

A Gaussian prior is used to adjust the weights; after adjustment, formula (5) becomes:

L_{\Lambda} = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_{k} f_{k}(s_{t-1}^{(i)}, s_{t}^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \log Z_{O^{(i)}} - \sum_{k} \frac{\lambda_{k}^{2}}{2\sigma^{2}} \qquad (6)

where \sum_{k} \frac{\lambda_{k}^{2}}{2\sigma^{2}} is the Gaussian prior term used to adjust the weights and σ^{2} denotes the prior variance. Taking the first derivative of formula (6) gives formula (7):

\frac{\partial L_{\Lambda}}{\partial \lambda_{k}} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_{k}(s_{t-1}^{(i)}, s_{t}^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s, s'} f_{k}(s, s', o^{(i)}, t) \, p(s, s' \mid o^{(i)}) - \frac{\lambda_{k}}{\sigma^{2}} \qquad (7)

where the first term, \sum_{i=1}^{N} \sum_{t=1}^{T} f_{k}(s_{t-1}^{(i)}, s_{t}^{(i)}, o^{(i)}, t), is the expected value of the feature function f_{k} under the empirical distribution;

and the second term, \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s, s'} f_{k}(s, s', o^{(i)}, t) \, p(s, s' \mid o^{(i)}), is the expected value of the feature function f_{k} under the weight vector Λ. The optimal value of the weight λ_{k} is calculated according to formula (7) and used as the weight of the feature function.
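The relation between formulas (6) and (7) can be checked numerically on a toy problem. The sketch below is an assumption-laden illustration: the tag set, the three feature functions and the prior variance are hypothetical, and the model expectation in formula (7) is obtained by enumerating whole tag sequences rather than by the pairwise marginals p(s, s' | o), which gives the same value for small inputs.

```python
import itertools
import math

TAGS = ["B", "E", "S"]  # hypothetical, position-only toy tag set

def feats(s_prev, s_cur, o, t):
    """Hypothetical binary feature vector [f_1, f_2, f_3] at position t."""
    return [
        1.0 if o[t] == "主" and s_cur == "B" else 0.0,
        1.0 if o[t] == "权" and s_cur == "E" else 0.0,
        1.0 if s_prev == "B" and s_cur == "E" else 0.0,
    ]

def counts(tags, chars):
    """Sum over t of f_k(s_{t-1}, s_t, o, t) for every k."""
    total = [0.0, 0.0, 0.0]
    for t in range(len(chars)):
        prev = tags[t - 1] if t > 0 else None
        for k, value in enumerate(feats(prev, tags[t], chars, t)):
            total[k] += value
    return total

def objective_and_gradient(weights, corpus, sigma2=10.0):
    """Return the regularized log-likelihood (6) and its gradient (7)."""
    obj = -sum(w * w for w in weights) / (2.0 * sigma2)   # Gaussian prior term
    grad = [-w / sigma2 for w in weights]
    for chars, gold in corpus:
        cands = list(itertools.product(TAGS, repeat=len(chars)))
        cand_counts = [counts(c, chars) for c in cands]
        scores = [sum(w * c for w, c in zip(weights, cc)) for cc in cand_counts]
        log_z = math.log(sum(math.exp(s) for s in scores))
        gold_counts = counts(gold, chars)
        obj += sum(w * c for w, c in zip(weights, gold_counts)) - log_z
        for k in range(len(weights)):
            # Model expectation of f_k: sum over candidates of P(S | O) * count_k(S).
            expected = sum(math.exp(s - log_z) * cc[k]
                           for s, cc in zip(scores, cand_counts))
            grad[k] += gold_counts[k] - expected          # empirical minus model
    return obj, grad

corpus = [(list("主权"), ("B", "E")), (list("对"), ("S",))]
weights = [0.3, -0.2, 0.5]
objective, gradient = objective_and_gradient(weights, corpus)
print(objective, gradient)
```

The gradient is zero exactly where the empirical expectation equals the model expectation plus the prior term; in practice an iterative optimizer would follow this gradient, which is a design choice the source does not specify.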

Preferably, the sample window in step 1) is set to a "5-character sample window", and the feature template set is chosen as "TMPT-10".

The word-position information comprises B (word-initial), M (word-medial), E (word-final) and S (single-character word); the named entity information is a person name, place name or organization name.

The beneficial effects of the present invention are: 1. The three subtasks of Chinese lexical analysis, namely Chinese word segmentation, part-of-speech tagging and named entity recognition, are unified and realized in a character sequence tagging framework. The tag of each character contains three classes of lexical information: word position, part of speech and named entity. Because the lexical analysis is based on this three-in-one lexical information tagging, it overcomes the drawbacks that errors are propagated upward, amplified and accumulated and that multiple classes of information are difficult to integrate and use. The computation is simple, the amount of calculation is small, and the precision of Chinese word segmentation, part-of-speech tagging and named entity recognition can be significantly improved;

2. Dictionaries are dispensed with entirely, realizing truly dictionary-free Chinese lexical analysis;

3. Out-of-vocabulary words can also be segmented and tagged well, especially the three classes of named entities: person names, place names and organization names.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is a schematic diagram of the "5-character sample window" in a specific embodiment of the present invention.

Embodiment

To enable those skilled in the art to better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.

With reference to Fig. 1, the present invention is a Chinese lexical analysis method comprising the following steps:

Z1. Obtain feature functions and weights from a given training corpus:

The key to feature selection is to choose context features suited to the specific task. This comprises choosing the context and setting the feature template set, that is, setting the sample window size and selecting the feature template set;

In general, the context is chosen within a certain range to the left and right of the current character; this fixed range is called a "window". The context within a window is in essence a specific sample, so the window is called a "sample window". The sample window may be limited to a "5-character window", which uses the two characters before and the two characters after the current character as context, or to a "3-character window", which uses one character before and one character after the current character as context. The present invention adopts the "5-character window" as the sample window; see Fig. 2.

A feature template set is a set of feature templates. The main function of a feature template is to define the association between the linguistic elements or information at certain specific positions in the context and a certain class of event to be predicted. Because the present invention determines the lexical information tag of a character from the current character in a character sequence and its context, the context features are determined by the characters appearing before and after that character, combinations of characters, word-position information, part-of-speech information, named entity information, and the positions at which this information appears. Conventionally, a feature template can be regarded as an abstraction of a group of context features according to a common attribute. Each feature corresponds to one group of feature functions.

Under the "5-character sample window" shown in Fig. 2, context features can be abstracted according to the distance between the characters appearing in a feature template and the current character. With the sample window limited to a "5-character window", the context features of this task are the features formed by the current character itself, the two characters before and the two characters after it, and their lexical information tags. Under the "5-character sample window", the common context features are abstracted into 10 classes, namely: C_{-2}, C_{-1}, C_{0}, C_{1}, C_{2}, C_{-2}C_{-1}, C_{-1}C_{0}, C_{0}C_{1}, C_{1}C_{2}, C_{-1}C_{1}, T_{-1}T_{0}. The feature template set formed by these feature templates is denoted TMPT-10. In a template, C_{n} denotes the character at offset n from the current character. For example, C_{0} denotes the current character, C_{1} the character immediately after it, C_{-1} the character immediately before it, and so on. The last feature template, T_{-1}T_{0}, characterizes the state transition T_{i-1} → T_{i} between the lexical information tags of two adjacent characters in the context.
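The expansion of context features from the character templates can be sketched as follows. This is a minimal illustration under stated assumptions: it covers only the ten character templates listed above (the tag-transition template T_{-1}T_{0} operates on tags and is left out), and the padding token `<PAD>` for positions outside the sentence, the feature-string format and the function names are hypothetical.

```python
# Character offsets of the ten character templates under the 5-character window.
TEMPLATES = [(-2,), (-1,), (0,), (1,), (2,),
             (-2, -1), (-1, 0), (0, 1), (1, 2), (-1, 1)]

def char_at(chars, i, pad="<PAD>"):
    """Return the character at index i, or a padding token outside the sentence."""
    return chars[i] if 0 <= i < len(chars) else pad

def expand_features(chars, t):
    """Expand the context features of the character at position t."""
    features = []
    for offsets in TEMPLATES:
        name = "".join(f"C{d:+d}" for d in offsets)              # e.g. "C-1C+1"
        value = "/".join(char_at(chars, t + d) for d in offsets)
        features.append(f"{name}={value}")
    return features

chars = list("中国政府")
for feature in expand_features(chars, 0):
    print(feature)
```

Each resulting string, paired with a candidate tag, corresponds to one binary feature function in the sense of formula (1).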

Z2. Segment the input Chinese text: segment the input Chinese text into a plurality of sentences, each sentence being one character sequence C_{1}C_{2}...C_{n};

In theory, each character sequence can be tagged with many different lexical information tag sequences, but the tag sequence finally chosen as the lexical information tag sequence of the sentence is the most probable one. Step Z3 is therefore required;
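Step Z2 can be sketched as below. The source does not state which punctuation marks delimit a sentence; the delimiter set here, including the full-width comma that ends the example sentence S1, is an assumption, as are the function and constant names.

```python
import re

# Assumed sentence delimiters (full-width Chinese punctuation); not specified in the source.
SENTENCE_END = "。！？；，"

def split_sentences(text):
    """Split text into sentences and return each one as a list of characters."""
    pieces = re.findall(rf"[^{SENTENCE_END}]+[{SENTENCE_END}]?", text)
    return [list(piece.strip()) for piece in pieces if piece.strip()]

sentences = split_sentences("中国政府顺利恢复对香港行使主权，举国欢庆。")
for sentence in sentences:
    print("".join(sentence), len(sentence))
```

Each returned list is one character sequence C_{1}C_{2}...C_{n}, with the closing punctuation mark kept as its last character so that it can receive its own tag.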

Z3. Calculate the conditional probability of every possible lexical information tag sequence of the character sequence corresponding to the input Chinese text:

Obtain all possible lexical information tag sequences of each character sequence corresponding to the input Chinese text, and calculate the conditional probability of each lexical information tag sequence. A lexical information tag sequence is the sequence formed by the lexical information tags of all characters in one character sequence; a lexical information tag comprises three classes of information, namely word-position information, part-of-speech information and named entity information, and its format is "word position_part of speech or named entity";

Z4. Determine the final lexical information tag sequence of the character sequence corresponding to the input Chinese text:

The lexical information tag sequence with the highest conditional probability is taken as the final lexical information tag sequence of the character sequence corresponding to the input Chinese text. The final lexical information tag sequence can be expressed by formula (3):

S^{*} = \underset{S}{\arg\max} \, P_{\Lambda}(S \mid O) \qquad (3)

The meaning of P_{\Lambda}(S \mid O) is explained below.
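The method as described obtains S^{*} by scoring every candidate tag sequence. When each feature function depends only on adjacent tags, as in formula (1), the same maximizer can be found by Viterbi dynamic programming without enumerating all candidates. The sketch below uses that standard technique as a substitute for exhaustive search; it is not stated in the source, and the tag set and the scoring function `toy_score` are hypothetical.

```python
TAGS = ["B_LOC", "E_LOC", "S_p"]  # hypothetical toy tag set

def toy_score(s_prev, s_cur, chars, t):
    """Hypothetical local score: sum over k of lambda_k * f_k(s_prev, s_cur, o, t)."""
    score = 0.0
    if chars[t] == "对" and s_cur == "S_p":
        score += 1.5
    if chars[t] == "香" and s_cur == "B_LOC":
        score += 1.0
    if chars[t] == "港" and s_cur == "E_LOC":
        score += 1.0
    if s_prev == "B_LOC" and s_cur == "E_LOC":
        score += 2.0
    return score

def viterbi(chars, tags, local_score):
    """Return the tag sequence maximizing the summed local scores (same argmax as formula (3))."""
    if not chars:
        return []
    best = [{s: local_score(None, s, chars, 0) for s in tags}]  # best score ending in s
    back = []                                                   # back-pointers per position
    for t in range(1, len(chars)):
        cur, ptr = {}, {}
        for s in tags:
            prev = max(tags, key=lambda p: best[-1][p] + local_score(p, s, chars, t))
            cur[s] = best[-1][prev] + local_score(prev, s, chars, t)
            ptr[s] = prev
        best.append(cur)
        back.append(ptr)
    last = max(tags, key=lambda s: best[-1][s])
    path = [last]
    for ptr in reversed(back):          # follow back-pointers from the last position
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(list("对香港"), TAGS, toy_score))
```

The normalization factor Z_{O} is the same for every candidate, so maximizing the unnormalized score is equivalent to maximizing the conditional probability.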

Z5. Perform Chinese word segmentation, Chinese part-of-speech tagging and Chinese named entity recognition to obtain the final Chinese lexical analysis result:

Further, in step Z3, the conditional probability of each lexical information tag sequence is calculated as follows:

P_{\Lambda}(S \mid O) = \frac{1}{Z_{O}} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_{k} f_{k}(s_{t-1}, s_{t}, o, t) \right) \qquad (1)

Here Z_{O} is the normalization factor; f_{k}(s_{t-1}, s_{t}, o, t) is a feature function, which is a binary-valued indicator function; and λ_{k} is the weight of the feature function f_{k}(s_{t-1}, s_{t}, o, t), with a value range of -∞ to +∞. The feature function f_{k}(s_{t-1}, s_{t}, o, t) can incorporate any contextual lexical information feature, including the state transition feature s_{t-1} → s_{t} and all lexical information features of the character sequence O (here acting as the observation sequence) at time t (the position of the current character).

Further, in step Z1, the weights of the groups of feature functions are obtained by log maximum likelihood estimation. The specific steps are:

The training corpus is expressed as D = \{ (O^{(i)}, S^{(i)}) \}_{i=1}^{N}, where the character sequence O^{(i)} is the input data sequence and the lexical information tag sequence S^{(i)} is the corresponding output data sequence. Under the training corpus D, the log-likelihood is:

L_{\Lambda} = \sum_{i=1}^{N} \log P_{\Lambda}(S^{(i)} \mid O^{(i)}) \qquad (4)

Substituting formula (1) into formula (4) gives:

L_{\Lambda} = \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_{k} f_{k}(s_{t-1}^{(i)}, s_{t}^{(i)}, o^{(i)}, t) - \log Z_{O^{(i)}} \right) \qquad (5)

L_{\Lambda} = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_{k} f_{k}(s_{t-1}^{(i)}, s_{t}^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \log Z_{O^{(i)}} - \sum_{k} \frac{\lambda_{k}^{2}}{2\sigma^{2}} \qquad (6)

where \sum_{k} \frac{\lambda_{k}^{2}}{2\sigma^{2}} is the Gaussian prior term used to adjust the weights and σ^{2} denotes the prior variance. Taking the first derivative of formula (6) gives formula (7):

\frac{\partial L_{\Lambda}}{\partial \lambda_{k}} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_{k}(s_{t-1}^{(i)}, s_{t}^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s, s'} f_{k}(s, s', o^{(i)}, t) \, p(s, s' \mid o^{(i)}) - \frac{\lambda_{k}}{\sigma^{2}} \qquad (7)

where the first term is the expected value of the feature function f_{k} under the empirical distribution and the second term is its expected value under the weight vector Λ.

The word-position information comprises B (word-initial), M (word-medial), E (word-final) and S (single-character word). The named entity information is a person name, place name or organization name, denoted PER, LOC and ORG respectively. The part-of-speech information is noun, verb, adjective, preposition and so on.
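A composite lexical information tag can be split into its word-position part and its category part, as sketched below. The underscore separator follows the tag format used in the example of this description; the function name, the dictionary of position labels and the returned tuple layout are assumptions made for illustration.

```python
# Word-position labels and named entity types named in the description.
POSITIONS = {"B": "word-initial", "M": "word-medial",
             "E": "word-final", "S": "single-character word"}
ENTITY_TYPES = {"PER", "LOC", "ORG"}

def parse_tag(tag):
    """Split a tag such as 'B_ORG' into (position, category, kind)."""
    position, category = tag.split("_", 1)
    if position not in POSITIONS:
        raise ValueError(f"unknown word-position label: {position!r}")
    kind = "named_entity" if category in ENTITY_TYPES else "part_of_speech"
    return position, category, kind

print(parse_tag("B_ORG"))
print(parse_tag("S_p"))
```

Any category that is not PER, LOC or ORG is treated here as a part-of-speech label, which matches the "part of speech or named entity" tag format but is otherwise a simplification.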

The present invention is illustrated below with an example.

Take the Chinese sentence S1, "中国政府顺利恢复对香港行使主权，" ("The Chinese government smoothly resumes the exercise of sovereignty over Hong Kong,"), and perform Chinese lexical analysis on it using the method of the present invention based on three-in-one lexical information tagging. Sentence S1 can be regarded as the character sequence C_{1}(中) C_{2}(国) ... C_{n}(，). This character sequence can have many corresponding lexical information tag sequences. For example, one possible tag sequence S11 is "中/B_ORG 国/M_ORG 政/M_ORG 府/E_ORG 顺/B_ad 利/E_ad 恢/B_v 复/E_v 对/S_p 香/B_LOC 港/E_LOC 行/B_v 使/E_v 主/B_n 权/E_n ，/S_wd". Another possible tag sequence S12 is "中/B_LOC 国/E_LOC 政/B_n 府/E_n 顺/B_ad 利/E_ad 恢/B_v 复/E_v 对/S_p 香/B_LOC 港/E_LOC 行/B_v 使/E_v 主/B_n 权/E_n ，/S_wd".

In lexical information tag sequence S11, the tags corresponding to the characters of "中国政府顺利恢复对香港行使主权，" are, in order, "B_ORG M_ORG M_ORG E_ORG B_ad E_ad B_v E_v S_p B_LOC E_LOC B_v E_v B_n E_n S_wd". In this tag sequence, "中国政府" (Chinese government) is tagged as one named entity (an organization name). In lexical information tag sequence S12, "中国政府" is tagged as "中/B_LOC 国/E_LOC 政/B_n 府/E_n", that is, as one named entity (a place name) followed by a noun.

In the present invention, for each sentence to be processed, the conditional probability of lexical information tag sequence S11 and that of lexical information tag sequence S12 are obtained according to formula (1).

After the conditional probabilities of all possible lexical information tag sequences have been calculated, the tag sequence with the highest conditional probability is taken as the final lexical information tag sequence of the processed sentence. For example, for the Chinese sentence S1 above, the tag sequence determined is S11, "中/B_ORG 国/M_ORG 政/M_ORG 府/E_ORG 顺/B_ad 利/E_ad 恢/B_v 复/E_v 对/S_p 香/B_LOC 港/E_LOC 行/B_v 使/E_v 主/B_n 权/E_n ，/S_wd". Chinese word segmentation is then performed according to the word-position information to obtain the segmentation result, and the part-of-speech tagging and named entity recognition results are obtained from the part-of-speech information or named entity information. For a multi-character word, the part of speech or named entity in the lexical information tag of the word-final character is taken as that of the whole word. Combining these results gives the corresponding lexical analysis result. The Chinese lexical analysis result of sentence S1 is "中国政府/ORG 顺利/ad 恢复/v 对/p 香港/LOC 行使/v 主权/n ，/wd".
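The conversion from a final lexical information tag sequence to the lexical analysis result can be sketched as follows. The function name `decode` and the handling of a sequence that ends without an E or S tag are assumptions; the characters and tags are those of tag sequences S11 and S12 in the example above.

```python
def decode(chars, tags):
    """Merge characters into words using B/M/E/S and label each word with the
    category carried by its word-final (E) or single-character (S) tag."""
    words, buffer = [], ""
    category = ""
    for ch, tag in zip(chars, tags):
        position, category = tag.split("_", 1)
        buffer += ch
        if position in ("E", "S"):      # a word ends here
            words.append(f"{buffer}/{category}")
            buffer = ""
    if buffer:                          # assumed fallback for a sequence cut off mid-word
        words.append(f"{buffer}/{category}")
    return " ".join(words)

chars = list("中国政府顺利恢复对香港行使主权，")
S11 = ("B_ORG M_ORG M_ORG E_ORG B_ad E_ad B_v E_v "
       "S_p B_LOC E_LOC B_v E_v B_n E_n S_wd").split()
S12 = ("B_LOC E_LOC B_n E_n B_ad E_ad B_v E_v "
       "S_p B_LOC E_LOC B_v E_v B_n E_n S_wd").split()

print(decode(chars, S11))
print(decode(chars, S12))
```

With S11 the first four characters form one organization name, whereas with S12 they form a place name followed by a noun, which is the difference between the two candidate tag sequences discussed above.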

When the Chinese text to be processed consists of a plurality of Chinese sentences, the above operations are performed on each sentence in turn to obtain the lexical analysis result of each sentence, which together give the Chinese lexical analysis result of the whole text.

The present invention has the following advantages:

1. The three subtasks of Chinese lexical analysis, namely Chinese word segmentation, part-of-speech tagging and named entity recognition, are unified and realized in a character sequence tagging framework. The tag of each character contains three classes of lexical information: word position, part of speech and named entity. Because the lexical analysis is based on this three-in-one lexical information tagging, it overcomes the drawbacks that errors are propagated upward, amplified and accumulated and that multiple classes of information are difficult to integrate and use. The computation is simple, the amount of calculation is small, and the precision of Chinese word segmentation, part-of-speech tagging and named entity recognition can be significantly improved;

2. Dictionaries are dispensed with entirely, realizing truly dictionary-free Chinese lexical analysis.

The above is only the preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A Chinese lexical analysis method, characterized in that it comprises the following steps:

1) obtaining feature functions and weights from a given training corpus:

2. The Chinese lexical analysis method according to claim 1, characterized in that in step 3) the conditional probability of each lexical information tag sequence is calculated as:

P_{\Lambda}(S \mid O) = \frac{1}{Z_{O}} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_{k} f_{k}(s_{t-1}, s_{t}, o, t) \right) \qquad (1)

3. The Chinese lexical analysis method according to claim 2, characterized in that in step 1) the weights of the groups of feature functions are obtained by log maximum likelihood estimation, the specific steps being:

The training corpus is expressed as D = \{ (O^{(i)}, S^{(i)}) \}_{i=1}^{N},

where O^{(i)} is the input data sequence and the lexical information tag sequence S^{(i)} of that character sequence is the corresponding output data sequence; under the training corpus D, the log-likelihood is:

L_{\Lambda} = \sum_{i=1}^{N} \log P_{\Lambda}(S^{(i)} \mid O^{(i)}) \qquad (4)

Substituting formula (1) into formula (4) gives:

L_{\Lambda} = \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_{k} f_{k}(s_{t-1}^{(i)}, s_{t}^{(i)}, o^{(i)}, t) - \log Z_{O^{(i)}} \right) \qquad (5)

L_{\Lambda} = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_{k} f_{k}(s_{t-1}^{(i)}, s_{t}^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \log Z_{O^{(i)}} - \sum_{k} \frac{\lambda_{k}^{2}}{2\sigma^{2}} \qquad (6)

Wherein

\frac{\partial L_{\Lambda}}{\partial \lambda_{k}} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_{k}(s_{t-1}^{(i)}, s_{t}^{(i)}, o^{(i)}, t) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s, s'} f_{k}(s, s', o^{(i)}, t) \, p(s, s' \mid o^{(i)}) - \frac{\lambda_{k}}{\sigma^{2}} \qquad (7)

where the first term,

\sum_{i=1}^{N} \sum_{t=1}^{T} f_{k}(s_{t-1}^{(i)}, s_{t}^{(i)}, o^{(i)}, t), is the expected value of the feature function f_{k} under the empirical distribution

4. The Chinese lexical analysis method according to any one of claims 1-3, characterized in that the sample window in step 1) is set to a "5-character sample window" and the feature template set is chosen as "TMPT-10".

5. The Chinese lexical analysis method according to any one of claims 1-3, characterized in that the word-position information comprises B (word-initial), M (word-medial), E (word-final) and S (single-character word); and the named entity information is a person name, place name or organization name.

6. The Chinese lexical analysis method according to claim 4, characterized in that the word-position information comprises B (word-initial), M (word-medial), E (word-final) and S (single-character word); and the named entity information is a person name, place name or organization name.