CN103309926A

CN103309926A - Chinese and English-named entity identification method and system based on conditional random field (CRF)

Info

Publication number: CN103309926A
Application number: CN2013100782042A
Authority: CN
Inventors: 张艳; 李艳玲; 徐为群; 颜永红
Original assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Priority date: 2013-03-12
Filing date: 2013-03-12
Publication date: 2013-09-18

Abstract

The invention provides a method and a system for recognizing mixed Chinese-English named entities based on a conditional random field (CRF). The method comprises the following steps: (101) converting the user's spoken query into text; (102) separating the text into Chinese characters and English words on the basis of a finite state machine; (103) extracting features from the separated text; (104) performing entity recognition on the text with a trained CRF model according to the feature extraction result, and labeling the entity class, wherein the CRF model is a linear-chain conditional random field. Step (102) further comprises the following steps: (102-1) separating the Chinese and English text into characters; (102-2) identifying English strings with the finite state machine, that is, merging adjacent English letters, spaces and English symbols; (102-3) segmenting the English strings into words.

Description

Method and system for mixed Chinese-English named entity recognition based on a conditional random field

Technical field

The present invention relates to sequence labeling models based on finite state machines and conditional random fields. To address the mixing of Chinese and English in user query sentences during human-machine interaction, it proposes a method and system for performing mixed Chinese-English named entity recognition on such sentences.

Background technology

In a human-machine interaction system, the user issues query requests in spoken language and the system provides information services. A typical human-machine interaction system comprises four components: automatic speech recognition, spoken language understanding, dialogue management and speech synthesis. The spoken language understanding component converts the speech-recognized query sentence into a corresponding semantic representation. However, with the internationalization and convergence of information, utterances that mix several languages are now common, which makes spoken language understanding difficult. Among multilingual utterances, mixed Chinese-English ones are the most common. In the film, television and song query domain of human-machine interaction systems, mixed Chinese-English utterances are especially prominent and arise mainly from foreign films, television programmes and songs and from English names. The task of a human-machine interaction service is to understand the user correctly however Chinese and English are mixed in the query. One step of the understanding process is named entity recognition, that is, finding the entity nouns that the user's query refers to. Traditional named entity recognition has mainly targeted pure English or pure Chinese text. In pure English, words are separated by spaces, so entity recognition requires no word segmentation and is comparatively easy. Entity recognition for pure Chinese is harder than for pure English, and entity recognition for mixed Chinese-English speech is harder still. In addition, spoken language is less grammatical and more casual than written language. Prior-art named entity recognition methods alone therefore cannot solve this problem.

Summary of the invention

To overcome the above technical problem, the object of the invention is to provide a method and system for mixed Chinese-English named entity recognition based on a conditional random field.

To achieve this object, the invention provides a mixed Chinese-English named entity recognition method based on a conditional random field, the method comprising:

Step 101) converting the user's spoken query into text;

Step 102) separating the text into single Chinese characters and English words on the basis of a finite state machine;

Step 103) performing feature extraction on the separated text;

Step 104) performing entity recognition on the separated characters according to the result of feature extraction, using a trained CRF model, and labeling the entity class;

The CRF model is a linear-chain conditional random field. Named entities generally refer to person names, place names and organization names; in the media domain, they refer specifically to person names and song titles (including film and television titles, website names and TV stations). Entity recognition identifies the type to which each separated character belongs. For example, in the sentence 我想听刘德华的歌曲忘情水 ("I want to listen to Liu Dehua's song Wang Qing Shui"), 刘德华 (Liu Dehua) is a person-name entity and 忘情水 (Wang Qing Shui) is a song-title entity. Studies have shown that, for spoken-language text, taking single characters as the unit of analysis gives better results than taking segmented words, so entity recognition is performed on each character. The statistical model judges the type of the current character from the character itself and its context. At training time, the sentence above is annotated as: 我 O 想 O 听 O 刘 B-PER 德 I-PER 华 I-PER 的 O 歌 O 曲 O 忘 B-NAME 情 I-NAME 水 I-NAME. Here "O" denotes "other", "B" denotes "begin" (the start of an entity) and "I" denotes "inside"; "PER" and "NAME" distinguish the entity class, denoting a person name and a domain name respectively. At test time, the trained model automatically labels the class of each character of the input sentence, from which each entity is obtained.
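The last step described above, reading entities off a B/I/O label sequence, can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `bio_to_entities` is hypothetical, and the example sentence is assumed to be the Chinese sentence 我想听刘德华的歌曲忘情水 with the labels given in the text.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_class) pairs from a B/I/O-labeled token sequence.

    Tokens are assumed to be single Chinese characters, so they are joined
    without a separator; English words would need a space between them.
    """
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_class: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that is still open.
            if current:
                entities.append(("".join(current), current_class))
            current, current_class = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_class:
            current.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current:
                entities.append(("".join(current), current_class))
            current, current_class = [], None
    if current:
        entities.append(("".join(current), current_class))
    return entities


tokens = list("我想听刘德华的歌曲忘情水")
tags = ["O", "O", "O", "B-PER", "I-PER", "I-PER",
        "O", "O", "O", "B-NAME", "I-NAME", "I-NAME"]
print(bio_to_entities(tokens, tags))
```

Treating an `I-` tag that does not continue an open entity as a break is one possible policy; a stricter decoder could reject such sequences instead.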

Step 102) further comprises:

Step 102-1) separating the Chinese and English text into characters;

Step 102-2) identifying English strings with the finite state machine, that is, merging adjacent English letters, spaces and English symbols;

Step 102-3) segmenting the English strings into words.

The feature extraction comprises:

whether the current character or English word is a character in the dictionary of common Chinese surname and given-name characters;

whether the current character or English word is a left or right boundary character or two-character word of a person name or of a film and television title;

whether the current character or English word is an English word;

wherein the feature extraction further comprises extracting combination features between the above features, and contextual features.

Step 104) specifically adopts the following strategy to obtain the entity class:

Sequence labeling is performed with a linear-chain conditional random field, wherein the CRF model parameters are estimated with the L-BFGS algorithm, the optimal label sequence is obtained with the Viterbi decoding algorithm, and the entity classes are finally obtained from the optimal label sequence.

To implement the above method, the invention also provides a mixed Chinese-English named entity recognition system based on a conditional random field, the system comprising:

a conversion module, used to convert the user's spoken query information into text information;

a preprocessing module, used to split the speech-recognized text into single Chinese characters and to segment English strings, on the basis of a finite state machine;

a feature extraction module, used to perform feature extraction on the separated text;

a type determination and recognition module, used to perform entity recognition on the text according to the result of feature extraction, using a trained CRF model, and to label the entity class;

wherein the CRF model is a linear-chain conditional random field.

The preprocessing module further comprises:

a first processing submodule, used to separate the text information into characters;

a second processing submodule, used to identify English strings among the separated English characters with a finite-state-machine method, that is, to merge adjacent English letters, spaces and English symbols;

a third processing submodule, used to segment the English strings into words.

The feature extraction comprises:

whether the current character or English word is an English word,

The type determination and recognition module further comprises:

a CRF model building submodule, used to obtain the CRF model parameters with the L-BFGS algorithm, that is, to obtain the trained CRF model;

a type labeling submodule, used to train the CRF. Query utterances of various kinds are collected manually for the restricted domain, including utterances that mix Chinese and English, and are then manually annotated and used to train the CRF. The training tool is the open-source tool CRF++. The training steps are as follows: feature extraction is first performed according to the format of the training text, taking single characters as the unit of analysis; feature selection uses the individual features and their contextual features, together with combination features between the features; finally, training the CRF yields a model file, and type labeling is performed on the basis of this model file;

a decoding submodule, which extracts, for the text requiring entity recognition, the same features as in the training process of the type labeling submodule, then applies the trained model and decodes with the Viterbi algorithm to obtain the labeling result for each character.
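As an illustration of preparing training text for CRF++, the sketch below writes annotated sentences in the column format that CRF++ reads (one token per line, whitespace-separated columns, the label in the last column, a blank line between sentences) and generates a feature template. The column layout, the feature value names (`SUR`, `LEFT`, `Y`, and so on) and the helper names `to_crfpp` and `make_template` are assumptions made for this example; the source does not give the actual template.

```python
from typing import List, Tuple

# Assumed row layout: token, name-dictionary feature, boundary feature,
# English feature, label. The label must be the last column for CRF++.
Row = Tuple[str, str, str, str, str]


def to_crfpp(sentences: List[List[Row]]) -> str:
    """Serialize sentences as tab-separated rows with a blank line between sentences."""
    blocks = ["\n".join("\t".join(row) for row in sentence) for sentence in sentences]
    return "\n\n".join(blocks) + "\n"


def make_template(window: int = 2, n_feature_cols: int = 4) -> str:
    """Build a CRF++ template with context macros and one combination feature."""
    lines = []
    index = 0
    for col in range(n_feature_cols):
        for offset in range(-window, window + 1):
            # %x[row,col] reads column `col` of the token `row` positions away.
            lines.append(f"U{index:02d}:%x[{offset},{col}]")
            index += 1
    # Combination of the three dictionary and rule features of the current token.
    lines.append(f"U{index:02d}:%x[0,1]/%x[0,2]/%x[0,3]")
    # "B" adds label-bigram features, i.e. dependence on the previous label.
    lines.append("B")
    return "\n".join(lines) + "\n"


train_text = to_crfpp([
    [("刘", "SUR", "NO", "N", "B-PER"), ("德", "GIV", "NO", "N", "I-PER")],
    [("听", "NO", "LEFT", "N", "O")],
])
template_text = make_template()
```

With files written from these two strings, training and testing would typically be run with the CRF++ command-line tools `crf_learn` and `crf_test`.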

Compared with the prior art, the technical advantages of the present invention are as follows:

The present invention combines a finite state machine with a conditional random field to recognize mixed Chinese-English named entities in human-machine interaction. The main steps are as follows. First, preprocessing (the Chinese-English segmentation module): a finite state machine is used to segment the speech-recognized text into Chinese characters and English words. Second, feature extraction: three classes of effective features are added, namely features that identify the surname characters and given-name characters of Chinese person names, features that identify the left and right boundary words of person names and entity names, and a regular-expression feature that identifies English. Third, sequence labeling with a linear-chain conditional random field: the model parameters are estimated with the L-BFGS algorithm, the optimal label sequence is obtained with the Viterbi decoding algorithm, and the entity classes are finally obtained from the optimal label sequence. The proposed method has the following advantages. First, it handles entity recognition for mixed Chinese-English query sentences in human-machine interaction systems well and therefore has wide application value. Second, rule-based features are added within the statistical framework, which also mitigates the sparsity of the training data for the conditional random field. Finally, the method is not limited to mixed Chinese-English entity recognition in human-machine interaction systems and can be extended to web-based entity recognition.

Description of drawings

Fig. 1 is a block diagram of the mixed Chinese-English entity recognition provided by the invention.

Embodiment

The method of the invention is described in detail below with reference to the drawing and embodiments.

The mixed Chinese-English named entity recognition method provided by the invention mainly addresses named entity recognition for query sentences that mix Chinese and English in human-machine interaction. The system based on this method targets queries about TV stations, websites, applications and media. The media part further includes queries for actors, directors and singers, and queries for film, TV series and song titles.

The method provided by the invention can use the following recognition system to recognize mixed Chinese-English named entities:

First, the user's spoken query is converted into text by a speech recognition system. The text then enters the Chinese-English segmentation module, after which features are extracted from the segmented sentence. Finally, entity recognition is performed with the trained CRF model and the entity classes are labeled. The components are described in detail below.

(1) Preprocessing part (the Chinese-English segmentation module):

This module separates Chinese text into single characters and English text into words. It works in three steps. In the first step, the text is separated into characters. In the second step, English strings are identified with a finite state machine, that is, adjacent English letters, spaces and English symbols are merged. In the third step, the English strings are segmented, that is, split on spaces. For example:

Original user query: 我想听Michael Jackson的歌曲 (I want to listen to Michael Jackson's songs)

First step (after character separation): 我|想|听|M|i|c|h|a|e|l| |J|a|c|k|s|o|n|的|歌|曲

Original user query: 我想听歌曲江南style (I want to listen to the song Jiangnan style)

First step (after character separation): 我|想|听|歌|曲|江|南|s|t|y|l|e

Third step result: same as the second step.

Original user query: 我想听S.H.E.的歌曲 (I want to listen to S.H.E.'s songs)

First step (after character separation): 我|想|听|S|.|H|.|E|.|的|歌|曲

Third step result: same as the second step.

Original user query: 给我找一下歌曲I'll always love you (Find me the song I'll always love you)

First step (after character separation): 给|我|找|一|下|歌|曲|I|'|l|l| |a|l|w|a|y|s| |l|o|v|e| |y|o|u
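The three preprocessing steps can be sketched with a small two-state machine. This is an illustrative sketch under stated assumptions, not the patented implementation: the set of English symbols, the state names and the function names are hypothetical, and the test queries are assumed to be the Chinese-English sentences behind the examples above.

```python
from typing import List

# Assumed set of symbols that may occur inside English strings (e.g. "S.H.E.", "I'll").
ENGLISH_SYMBOLS = set(".'-&")


def is_english_char(ch: str) -> bool:
    """True for ASCII letters, spaces and the assumed English symbols."""
    return (ch.isascii() and ch.isalpha()) or ch == " " or ch in ENGLISH_SYMBOLS


def merge_english(chars: List[str]) -> List[str]:
    """Step 2: merge adjacent English letters, spaces and symbols into one string."""
    tokens: List[str] = []
    buffer: List[str] = []
    state = "OTHER"  # two states: "OTHER" (e.g. Chinese) and "ENGLISH"
    for ch in chars:
        if is_english_char(ch):
            state = "ENGLISH"
            buffer.append(ch)
            continue
        if state == "ENGLISH":
            # Leaving the ENGLISH state: emit the accumulated string.
            merged = "".join(buffer).strip()
            if merged:
                tokens.append(merged)
            buffer, state = [], "OTHER"
        tokens.append(ch)
    merged = "".join(buffer).strip()
    if merged:
        tokens.append(merged)
    return tokens


def tokenize(text: str) -> List[str]:
    chars = list(text)             # step 1: character separation
    merged = merge_english(chars)  # step 2: identify English strings
    tokens: List[str] = []
    for token in merged:
        tokens.extend(token.split())  # step 3: split English strings on spaces
    return tokens


print(merge_english(list("我想听Michael Jackson的歌曲")))
print(tokenize("我想听Michael Jackson的歌曲"))
```

Under these assumptions, step 2 keeps `Michael Jackson` as one English string and step 3 splits it into two words, while `S.H.E.` and `style` contain no spaces and are unchanged by step 3.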

(2) named entity recognition module:

The entity recognition problem can be described as follows: given a sequence of $\tau$ characters as the observation $o$, find the corresponding state sequence (that is, the entity classes) $s_{1}^{\tau}$.

The desired state sequence $\hat{s}_{1}^{\tau}$ should satisfy the following: given the model parameters $\lambda$, it maximizes the posterior probability $p(s_{1}^{\tau} \mid o; \lambda)$. Here an undirected graphical model is used for the posterior probability, which is expressed as:

$$p(s_{1}^{\tau} \mid o; \lambda) = \frac{1}{z(o; \lambda)} \exp\left(\lambda \cdot F(s_{1}^{\tau}, o)\right) \tag{1}$$

where $F(s_{1}^{\tau}, o)$ is the feature vector, each feature being a function of the state sequence and the observation; $\lambda$ is the parameter vector, which acts as the weights of the features; and $z(o; \lambda)$ is the factor that normalizes over all candidate state sequences, so that the probability lies in (0, 1). For computational reasons, $s_{1}^{\tau}$ is usually assumed to form a Markov chain, so that each feature $f_{k}$ in the feature set $F$ depends only on two adjacent states. This gives:

$$p(s_{1}^{\tau} \mid o; \lambda) = \frac{1}{z(o; \lambda)} \exp\left(\sum_{k} \lambda_{k} \sum_{t=1}^{\tau} f_{k}(s^{(t-1)}, s^{(t)}, o, t)\right) \tag{2}$$
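To make formula (2) concrete, the sketch below evaluates it for a toy linear-chain model, computing the normalization factor by brute-force enumeration of all state sequences. The label set, the three feature functions, their weights and the start symbol are invented for illustration only, and positions are 0-based in the code where the formula uses t = 1..τ.

```python
import itertools
import math
from typing import Sequence

LABELS = ["O", "B-PER", "I-PER"]
START = "<s>"  # hypothetical state s^(0) preceding the first position


def f_surname(prev: str, cur: str, obs: Sequence[str], t: int) -> float:
    """Fires when a B-PER label sits on a character from a toy surname list."""
    return 1.0 if cur == "B-PER" and obs[t] in {"刘", "张"} else 0.0


def f_begin_inside(prev: str, cur: str, obs: Sequence[str], t: int) -> float:
    """Fires on the transition B-PER -> I-PER."""
    return 1.0 if prev == "B-PER" and cur == "I-PER" else 0.0


def f_orphan_inside(prev: str, cur: str, obs: Sequence[str], t: int) -> float:
    """Fires when I-PER does not follow B-PER or I-PER."""
    return 1.0 if cur == "I-PER" and prev not in {"B-PER", "I-PER"} else 0.0


FEATURES = [f_surname, f_begin_inside, f_orphan_inside]
WEIGHTS = [2.0, 1.5, -3.0]  # the parameter vector lambda, chosen arbitrarily


def score(states: Sequence[str], obs: Sequence[str]) -> float:
    """Exponent of formula (2): sum_k lambda_k * sum_t f_k(s^(t-1), s^(t), o, t)."""
    total, prev = 0.0, START
    for t, cur in enumerate(states):
        total += sum(w * f(prev, cur, obs, t) for w, f in zip(WEIGHTS, FEATURES))
        prev = cur
    return total


def partition(obs: Sequence[str]) -> float:
    """z(o; lambda): sum of exp(score) over every possible state sequence."""
    return sum(math.exp(score(seq, obs))
               for seq in itertools.product(LABELS, repeat=len(obs)))


def probability(states: Sequence[str], obs: Sequence[str]) -> float:
    return math.exp(score(states, obs)) / partition(obs)


obs = ["听", "刘", "德"]
print(probability(("O", "B-PER", "I-PER"), obs))
```

Brute-force enumeration grows exponentially with sentence length; practical implementations compute the same normalization factor with the forward algorithm.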

This module extracts features from the sentence and performs sequence labeling with the trained CRF model to obtain the entity class labels. The features fall into three classes: first, whether the current character (or English word) is a character in the dictionary of common Chinese surname and given-name characters; second, whether the current character (or English word) is a left or right boundary character or two-character word of a person name or of a film and television title; third, whether the current character (or English word) is an English word. Besides these single-token features, combination features between them and contextual features are also added.
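A minimal sketch of the three feature classes, their combination and a context window is given below. The dictionaries are tiny hypothetical stand-ins for the name-character and boundary-word dictionaries, and the regular expression, the feature value names and the window of two tokens on each side are assumptions made for this example.

```python
import re
from typing import Dict, List, Set

# Hypothetical toy dictionaries; a real system would build them from data.
SURNAME_CHARS = {"刘", "张", "王"}
GIVEN_NAME_CHARS = {"德", "华", "伟"}
LEFT_BOUNDARY = {"听", "看", "播放"}   # characters or two-character words seen just before an entity
RIGHT_BOUNDARY = {"的", "演"}         # characters or two-character words seen just after an entity
ENGLISH_RE = re.compile(r"^[A-Za-z][A-Za-z.'-]*$")

KEYS = ("tok", "name", "boundary", "english")


def _in_words(words: Set[str], prev_tok: str, tok: str, next_tok: str) -> bool:
    """True if tok is a boundary character or part of a two-character boundary word."""
    return tok in words or (prev_tok + tok) in words or (tok + next_tok) in words


def base_features(tokens: List[str], i: int) -> Dict[str, str]:
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else ""
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
    if tok in SURNAME_CHARS:
        name = "SUR"
    elif tok in GIVEN_NAME_CHARS:
        name = "GIV"
    else:
        name = "NO"
    if _in_words(LEFT_BOUNDARY, prev_tok, tok, next_tok):
        boundary = "LEFT"
    elif _in_words(RIGHT_BOUNDARY, prev_tok, tok, next_tok):
        boundary = "RIGHT"
    else:
        boundary = "NO"
    english = "Y" if ENGLISH_RE.match(tok) else "N"
    return {"tok": tok, "name": name, "boundary": boundary, "english": english}


def sentence_features(tokens: List[str], window: int = 2) -> List[Dict[str, str]]:
    """Per-token features with a context window and one combination feature."""
    base = [base_features(tokens, i) for i in range(len(tokens))]
    result = []
    for i in range(len(tokens)):
        feats: Dict[str, str] = {}
        for offset in range(-window, window + 1):
            j = i + offset
            for key in KEYS:
                feats[f"{key}[{offset}]"] = base[j][key] if 0 <= j < len(tokens) else "PAD"
        feats["name|boundary|english"] = "|".join(base[i][k] for k in KEYS[1:])
        result.append(feats)
    return result


feats = sentence_features(["我", "想", "听", "刘", "德", "华", "的", "歌", "曲"])
print(feats[3])
```

The rule-based columns give the model evidence about a character even when that character is rare in the training data, which is the motivation the text gives for adding them.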

The parameters of a CRF model are usually estimated with the L-BFGS algorithm. Decoding a CRF means finding the label sequence of an unlabeled string; for a linear-chain CRF, this can be done with the Viterbi algorithm.
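Viterbi decoding for a linear chain can be sketched as below. The scores are assumed to be already weighted log-potentials, split into per-position label scores (`emissions`) and label-to-label scores (`transitions`); this decomposition, the function name and the toy numbers are assumptions for illustration and are not taken from the source.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence under additive scores."""
    if not emissions:
        return []
    # best[t][y]: score of the best path that ends with label y at position t.
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


labels = ["O", "B-PER", "I-PER"]
transitions = {p: {y: 0.0 for y in labels} for p in labels}
transitions["O"]["I-PER"] = -10.0     # discourage I-PER directly after O
transitions["B-PER"]["I-PER"] = 2.0
transitions["I-PER"]["I-PER"] = 1.0
# Toy scores for the four characters 听 刘 德 华.
emissions = [
    {"O": 2.0, "B-PER": 0.0, "I-PER": 0.0},
    {"O": 0.0, "B-PER": 2.0, "I-PER": 0.0},
    {"O": 0.5, "B-PER": 0.0, "I-PER": 1.0},
    {"O": 0.5, "B-PER": 0.0, "I-PER": 1.0},
]
print(viterbi(emissions, transitions, labels))
```

The dynamic programme visits each position once and compares every pair of adjacent labels, so its cost grows linearly with sentence length and quadratically with the number of labels.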

Embodiment

1. Perform Chinese-English segmentation on the speech-recognized text. This part is divided into three steps: first, separate the Chinese and English text into characters; second, identify English strings with a finite state machine, that is, merge adjacent English letters, spaces and English symbols; third, segment the English strings, that is, split them on spaces.

2. Construct the training data for the CRF. The data should cover the common spoken expressions of the domain as fully as possible.

3. Annotate the training data, that is, mark the class of the entity nouns in each query sentence.

4. Feature extraction one: to better extract the various entity nouns in the domain (including person names and other nouns), a dictionary of common surname characters and given-name characters of Chinese person names is built according to how Chinese person names are formed, and is used to construct the feature template.

5. Feature extraction two: to extract person names and film and television titles more accurately, the single characters and two-character words that appear immediately before and after person names and film and television titles are counted over a large amount of data, a dictionary of left and right boundary words of person names and domain names is built from them, and features are extracted with it.

6. Feature extraction three: so that a long English song name is recognized as one complete song title, a feature indicating whether the current token is English is added; it is determined with a regular expression.

7. Feature extraction four: extract contextual features and the combination features of the above three kinds of features, where the contextual features take the two tokens before and the two tokens after the current token.

8. Obtain the CRF model parameters with the L-BFGS algorithm, that is, obtain the trained CRF model.

9. Decode with the Viterbi algorithm to obtain the entity labels, and finally obtain the mixed Chinese-English entity recognition result.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solution that do not depart from its spirit and scope all fall within the scope of the claims of the present invention.

Claims

1. A mixed Chinese-English named entity recognition method based on a conditional random field, the method comprising:

Step 101) converting the user's spoken query information into text information;

Step 103) performing feature extraction on the separated text;

Step 104) performing entity recognition on the separated characters or words according to the result of feature extraction, using a trained CRF model, and labeling the entity class;

2. The mixed Chinese-English named entity recognition method based on a conditional random field according to claim 1, characterized in that step 102) further comprises:

Step 102-1) separating the Chinese and English text into characters;

Step 102-2) identifying English strings with the finite state machine, that is, merging adjacent English letters, spaces and English symbols;

Step 102-3) segmenting the English strings into words.

3. The mixed Chinese-English named entity recognition method based on a conditional random field according to claim 1, characterized in that the feature extraction comprises:

whether the current character or English word is an English word,

4. The mixed Chinese-English named entity recognition method based on a conditional random field according to claim 1, characterized in that step 104) specifically adopts the following strategy to obtain the entity class:

5. A mixed Chinese-English named entity recognition system based on a conditional random field, the system comprising:

a conversion module, used to convert the user's spoken query into text;

a preprocessing module, used to split the text into single Chinese characters and to segment English strings;

a feature extraction module, used to perform feature extraction on the separated text;

6. The mixed Chinese-English named entity recognition method based on a conditional random field according to claim 5, characterized in that the preprocessing module further comprises:

a first processing submodule, used to separate the text information into characters;

a second processing submodule, used to identify English strings among the separated English characters with a finite-state-machine method, that is, to merge adjacent English letters, spaces and English symbols;

7. The mixed Chinese-English named entity recognition method based on a conditional random field according to claim 5, characterized in that the feature extraction comprises:

whether the current character or English word is an English word,

8. The mixed Chinese-English named entity recognition method based on a conditional random field according to claim 5, characterized in that the type determination and recognition module further comprises:

a decoding submodule, which extracts, for the text requiring entity recognition, the same features as in the training process of the type labeling submodule, then applies the trained model and decodes with the Viterbi algorithm to obtain the labeling result for each character.