CN103020230A - Semantic fuzzy matching method

Info

Publication number: CN103020230A
Application number: CN2012105438390A
Authority: CN
Inventors: 张艳; 李艳玲; 徐为群; 颜永红
Original assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Priority date: 2012-12-14
Filing date: 2012-12-14
Publication date: 2013-04-03

Abstract

An embodiment of the invention provides a semantic fuzzy matching method. The method comprises: performing feature extraction on text produced by speech recognition to obtain feature data; performing named entity recognition on the feature data with a conditional random field (CRF) model to find the key semantic categories in the sentence; and performing exact matching on the key semantic categories and, when exact matching fails, performing fuzzy matching, in which the similarity between a key semantic category and the keywords in a dictionary is calculated, the keyword with the largest similarity is selected to replace the key semantic category, and the category is labeled. In the method of the embodiment, the CRF performs sequence labeling so that the key semantic categories in the query statement are preliminarily labeled and located, which narrows the range of fuzzy matching. Similarity is then calculated against the domain dictionary, and the dictionary entry with the largest similarity replaces the erroneous key semantic category in the user's query. This reduces the amount of computation and improves recognition speed.

Description

A semantic fuzzy matching method

Technical field

The application relates to the field of speech recognition and, specifically, to a semantic fuzzy matching method.

Background

In a human-machine interaction system, the user makes a query request in spoken language and the system provides an information service. A typical human-machine interaction system comprises four components: automatic speech recognition, speech understanding, dialogue management, and speech synthesis. The speech understanding component converts the query statement produced by speech recognition into a corresponding semantic representation. Speech understanding, however, often faces the following problem: the user's query statement contains pronunciation variation, recognition errors introduced by speech recognition, and incomplete key semantic concepts. Obtaining a correct understanding result when only part of the key information is available requires fuzzy matching to improve the robustness of the system. Common human-machine interaction services are each limited to a specific domain, and the data of the relevant domain is stored in a database. Traditional fuzzy matching algorithms mainly find the starting position of the substring in a given text string that matches a pattern, and most of them use edit distance as the similarity function. In such methods every Chinese character in the user's query statement takes part in the computation, so if the sentence is long, computation speed drops considerably.

Summary of the invention

In view of the problems in the prior art, the purpose of the embodiments of the invention is to provide a semantic fuzzy matching method, the method comprising: performing feature extraction on the text produced by speech recognition to obtain feature data; performing named entity recognition on the feature data with a conditional random field (CRF) model to find the key semantic categories in the sentence; and performing exact matching on the key semantic category and, when exact matching fails, performing fuzzy matching, in which the similarity between the key semantic category and the keywords in a dictionary is calculated, a keyword with larger similarity is selected to replace the key semantic category, and the category is labeled.

Preferably, calculating the similarity between the key semantic category and a keyword in the dictionary specifically comprises: dividing twice the number of Chinese characters in the intersection of the word of the key semantic category and the keyword by the total number of Chinese characters in the word of the key semantic category and the keyword; the larger the resulting quotient, the higher the similarity.

Preferably, the CRF model is obtained by the following steps: constructing training data according to the domain, the training data covering the common expressions of spoken language as far as possible; labeling the training data, i.e., marking the category of each entity noun in the training data; performing feature extraction on the training data to extract the entity nouns; and training on the extracted entity nouns with the CRF to obtain the CRF model.

Preferably, the method further comprises: performing semantic understanding on the key semantic category that has been labeled with a category, and giving a semantic representation.

Preferably, the keyword with larger similarity is the keyword with the largest similarity.

Preferably, the keyword is a dictionary entry.

The embodiments of the invention use a statistical method, namely a conditional random field (CRF), to perform sequence labeling, so that the key semantic categories in the query statement are preliminarily labeled and located and the range of fuzzy matching is narrowed. Similarity is then calculated against the domain dictionary, and the dictionary entry with the largest similarity replaces the erroneous key semantic category in the user's query, which reduces the amount of computation and improves recognition speed.

Brief description of the drawings

Fig. 1 is a schematic diagram of the speech understanding system of an embodiment of the invention;

Fig. 2 is a schematic flowchart of the semantic fuzzy matching method of an embodiment of the invention.

Detailed description of the embodiments

The invention is described in detail, clearly, and completely below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the invention without creative work fall within the scope of protection of the invention.

Fig. 1 is a schematic diagram of the speech understanding system of an embodiment of the invention. In Fig. 1, the semantic matching and understanding system comprises a speech recognition system, a semantic category labeling part, and a semantic understanding part. The semantic category labeling part in turn comprises three units: a feature extraction unit, an exact matching unit, and a fuzzy matching unit. The feature extraction unit works together with the CRF model.

Specifically, the semantic category labeling part performs feature extraction on the text produced by speech recognition and then performs named entity recognition with a trained CRF model to find the key semantic concepts in the sentence. These are sent to the exact matching unit for category labeling. If exact matching fails, fuzzy matching is performed: the similarity between the labeled entity noun and the keywords in the dictionary is calculated, the best word is selected to correct it, and the category is labeled. The result is then sent to the semantic understanding part, which gives the semantic representation of the sentence, and the answer is fed back to the user by querying the database. It should be noted that the speech here may be a person's voice or natural speech, and is not specifically limited here.

Here a chain-structured CRF graphical model is adopted. Denote the observation sequence by W = (w_1, w_2, ..., w_n) and the label (state) sequence by Y = (y_1, y_2, ..., y_n). The model is defined as follows:

p_{\lambda}(Y \mid W) = \frac{1}{Z(W)} \exp\left( \sum_{t \in T} \sum_{k} \lambda_{k} f_{k}(y_{t-1}, y_{t}, W, t) \right) \qquad (1)

where f_k is a feature function, λ_k is the weight of the corresponding feature function, t is the position index of the label in the sequence, and Z(W) is the normalization factor that keeps the above probability within (0, 1).
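As an illustration of formula (1), the following minimal sketch evaluates the probability of a label sequence by brute force on a toy example. The label set, the two feature functions, their weights, and the "<START>" placeholder used for y_0 are assumptions made for illustration only; they are not taken from the embodiment, and enumerating every label sequence to obtain Z(W) is practical only for very short inputs.

```python
import itertools
import math

# Hypothetical label set for illustration (not specified in the source).
LABELS = ["O", "B-PER", "I-PER"]


def f_surname(prev, cur, chars, t):
    """Toy feature: the character is a known surname and starts a name."""
    return 1.0 if chars[t] == "刘" and cur == "B-PER" else 0.0


def f_transition(prev, cur, chars, t):
    """Toy feature: a name continues after it has started."""
    return 1.0 if prev == "B-PER" and cur == "I-PER" else 0.0


FEATURES = [f_surname, f_transition]
WEIGHTS = [2.0, 1.5]  # lambda_k, chosen arbitrarily for the example


def score(labels, chars):
    """Inner double sum of formula (1): sum_t sum_k lambda_k * f_k(...)."""
    total = 0.0
    for t in range(len(chars)):
        prev = labels[t - 1] if t > 0 else "<START>"  # assumed start marker
        for lam, f in zip(WEIGHTS, FEATURES):
            total += lam * f(prev, labels[t], chars, t)
    return total


def crf_probability(labels, chars):
    """p(Y | W) with Z(W) computed by enumerating every label sequence."""
    z = sum(
        math.exp(score(list(y), chars))
        for y in itertools.product(LABELS, repeat=len(chars))
    )
    return math.exp(score(labels, chars)) / z
```

Because Z(W) sums the exponentiated score over all candidate label sequences, the probabilities of all label sequences for one input add up to 1.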

The parameters of the CRF model are usually estimated with the L-BFGS algorithm. Decoding with the CRF is the process of solving for the labels of an unlabeled sequence, which requires searching for the label sequence that maximizes the probability computed over that sequence, that is:

Y^{*} = \arg\max_{Y} P(Y \mid W) \qquad (2)

For a linear-chain CRF, this computation can be carried out with the Viterbi algorithm.
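The decoding step of formula (2) can be sketched as follows. This is a generic Viterbi search over a linear chain, not the decoder of any particular toolkit; the function `toy_score`, the label names, and the use of `None` as the previous label at the first position are hypothetical. Since Z(W) does not depend on Y, maximizing the summed local scores is equivalent to maximizing P(Y | W).

```python
def viterbi(chars, labels, local_score):
    """Return the label sequence with the highest total score.

    local_score(prev_label, label, chars, t) plays the role of
    sum_k lambda_k * f_k(y_{t-1}, y_t, W, t); prev_label is None at t = 0.
    """
    n = len(chars)
    if n == 0:
        return []
    # best[t][y]: best score of any labeling of positions 0..t ending in y.
    best = [{y: local_score(None, y, chars, 0) for y in labels}]
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            prev, s = max(
                ((p, best[t - 1][p] + local_score(p, y, chars, t)) for p in labels),
                key=lambda pair: pair[1],
            )
            best[t][y] = s
            back[t][y] = prev
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda y: best[n - 1][y])]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def toy_score(prev, cur, chars, t):
    """Hypothetical hand-set scores standing in for a trained model."""
    s = 0.0
    if chars[t] == "刘" and cur == "B-PER":
        s += 2.0
    if prev == "B-PER" and cur == "I-PER":
        s += 1.5
    if prev == "I-PER" and cur == "I-PER":
        s += 1.0
    if cur == "O" and chars[t] in "听的歌":
        s += 2.0
    return s
```

The search keeps one best score per label and position, so its cost grows linearly with the sentence length and quadratically with the number of labels, instead of exponentially as in exhaustive enumeration.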

The training data for the CRF is constructed according to the domain. The data should cover the common expressions of spoken language as far as possible and should include all the domains used in the system.

The training data is labeled, i.e., the category of the entity nouns in each query statement is marked.

Feature extraction: in order to better extract the various entity nouns involved (including person names and other nouns), and in view of how Chinese personal names are formed, dictionaries of the characters commonly used as surnames and as given names in Chinese personal names were built and used to construct feature templates. In addition, so that person names and film and television titles can be extracted more accurately, the single characters and two-character words that appear immediately before and after person names and film and television titles were counted over a large amount of data, and left and right boundary word dictionaries for person names and domain names were built for feature extraction. A left and right boundary word dictionary is the vocabulary of words that appear immediately to the left and right of a person name or domain name in a sentence. For example, in "I want to listen to Liu Dehua's songs", Liu Dehua is a person name; the left boundary word of Liu Dehua is "listen", and the right boundary word is the possessive particle that immediately follows the name.
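A minimal sketch of character-level feature extraction with such dictionaries is given below. The dictionary contents, the feature names, and the Chinese example sentence (assumed here to be the original of the example above) are hypothetical; the embodiment builds its dictionaries from large amounts of data and also counts two-character boundary words, which this sketch omits.

```python
# Tiny hypothetical dictionaries; real ones would be built from corpus statistics.
SURNAME_CHARS = {"刘", "张", "李"}
GIVEN_NAME_CHARS = {"德", "华", "艳"}
LEFT_BOUNDARY_WORDS = {"听", "看"}
RIGHT_BOUNDARY_WORDS = {"的"}


def extract_features(sentence):
    """Return one feature dict per character of the sentence."""
    rows = []
    for ch in sentence:
        rows.append(
            {
                "char": ch,
                "surname": ch in SURNAME_CHARS,
                "given_name": ch in GIVEN_NAME_CHARS,
                "left_boundary": ch in LEFT_BOUNDARY_WORDS,
                "right_boundary": ch in RIGHT_BOUNDARY_WORDS,
            }
        )
    return rows


features = extract_features("我想听刘德华的歌")
```

Each character receives a flag for every dictionary it appears in, so a sequence model can learn, for example, that a surname character preceded by a left boundary word is likely to start a person name.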

The training data with extracted features is trained with the CRF to obtain a CRF model. It should be noted that the conditional random field is trained with the open-source tool CRF++. The training procedure is roughly as follows. Features are extracted according to the format of the training text. Because the input is spoken language, using words as the unit of analysis may introduce word segmentation errors, so single characters are chosen as the unit for feature extraction. Which features are used depends not only on the training text with extracted features but also on the template file of the tool; that is, besides single-character features, combinations of features are also used. Training produces a model file. For testing, a test file is prepared; features must be extracted from it in the same way, and its format must be the same as that of the training text. The file is then tested with the trained model to obtain a labeling result for each character.
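The sketch below shows one way the training text and template file might be prepared for CRF++. CRF++ reads one token per line with whitespace-separated columns, the last column being the label, and an empty line between sentences; template lines use the `%x[row,col]` macro. The specific columns, labels, and template lines here are illustrative assumptions, not the ones used in the embodiment. Training and testing are then normally run with the `crf_learn` and `crf_test` command-line programs that ship with CRF++.

```python
def to_crfpp_lines(sentences):
    """Convert labeled sentences to CRF++ column format.

    sentences: list of sentences, each a list of
    (character, is_surname, label) tuples (hypothetical feature set).
    """
    lines = []
    for sentence in sentences:
        for ch, is_surname, label in sentence:
            # Column 0: character, column 1: surname flag, last column: label.
            lines.append("\t".join([ch, "Y" if is_surname else "N", label]))
        lines.append("")  # an empty line separates sentences
    return lines


# Illustrative template: single-character features, one dictionary feature,
# one combined feature, and a bigram feature over adjacent output labels.
TEMPLATE = "\n".join(
    [
        "U00:%x[-1,0]",          # previous character
        "U01:%x[0,0]",           # current character
        "U02:%x[1,0]",           # next character
        "U03:%x[0,1]",           # surname flag of the current character
        "U04:%x[-1,0]/%x[0,0]",  # combination of previous and current character
        "B",
    ]
)

example = [[("听", False, "O"), ("刘", True, "B-PER"), ("德", False, "I-PER"), ("华", False, "I-PER")]]
train_lines = to_crfpp_lines(example)
```

The test file must use the same columns in the same order, since the template refers to features by column position.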

For a query statement input by the user, features are extracted by the method above and entity recognition is performed with the trained CRF model, which preliminarily locates the key semantic categories in the sentence.

The located key semantic category may or may not contain errors. Exact matching is therefore performed first, i.e., it is determined whether the semantic category recognized by the CRF exists in the domain dictionary; if it does not, fuzzy matching is performed.
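The exact matching step can be sketched as a plain dictionary lookup. The domain dictionary below, which maps each entry to a category label, is a hypothetical stand-in for the domain data; returning `None` is an assumed way of signalling that fuzzy matching should be tried next.

```python
# Hypothetical domain dictionary: entry -> category.
DOMAIN_DICT = {
    "刘德华": "singer",
    "张学友": "singer",
    "无间道": "movie",
}


def exact_match(span, domain_dict):
    """Return the category of span if it is a dictionary entry, else None."""
    return domain_dict.get(span)


hit = exact_match("刘德华", DOMAIN_DICT)   # found: labeled directly
miss = exact_match("刘得华", DOMAIN_DICT)  # not found: falls through to fuzzy matching
```

Only the span located by the CRF is looked up, so the cost of this step does not depend on the length of the whole query.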

The Dice similarity is used to calculate the similarity between the semantic category recognized by the CRF and the entries in the domain dictionary. The Dice similarity is calculated as follows:

Sim_{Dice}(A, B) = \frac{2 \cdot |A \cap B|}{|A| + |B|} \qquad (3)

That is, twice the number of Chinese characters in the intersection of the two words is divided by the sum of the lengths of the two words. The entry with the largest similarity is found and used to replace the error in the original sentence, which completes the fuzzy matching of the semantic category.
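Formula (3) and the selection of the best entry can be sketched as follows. Two choices here are assumptions rather than statements of the embodiment: the intersection is counted as a multiset of characters (so a repeated character is matched at most as often as it occurs in both words), and ties are resolved in favour of the first entry encountered. The dictionary contents are hypothetical.

```python
from collections import Counter


def dice_similarity(a, b):
    """Dice similarity of two words, counted over their characters."""
    if not a and not b:
        return 0.0
    overlap = sum((Counter(a) & Counter(b)).values())  # |A ∩ B| as a multiset
    return 2.0 * overlap / (len(a) + len(b))


def fuzzy_match(span, domain_dict):
    """Return (entry, category, similarity) for the most similar entry.

    Returns None when no entry shares any character with span.
    """
    best_entry, best_sim = None, 0.0
    for entry in domain_dict:
        sim = dice_similarity(span, entry)
        if sim > best_sim:
            best_entry, best_sim = entry, sim
    if best_entry is None:
        return None
    return best_entry, domain_dict[best_entry], best_sim


domain_dict = {"刘德华": "singer", "张学友": "singer"}
result = fuzzy_match("刘得华", domain_dict)
```

In this example the misrecognized name shares two of its three characters with one entry, giving a similarity of 2·2/(3+3) = 2/3, so that entry replaces the erroneous span and supplies its category.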

Fig. 2 is a schematic flowchart of the semantic fuzzy matching method of an embodiment of the invention. As shown in Fig. 2, the method comprises the following steps. Step 200, extract feature data: feature extraction is performed on the text produced by speech recognition to obtain feature data. Step 202, obtain the key semantic category: named entity recognition is performed on the feature data with the conditional random field (CRF) model to find the key semantic category. Step 204, exact matching: exact matching is performed on the key semantic category; when exact matching succeeds, the key semantic category is labeled with a category and the method proceeds to step 208. Step 208, semantic understanding: semantic understanding is performed on the key semantic category that has been labeled with a category, and a semantic representation is given. When exact matching fails in step 204, the method proceeds to step 206, fuzzy matching: the similarity between the key semantic category and the keywords in the dictionary is calculated, a keyword with larger similarity is selected to replace the key semantic category, and the category is labeled; the method then proceeds to step 208.

Preferably, the keyword is a dictionary entry.

Those skilled in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Those skilled in the art may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the application.

The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The specific embodiments above further describe the purpose, technical solution, and beneficial effects of the invention. It should be understood that the above is only a specific embodiment of the invention and is not intended to limit the scope of protection of the application; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the application shall fall within the scope of protection of the application.

Claims

1. A semantic fuzzy matching method, characterized in that the method comprises:

performing feature extraction on the text produced by speech recognition to obtain feature data;

performing named entity recognition on the feature data with a conditional random field (CRF) model to find the key semantic category;

performing exact matching on the key semantic category and, when exact matching fails, performing fuzzy matching, in which the similarity between the key semantic category and the keywords in a dictionary is calculated, a keyword with larger similarity is selected to replace the key semantic category, and the category is labeled.

2. The semantic fuzzy matching method according to claim 1, characterized in that calculating the similarity between the key semantic category and a keyword in the dictionary specifically comprises: dividing twice the number of Chinese characters in the intersection of the word of the key semantic category and the keyword by the total number of Chinese characters in the word of the key semantic category and the keyword; the larger the resulting quotient, the higher the similarity.

3. The semantic fuzzy matching method according to claim 1, characterized in that the CRF model is obtained by the following steps:

constructing training data according to the domain, the training data covering the common expressions of spoken language as far as possible;

labeling the training data, i.e., marking the category of each entity noun in the training data;

performing feature extraction on the training data to extract the entity nouns;

training on the extracted entity nouns with the CRF to obtain the CRF model.

4. The semantic fuzzy matching method according to any one of claims 1 to 3, characterized in that the method further comprises: performing semantic understanding on the key semantic category that has been labeled with a category, and giving a semantic representation.

5. The semantic fuzzy matching method according to any one of claims 1 to 3, characterized in that the keyword with larger similarity is the keyword with the largest similarity.

6. The semantic fuzzy matching method according to any one of claims 1 to 3, characterized in that the keyword is a dictionary entry.