CN107180084B

CN107180084B - Word bank updating method and device

Info

Publication number: CN107180084B
Application number: CN201710313846.4A
Authority: CN
Inventors: 蒋化冰; 陈岳峰; 马晨星; 张俊杰; 谭舟; 王振超; 梁兰; 徐志强; 严婷; 郦莉
Original assignee: Shanghai Mumu Jucong Robot Technology Co ltd
Current assignee: Shanghai Mumu Jucong Robot Technology Co ltd
Priority date: 2017-05-05
Filing date: 2017-05-05
Publication date: 2020-04-21
Anticipated expiration: 2037-05-05
Also published as: CN107180084A

Abstract

The embodiment of the invention provides a word stock updating method and a device, wherein the method comprises the following steps: and training a classification model by using a training sample set formed by a plurality of general sentence samples and a plurality of sentence samples in a specific field, and obtaining a word set formed by words corresponding to each training sample on the output side of the classification model. And determining the contribution weight of each word in the word set to the classification accuracy of the classification model based on the classification result of each training sample, and selecting X words with highest contribution degree to the classification accuracy of the classification model. And respectively carrying out pronunciation similarity calculation on the X selected words and the M words with the highest word frequency, and determining a hot word bank corresponding to the X words according to a comparison result of the pronunciation similarity and a preset threshold value. And selecting hot words belonging to a specific field from the X words according to the comparison result of the pinyin similarity and the preset threshold value to form a hot word bank, and adding the hot words into the original recognition word bank, so that the recognition effect of the words in the application scene of the specific field is improved.

Description

Word bank updating method and device

Technical Field

The invention relates to the technical field of internet, in particular to a word stock updating method and device.

Background

In scenes such as human-computer interaction and voice assistance realized by a voice mode, accurate recognition of voice input by a user is an important premise. In order to recognize the voice input by the user, a voice recognition word bank is often constructed in advance.

Some speech recognition word libraries commonly used at present are provided for supporting universality, and can be called general recognition word libraries. When the universal word stock is applied to a specific field, the recognition accuracy is likely to be greatly reduced due to the lack of words of the specific field. To this end, in a particular application domain speech recognition, a developer may enrich the lexicon by adding some words specific to the domain to the lexicon.

However, at present, the specific words in a certain field added into the word stock are often manually selected by developers according to experience, so that the rationality is difficult to ensure, and the recognition effect of the updated word stock may be poor.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for updating a lexicon, in which a hot word in a specific field is selected by using contribution weights of training samples and pronunciation similarities of the hot word with words in an original lexicon, so as to ensure rationality of the selected hot word and improve an effect of word recognition.

The embodiment of the invention provides a word stock updating method, which comprises the following steps:

acquiring a training sample set consisting of a plurality of general recognition statement samples and a plurality of specific field statement samples from a preset general recognition corpus and a preset specific field corpus;

acquiring M words with the highest word frequency from the preset general knowledge corpus;

carrying out classification training on the classification model by adopting the training sample set to obtain a classification result corresponding to each training sample and a word set formed by words corresponding to each training sample;

selecting X words with the largest contribution weight from the word set according to the contribution weight of each word in the word set to the classification accuracy of the classification model;

determining pinyin similarity between each of the X words and the M words;

determining a hot word bank corresponding to the X words according to the comparison result of the pinyin similarity and a preset threshold;

and adding the hot word library into the original recognition word library.

The embodiment of the invention provides a word stock updating device, which comprises:

the acquisition module is used for acquiring a training sample set consisting of a plurality of general knowledge sentence samples and a plurality of specific field sentence samples from a preset general knowledge corpus and a preset specific field corpus, and acquiring M words with the highest word frequency from the preset general knowledge corpus;

the training module is used for carrying out classification training on a classification model by adopting the training sample set;

the acquisition module is also used for acquiring the classification result corresponding to each training sample and a word set formed by words corresponding to each training sample;

the selection module is used for selecting X words with the largest contribution weight from the word set according to the contribution weight of each word in the word set to the classification accuracy of the classification model;

the determining module is used for determining the pinyin similarity between each word in the X words and the M words and determining a hot word bank corresponding to the X words according to the comparison result of the pinyin similarity and a preset threshold;

and the updating module is used for adding the hot word library into the original recognition word library.

According to the word stock updating method and device provided by the embodiment of the invention, a training sample set is formed by a plurality of general sentence samples and a plurality of sentence samples in a specific field to carry out classification training on a classification model, and a word set formed by words respectively corresponding to the training samples is obtained at the output side of the classification model. And determining the contribution weight of each word in the word set to the classification accuracy of the classification model based on the classification result of each training sample so as to select the X words with the highest contribution degree to the classification accuracy of the classification model. Then, in order to avoid repeatedly adding the words existing in the original recognition word bank or the words which are particularly similar to the words existing in the original recognition word bank into the original recognition word bank, the hot words belonging to the specific field are selected from the X words based on the comparison result of the pinyin similarity and a preset threshold value by calculating the pinyin similarity between the X words and the M words with the highest word frequency in the recognition corpus respectively, so as to form a hot word bank, and the hot word bank is added into the original recognition word bank, so that the recognition effect of the words in the application scene of the specific field can be improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a flowchart of a first embodiment of a lexicon updating method according to an embodiment of the present invention;

fig. 2 is a flowchart of a second embodiment of a lexicon updating method according to the present invention;

fig. 3 is a flowchart of a third embodiment of a lexicon updating method according to the present invention;

fig. 4 is a schematic structural diagram of a first lexicon updating apparatus according to an embodiment of the present invention;

fig. 5 is a schematic structural diagram of a second lexicon updating apparatus according to an embodiment of the present invention;

fig. 6 is a schematic structural diagram of a third embodiment of a lexicon updating apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and "a" and "an" generally include at least two, but do not exclude at least one, unless the context clearly dictates otherwise.

It should be understood that the term "and/or" as used herein is merely one type of association that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.

It should be understood that although the terms first, second, third, etc. may be used to describe XXX in embodiments of the present invention, these XXX should not be limited to these terms. These terms are only used to distinguish XXX from each other. For example, a first XXX may also be referred to as a second XXX, and similarly, a second XXX may also be referred to as a first XXX, without departing from the scope of embodiments of the present invention.

The words "if", as used herein, may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a good or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such good or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a commodity or system that includes the element.

Fig. 1 is a flowchart of a first embodiment of a method for updating a lexicon according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:

s101, a training sample set composed of a plurality of general recognition statement samples and a plurality of specific field statement samples is obtained from a preset general recognition corpus and a preset specific field corpus.

Alternatively, the generic corpus may be a large number of comprehensive classes of news text, such as types of text that may include politics, military, sports, art, and so on, with the text mostly in the unit of articles. The method can split each type of text of the common knowledge corpus by taking a single sentence as a unit, so as to obtain the common knowledge corpus containing a large number of common knowledge sentences.

In practical applications, a domain-specific corpus can be constructed from a large number of question sentences by collecting a domain-specific question set.

Alternatively, several sentences may be randomly selected from the general corpus and the domain-specific corpus as training samples to form a training sample set. The general sentence samples and the domain-specific sentence samples contained in the training sample set are generally equal in number.

S102, M words with the highest word frequency are obtained from a preset general recognition corpus.

Each sentence contained in the general corpus may be participled and the null words such as conjunctions, pronouns, etc. contained therein may be removed, or referred to as stop words. And (4) counting the word frequencies of the rest words, and selecting M words with the highest word frequency from the word frequencies, wherein M can be preset.

In the embodiment, the main purpose of obtaining the M words is to assist in determining the hotword belonging to a specific field by using the M words as comparison objects. Wherein, the hot words in the specific field refer to: frequently occurring in certain fields and easily overwhelmed by common words due to similar pronunciation as common or general words.

And S103, carrying out classification training on the classification model by adopting the training sample set to obtain a classification result corresponding to each training sample and a word set formed by words corresponding to each training sample.

And sequentially inputting each training sample in the training sample set into the classification model for classification training. On the output side of the classification model, the classification result corresponding to each input training sample can be output in sequence, and in this embodiment, the classification result is used to represent whether the corresponding input training sample belongs to a recognition statement or a specific field statement.

The operation process of the classification model is simply as follows: the method comprises the steps of performing word segmentation processing on an input training sample, determining a classification coefficient based on each word segmentation result, performing weighted summation on each word segmentation result based on the classification coefficient, and finally comparing a summation result with a certain classification threshold value to determine the classification of the input training sample. Therefore, in the process of performing classification training on each training sample, the word segmentation result for each training sample can also be obtained.

After all training samples are trained, word segmentation results corresponding to all training samples can be counted to obtain a word set corresponding to all training samples, namely the training sample set.

The classification model may be specifically implemented as a classifier, and various classifiers that can be provided in the prior art may be used, for example, a classifier constructed based on deep learning.

And S104, selecting X words with the maximum contribution weight from the word set according to the contribution weight of each word in the word set to the classification accuracy of the classification model.

After the classification model is trained by utilizing the training sample set, the accuracy of the classification model can be calculated. The accuracy can be determined by the ratio of the number of correctly classified training samples to the total number of training samples.

Furthermore, because the word set is composed of words corresponding to the training samples, the contribution weight of each word in the word set to the classification accuracy of the classification model can be further calculated, and the larger the contribution weight is, the larger the contribution of the word to the accuracy of the classification model is.

Alternatively, the contribution weight of each word in the word set may be obtained by calculating an information gain of each word, the greater the information gain, the higher the contribution weight of the word.

The information gain of any word T in the set of words may be expressed as: ig (T) ═ H (C) — H (C | T), where C represents a certain classification, H (C) is the information entropy of classification C, and H (C | T) includes two cases: one is that the word T appears in class C, denoted T, one is that the word T does not appear in class C, denoted T ', then H (C | T) ═ P (T) H (C | T) + P (T') H (C | T '), where P (T) is the probability that the word T appears in class C and P (T') is the probability that the word T does not appear in class C. In this embodiment, the category C may refer to a specific field.

Furthermore, the X words with the largest contribution weight are selected from the word set, and because the X words have the largest contribution degree to the classification accuracy of the classification model, the effect of the X words on the recognition result in the speech recognition scene of the specific field is important, and the hot words of the specific field are likely to be included in the X words.

S105, determining the pinyin similarity between each of the X words and the M words.

And S106, determining a hot word bank corresponding to the X words according to the comparison result of the pinyin similarity and the preset threshold.

In this embodiment, the manner of obtaining the hot words in the specific field from the X words is to screen out words with a pinyin similarity greater than a certain preset threshold from the X words based on the pinyin similarity between each of the X words and the M words, and form a hot word bank from the screened words. Where the predetermined threshold is a number less than 1, this means similar to but different from M words, which also corresponds to the meaning of a domain-specific hotword including words that are easily overwhelmed by common words.

The pinyin similarity between two words can be measured by the number of repeated letters in the pinyin of the two words.

And S107, adding the hot word library into the original recognition word library.

And adding the obtained hot word library into the original recognition word library so as to expand the hot words belonging to the specific field in the original recognition word library.

In this embodiment, a training sample set is formed by a plurality of general sentence samples and a plurality of specific field sentence samples to perform classification training on the classification model, and a word set formed by words respectively corresponding to each training sample is obtained on the output side of the classification model. And determining the contribution weight of each word in the word set to the classification accuracy of the classification model based on the classification result of each training sample so as to select the X words with the highest contribution degree to the classification accuracy of the classification model. Then, in order to avoid repeatedly adding the words existing in the original recognition word bank or the words which are particularly similar to the words existing in the original recognition word bank into the original recognition word bank, the hot words belonging to the specific field are selected from the X words based on the comparison result of the pinyin similarity and a preset threshold value by calculating the pinyin similarity between the X words and the M words with the highest word frequency in the recognition corpus respectively, so as to form a hot word bank, and the hot word bank is added into the original recognition word bank, so that the recognition effect of the words in the application scene of the specific field can be improved.

Fig. 2 is a flowchart of a second embodiment of a lexicon updating method according to an embodiment of the present invention, and as shown in fig. 2, the method may include the following steps:

s201, a training sample set composed of a plurality of general knowledge sentence samples and a plurality of specific field sentence samples is obtained from a preset general knowledge corpus and a preset specific field corpus, and M words with the highest word frequency are obtained from the preset general knowledge corpus.

The implementation process of the above steps can refer to the related description in the embodiment shown in fig. 1, and is not described herein again.

S202, generating N training sample subsets according to the training sample sets, wherein each training sample subset comprises a plurality of general sentence samples and a plurality of specific field sentence samples.

In this embodiment, the classification model may include N classification submodels, where the N classification submodels may be the same or different. For the N classification submodels, N training sample subsets need to be generated for training the N classification submodels, respectively. The N training sample subsets may be generated by randomly selecting several general sentence samples and domain-specific sentence samples in the training sample set as one training sample subset. It should be noted that the same sentence sample may be repeatedly appeared in a plurality of training sample subsets, that is, the training samples in the N training sample subsets may have a small overlap.

In this embodiment, by setting the N classification submodels, the processing efficiency in the training process can be improved.

And S203, carrying out classification training on the ith classification submodel in the N classification submodels by adopting the ith training sample subset in the N training sample subsets to obtain the classification result corresponding to each training sample in the ith training sample subset and the ith word set formed by words corresponding to each training sample, wherein i is 1,2, … and N.

The N training sample subsets correspond to the N classification submodels one to one, and the classification submodels are trained by using the training sample subsets corresponding to each classification submodel, and the training process is consistent with the description in the foregoing embodiment and is not repeated. Only, in this embodiment, the word sets respectively corresponding to each classification submodel are obtained after the training is completed.

S204, according to the contribution weight of each word in the ith word set to the classification accuracy of the ith classification submodel, selecting Y words with the largest contribution weight from the ith word set.

For the calculation process of the classification accuracy and the contribution weight, reference may be made to the description in the foregoing embodiments, and details are not repeated. The setting of Y may be, for example, a value greater than or equal to X/N.

And S205, selecting X words with the maximum contribution weight from the Y words according to the contribution weights corresponding to the Y words with the maximum contribution weight.

And summarizing Y words respectively selected from the N classification submodels, and selecting X words with the maximum contribution weight according to the contribution weights of the Y X N words.

In fact, repeated words may appear in Y × N words, and therefore, in this embodiment, the selection of X words is selected according to the cumulative contribution weight of Y × N words. Specifically, for any word i in the Y × N words, the cumulative formula of the contribution weight is as follows:

wherein, w_iIs the final contribution weight of the word i, wp_j,iFor the contribution weight of the word i in the jth classification submodel, A_jFor the accuracy of the jth classification submodel, the accuracy may be optionally represented by an F1 score.

S206, for any word Xi in the X words, screening K words with the word number consistent with that of the word Xi from the M words.

And aiming at any word Xi in the X words, selecting K words with the same word number as the word Xi from M words with the highest occurrence frequency of the knowledge corpus so as to compare the pinyin similarity between the two words. The words with the same number of words are selected to avoid the adverse effect of different numbers of words on the calculation of the pinyin similarity between the two words.

S207, pinyin transformation is respectively carried out on any word Ki in the words Xi and the K words.

And performing pinyin transformation on the word Xi and any word Ki in the K words with the same word number as the word Xi, wherein the specific pinyin transformation rule is as follows:

(1) degeneration of the consonant of the waned tongue: ch- > c ^, zh- > z ^ s, sh- > s ^ s, r- >;

(2)

variation of initial consonants:

ue->v+；

(3) [ e ] transforming: the pinyin contains pronunciation similar to English phonetic symbol [ e ], and a special transformation is carried out on the pronunciation, namely ue- > v +, yan, -ian is transformed into y + n, -i + n, but-iang is unchanged;

(4) tilting the tongue, i at the back of the flat tongue is converted into \, such as si- > s \;

(5) zero initial syllable conversion: wan- > ua, yi- > i;

(6) special changes wen- > un, but weng- > ueng do not.

Based on the pinyin transformation, the interference of letters with small influence on pronunciation similarity in words on the pinyin similarity can be reduced, and the accuracy of calculation results of the pinyin similarity of the words Xi and Ki is ensured.

S208, determining the pinyin similarity between the word Xi and the word Ki after pinyin transformation.

And (3) performing pinyin transformation on the word Xi and the word Ki, assuming that the word Xi is transformed into the word Yi after the pinyin transformation, the word Ki is transformed into the word Mi after the pinyin transformation, and comparing the pinyin similarity between the word Yi and the word Mi. The following describes the determination process of the pinyin similarity between two words in detail by way of example.

It is assumed that both words Xi and Ki are two-word words, which may be denoted AB and CD, respectively. Firstly, pinyin transformation is carried out on the word Xi and the word Ki to obtain a word Yi and a word Mi. And determining the pinyin similarity of the single character A in the term Yi and the single character C in the term Mi and the pinyin similarity of the single character B in the term Yi and the single character D in the term Mi respectively, and finally determining the overall pinyin similarity of the term Yi and the term Mi according to the pinyin similarity between the single characters.

Specifically, the pinyin of the word Xi and the pinyin of the word Ki are subjected to pinyin conversion according to a pinyin conversion rule, so that the pinyin after the conversion of the single character A and the single character B in the word Xi and the pinyin after the conversion of the single character C and the single character D in the word Ki are respectively obtained. The pinyin similarity between the single character a and the single character C, and the pinyin similarity between the single character B and the single character D can be determined by using the following formula (1).

LCS_Sim(PYstring_Yi,j,PYstring_Mi,j)＝ToneWeight*(SMLCS_Sim(SMstring_Yi,j,SMstring_Mi,j)+ (1)

YMLCS_Sim(YMstring_Yi,j,YMstring_Mi,j))/2

Wherein, the Longest Common Subsequence (LCS for short), ToneWieght is a tone weight, which specifically refers to the similarity of tones of two single words, and if the tones of the two single words are the same, ToneWieght is 1, and if the tones are different, ToneWieght is set to a larger value smaller than 1, which may be 0.8-0.98.

SMLCS_Sim(SMstring_Yi,j,SMstring_Mi,j) And YMLCS _ Sim (YMstring)_Yi,j,YMstring_Mi,j) The initial consonant similarity and the final sound similarity of the jth word Yi and the jth word Mi are respectively determined according to the following formulas (2) and (3):

wherein, length (LCS (SMstring)_Yi,j,SMstring_Mi,j) Length (SMstring) being the length of the longest common pinyin for the initial between the jth word Yi and the jth word Mi_Yi,j) And length (SMstring)_Mi,j) The lengths of the pinyin initial consonants of the jth character of the Yi word and the jth character of the Mi word are respectively; length (LCS (YMstring)_Yi,j,YMstring_Mi,j) Length (LCS) which is the length of the longest common pinyin of the initial consonants between the jth word Yi and the jth word Mi_Yi,j,YMstring_Mi,j) Is the jth word of the word YiAnd the length, length (YMstring) of the longest common pinyin of the vowels between the jth character of the word Mi_Yi,j) And length (YMstring)_Mi,j) The lengths of the Pinyin vowels of the jth character of the Yi word and the jth character of the Mi word are respectively.

The pinyin similarity of the single character A and the single character C and the pinyin similarity of the single character B and the single character D can be respectively determined through the processes, and the pinyin similarity of the words Yi and Mi can be determined through the following formula.

Wherein PYstring_Yi,jIs the pinyin of the jth word in the word Yi, PYString_Mi,jIs the pinyin of the jth word in the word Mi, n is the word number of the word Xi and the word Ki, LCS _ Sim (PYString)_Yi,j,PYString_Mi,j) The pinyin similarity between the Yi word and the J-th word of the Mi word is expressed, and the pinyin similarity can be obtained by calculation according to a formula (1).

S209, judging whether the pinyin similarity between the word Xi and the word Ki after pinyin conversion is greater than or equal to a preset threshold value, if so, executing a step S210, otherwise, executing a step S211

S210, determining the word Xi as a hot word in the specific field, and adding the word Xi into a hot word library.

S211, determining the word Xi as a candidate hot word, and adding the word Xi into a candidate word bank.

It is noted that the predetermined threshold is a value less than 1. Not equal to 1 means that the word Xi and the word Ki cannot be the same word, that is, a word that already exists in M words out of X words is not selected as a hotword.

S212, adding the hot word library into the original recognition word library.

In this embodiment, based on the training process on the training sample set, X words with the highest contribution degree to the classification accuracy are finally obtained. The higher the contribution, the more frequently the words appear, and the domain-specific hotwords are often included in the X words. Furthermore, in order to avoid the adverse effect of different word counts on the calculation of the pinyin similarity between two words, for any word Xi in the X words, K words with the same word count are selected from M words with the highest word frequency in the recognition corpus. And then, pinyin transformation is respectively carried out on the terms Xi and the K terms so as to reduce the interference of letters with less influence on pronunciation similarity in the terms on the pinyin similarity, and the accuracy of calculation results of the pinyin similarity of the terms Xi and the K terms is ensured, thereby ensuring the accuracy of the finally selected hot words in the specific field.

Fig. 3 is a flowchart of a third embodiment of a lexicon updating method according to the present invention. In practical application, the upper limit of the capacity of the hot word library can be set, and if the number of the hot words in the specific field in the hot word library obtained actually is larger than the upper limit of the capacity, part of the hot words are deleted from the hot word library; conversely, if the number of the domain-specific hot words in the hot word library actually obtained is less than the upper limit of the capacity, a part of the hot words is added to the hot word library. As shown in fig. 3, on the basis of the embodiment shown in fig. 2, after step S212, the method may include the following steps:

s301, judging whether the number of the hot words in the specific field in the hot word library is larger than a preset number, if so, executing a step S302; otherwise, step S303 is executed.

The preset number is the upper limit of the capacity corresponding to the hot word library.

S302, deleting the deficit quantity of the specific domain hot words from the hot word library according to the sequence of the contribution weights of the specific domain hot words in the hot word library from small to large.

The contribution weight of each specific field hot word is calculated in the process of selecting the specific field hot words to be added into the hot word library, at the moment, the contribution weights can be directly sorted from small to large, and the specific field hot words with the smallest difference quantity of the contribution weights are deleted, wherein the difference quantity is the difference between the specific field hot word quantity in the current hot word library and the upper limit quantity of the words in the hot word library.

Optionally, after the contribution weights are ranked, the hot words in the specific field are tested according to the contribution weights from small to large, that is, the original recognition word stock is used, and the hot words in the specific field are sequentially recognized according to the sequence of the contribution weights from small to large. For a specific domain hot word, if the specific domain hot word can be identified by using the original identification word bank, deleting the specific domain hot word from the hot word bank, that is, sequentially deleting the specific domain hot words which can be identified by the original identification word bank in the hot word bank according to the sequence of the contribution weights from small to large until the number of the deleted specific domain hot words is equal to the difference number.

It should be noted that, if all the domain-specific hotwords in the hotword library have been traversed in the order of the contribution weights from small to large, the number of deletions is still smaller than the amount of the deficit. At this time, the remaining domain-specific hot words may be deleted directly in the order of the contribution weights from small to large until the number of domain-specific hot words in the hot word library satisfies the preset number.

And S303, selecting the deficit number of candidate hot words from the candidate word library according to the sequence of the contribution weights of the candidate hot words in the candidate word library from large to small, and adding the selected candidate hot words into the hot word library.

If the number of the hot words in the specific field in the hot word library is smaller than the preset number, sorting the contribution weights of the candidate hot words in the candidate word library from large to small according to the weights, selecting the candidate hot words with the highest contribution weight of the difference number from the candidate word library, and adding the candidate hot words into the hot word library, wherein the difference number is the difference between the preset number and the number of the hot words in the specific field.

Optionally, after the contribution weights are ranked, the candidate hot words in the candidate word library are tested according to the contribution weights from large to small, that is, the original recognition word library is used to sequentially recognize each candidate hot word according to the sequence of the contribution weights from large to small. For a hot word in a specific field, if the candidate hot word cannot be identified by using the original identification word bank, adding the candidate hot word into the hot word bank, namely sequentially adding the candidate hot words which cannot be identified by the original identification word bank into the hot word bank from large to small according to contribution weight until the number of the hot words in the specific field in the hot word bank meets a preset number.

It should be noted that, if the number of the added hot words in the candidate lexicon is still smaller than the deficit number after all the candidate hot words in the candidate lexicon are tested according to the contribution weights with the maximum contribution weights, at this time, the remaining candidate hot words can be sequentially added into the hot lexicon according to the descending contribution weights until the number of the hot words in the specific field in the hot lexicon meets the preset number.

In this embodiment, when a specific domain hot word deletion or addition process needs to be performed on an actually obtained hot word library based on a preset upper limit of the capacity of the hot word library, the deletion or addition process may be performed in combination with the contribution weight and whether the original recognized word library can recognize a word to be deleted or added, so that the reliability of the deletion or addition process may be ensured, and a non-specific domain hot word may be deleted from the hot word library or a word more likely to be a specific domain hot word may be added to the hot word library.

Fig. 4 is a schematic structural diagram of a first embodiment of a word stock updating device according to an embodiment of the present invention, and as shown in fig. 4, the word stock updating device includes: the device comprises an acquisition module 11, a training module 12, a selection module 13, a determination module 14 and an updating module 15.

The obtaining module 11 is configured to obtain a training sample set composed of a plurality of recognition sentence samples and a plurality of domain-specific sentence samples from a preset recognition corpus and a preset domain-specific corpus, and obtain M words with the highest word frequency from the preset recognition corpus.

And the training module 12 is configured to perform classification training on the classification model by using a training sample set.

The obtaining module 11 is further configured to obtain a classification result corresponding to each training sample and a word set formed by words corresponding to each training sample.

And the selecting module 13 is configured to select, according to the contribution weight of each word in the word set to the classification accuracy of the classification model, X words with the largest contribution weight from the word set.

The determining module 14 is configured to determine pinyin similarity between each of the X words and the M words, and determine a hot lexicon corresponding to the X words according to a comparison result between the pinyin similarity and a preset threshold.

And the updating module 15 is used for adding the hot word library into the original recognition word library.

The apparatus shown in fig. 4 can perform the method of the embodiment shown in fig. 1, and reference may be made to the related description of the embodiment shown in fig. 1 for a part of this embodiment that is not described in detail. The implementation process and technical effect of the technical solution refer to the description in the embodiment shown in fig. 1, and are not described herein again.

Fig. 5 is a schematic structural diagram of a second embodiment of the word stock updating apparatus according to the embodiment of the present invention, and as shown in fig. 5, on the basis of the embodiment shown in fig. 4, the training module 12 may specifically include: a generating unit 121 and an obtaining unit 122.

The generating unit 121 is configured to generate N training sample subsets according to a training sample set, where each training sample subset includes a plurality of generic sentence samples and a plurality of domain-specific sentence samples.

The obtaining unit 122 is configured to perform classification training on an ith classification sub-model in the N classification sub-models by using an ith training sample sub-set in the N training sample sub-sets, to obtain a classification result corresponding to each training sample in the ith training sample sub-set and an ith word set formed by words corresponding to each training sample, where i is 1,2, …, and N.

Accordingly, the selection module 13 is further configured to: and selecting Y words with the maximum contribution weight from the ith word set according to the contribution weight of each word in the ith word set to the classification accuracy of the ith classification submodel, and selecting X words with the maximum contribution weight from the Y words according to the respective corresponding contribution weights of the selected Y words.

Optionally, the determining module 14 may include: a screening unit 141, a pinyin conversion unit 142 and a determination unit 143.

And the screening unit 141 is configured to screen, for any word Xi of the X words, K words of which the word number is consistent with that of the word Xi from the M words.

And a pinyin transformation unit 142, configured to perform pinyin transformation on any one of the terms Xi and K terms respectively.

The determining unit 143 is configured to determine the pinyin similarity between the word Xi and the word Ki after the pinyin transformation.

Optionally, the determining unit 143 is specifically configured to:

determining the pinyin similarity PY _ Sim (wordYi, wordMi) between the word Xi and the word Ki after pinyin transformation according to the following formula:

wherein Yi is a word obtained by transforming the word Xi through the pinyin transformation, Mi is a word obtained by transforming the word Ki through the pinyin transformation, and PYstring_Yi,jPinyin, PYstring, representing the jth word in the word Yi_Mi,jThe Pinyin representing the jth word in the word Mi, LCS _ Sim (PYstring)_Yi,j,PYstring_Mi,j) Representing the pinyin similarity between the jth word in the word Yi and the jth word in the word Mi, n being the word length of the word Yi and the word Mi,

wherein LCS _ Sim (PYstring)_Yi,j,PYstring_Mi,j) Determined according to the following formula:

LCS_Sim(PYstring_Yi,j,PYstring_Mi,j)＝ToneWeight*(SMLCS_Sim(PYstring_Yi,j,PYstring_Mi,j)+

YMLCS_Sim(PYstring_Yi,j,PYstring_Mi,j))/2

ToneWeight is the tone weight, SMLCS _ Sim (SMstring)_Yi,j,SMstring_Mi,j) And YMLCS _ Sim (YMstring)_Yi,j,YMstring_Mi,j) The initial similarity and the final similarity of the jth character of the word Yi and the jth character of the word Mi respectively, wherein,

wherein, length (LCS (SMstring)_Yi,j,SMstring_Mi,j) Length (SMstring) is the length of the common pinyin for the initial between the jth word Yi and the jth word Mi_Yi,j) And length (SMstring)_Mi,j) The lengths of the pinyin initial consonants of the jth character of the Yi word and the jth character of the Mi word respectively,

length(LCS(YMstring_Yi,j,YMstring_Mi,j) Length (YMstring) which is the length of the public pinyin of the vowels between the jth word Yi and the jth word Mi_Yi,j) And length (YMstring)_Mi,j) The lengths of the Pinyin vowels of the jth character of the Yi word and the jth character of the Mi word are respectively.

Accordingly, the determining module 14 is specifically configured to: if the pinyin similarity between the word Xi and the word Ki after pinyin conversion is larger than or equal to a preset threshold value, determining that the word Xi is a hot word in a specific field, and adding the word Xi into a hot word library; and if the pinyin similarity between the word Xi and the word Ki after pinyin transformation is smaller than a preset threshold, determining the word Xi as a candidate hot word, and adding the word Xi into a candidate word bank.

The apparatus shown in fig. 5 can perform the method of the embodiment shown in fig. 2, and reference may be made to the related description of the embodiment shown in fig. 2 for a part of this embodiment that is not described in detail. The implementation process and technical effect of the technical solution refer to the description in the embodiment shown in fig. 2, and are not described herein again.

Fig. 6 is a schematic structural diagram of a third embodiment of a word stock updating device according to an embodiment of the present invention, as shown in fig. 6, based on the embodiment shown in fig. 5, the word stock updating device further includes: a deleting module 21 and an adding module 22.

The deleting module 21 is configured to delete specific-field hotwords with a deficit number from the hotword library according to a sequence that contribution weights of the specific-field hotwords in the hotword library are from small to large if the number of the specific-field hotwords in the hotword library is greater than a preset number, where the deficit number is a difference between the number of the specific-field hotwords and the preset number; the deleted domain-specific hotwords are domain-specific hotwords that can be recognized by the original recognized lexicon.

The adding module 22 is configured to, if the number of the specific domain hot words in the hot word library is smaller than the preset number, select a deficit amount of candidate hot words from the candidate word library according to a descending order of the contribution weights of the candidate hot words in the candidate word library, add the selected candidate hot words to the hot word library, where the deficit amount is a difference between the preset number and the number of the specific domain hot words, and the selected candidate hot words are candidate hot words that cannot be recognized by the original recognized word library.

The apparatus shown in fig. 6 can perform the method of the embodiment shown in fig. 3, and reference may be made to the related description of the embodiment shown in fig. 3 for a part of this embodiment that is not described in detail. The implementation process and technical effect of the technical solution refer to the description in the embodiment shown in fig. 3, and are not described herein again.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by adding a necessary general hardware platform, and certainly, the embodiments can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods according to the various embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A word stock updating method is characterized by comprising the following steps:

determining pinyin similarity between each of the X words and the M words;

and adding the hot word library into the original recognition word library.

2. The method of claim 1, wherein the classification model comprises N classification submodels, and the performing classification training on the classification model using the training sample set comprises:

generating N training sample subsets according to the training sample set, wherein each training sample subset comprises a plurality of general sentence samples and a plurality of specific field sentence samples;

classifying and training an ith classification sub-model in the N classification sub-models by adopting an ith training sample sub-set in the N training sample sub-sets to obtain a classification result corresponding to each training sample in the ith training sample sub-set and an ith word set formed by words corresponding to each training sample, wherein i is 1,2, … and N;

the selecting, according to the contribution weight of each word in the word set to the classification accuracy of the classification model, X words with the largest contribution weight from the word set includes:

according to the contribution weight of each word in the ith word set to the classification accuracy of the ith classification submodel, selecting Y words with the largest contribution weight from the ith word set;

and selecting X words with the maximum contribution weight from the Y words according to the contribution weights corresponding to the Y words.

3. The method of claim 1, wherein the determining the pinyin similarity between each of the X terms and the M terms, respectively, comprises:

for any word Xi in the X words, screening K words with the word number consistent with that of the word Xi from the M words;

pinyin transformation is respectively carried out on the word Xi and any word Ki in the K words;

and determining the pinyin similarity between the word Xi and the word Ki after the pinyin transformation.

4. The method of claim 3, wherein determining the pinyin similarity between the pinyin-transformed terms Xi and Ki comprises:

LCS_Sim(PYstring_Yi,j,PYstring_Mi,j)＝ToneWeight*(SMLCS_Sim(SMstring_Yi,j,SMstring_Mi,j)+YMLCS_Sim(YMstring_Yi,j,YMstring_Mi,j))/2

length(LCS(YMstring_Yi,j,YMstring_Mi,j) Common pinyin for vowels between the jth word Yi and the jth word Mi)Length (YMstring)_Yi,j) And length (YMstring)_Mi,j) The lengths of the Pinyin vowels of the jth character of the Yi word and the jth character of the Mi word are respectively.

5. The method of claim 3, wherein determining the hot word bank corresponding to the X words according to the comparison result of the pinyin similarity with a preset threshold comprises:

and if the pinyin similarity between the word Xi and the word Ki after the pinyin conversion is greater than or equal to a preset threshold value, determining that the word Xi is a hot word in a specific field, and adding the word Xi into a hot word library.

6. The method of claim 5, further comprising:

if the number of the specific domain hot words in the hot word bank is larger than the preset number, deleting the deficit number of the specific domain hot words from the hot word bank according to the sequence of the contribution weights of the specific domain hot words in the hot word bank from small to large;

the difference amount is the difference value between the specific field hot word amount and the preset amount; the deleted domain-specific hot words are domain-specific hot words which can be recognized by the original recognized word stock.

7. The method of claim 5, further comprising:

and if the pinyin similarity between the word Xi and the word Ki after the pinyin transformation is smaller than a preset threshold value, determining the word Xi as a candidate hot word, and adding the word Xi into a candidate word bank.

8. The method of claim 7, further comprising:

if the number of the hot words in the specific field in the hot word bank is smaller than the preset number, selecting the deficit number of the candidate hot words from the candidate word bank according to the sequence of the contribution weights of the candidate hot words in the candidate word bank from large to small, and adding the selected candidate hot words into the hot word bank;

the difference amount is the difference between the preset amount and the specific field hot word amount, and the selected candidate hot words are candidate hot words which cannot be identified by the original identification word bank.

9. A thesaurus updating apparatus, comprising:

10. The apparatus of claim 9, wherein the classification model includes N classification submodels, and the training module specifically includes:

the generating unit is used for generating N training sample subsets according to the training sample set, wherein each training sample subset comprises a plurality of general sentence samples and a plurality of specific field sentence samples;

an obtaining unit, configured to perform classification training on an ith classification sub-model in the N classification sub-models by using an ith training sample sub-set in the N training sample sub-sets, to obtain a classification result corresponding to each training sample in the ith training sample sub-set and an ith word set formed by words corresponding to each training sample, where i is 1,2, …, N;

the selection module is further configured to: and selecting Y words with the maximum contribution weight from the ith word set according to the contribution weight of each word in the ith word set to the classification accuracy of the ith classification submodel, and selecting X words with the maximum contribution weight from the Y words according to the respective corresponding contribution weights of the selected Y words.