CN112185346B - Multilingual voice keyword detection and model generation method and electronic equipment - Google Patents

Multilingual voice keyword detection and model generation method and electronic equipment

Info

Publication number
CN112185346B
CN112185346B CN202011026187.4A CN202011026187A
Authority
CN
China
Prior art keywords
keyword
audio
phoneme
phonemes
target language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011026187.4A
Other languages
Chinese (zh)
Other versions
CN112185346A (en)
Inventor
左祥
江之源
姚宇行
刘译璟
苏萌
高体伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Percent Technology Group Co ltd
Original Assignee
Beijing Percent Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Percent Technology Group Co ltd filed Critical Beijing Percent Technology Group Co ltd
Priority to CN202011026187.4A priority Critical patent/CN112185346B/en
Publication of CN112185346A publication Critical patent/CN112185346A/en
Application granted granted Critical
Publication of CN112185346B publication Critical patent/CN112185346B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems

Abstract

The application discloses a multilingual voice keyword detection method, a keyword model generation method, an electronic device, and a computer-readable storage medium. The keyword model generation method comprises the following steps: acquiring a plurality of keyword texts corresponding to different languages; respectively converting the keyword texts into phoneme sequences of their own languages; converting those phoneme sequences into phoneme sequences of a target language, based on the mapping relation between phonemes of the different languages and phonemes of the target language; and generating keyword models corresponding to the keyword texts according to the converted phoneme sequences of the target language. The method and the device can improve the detection efficiency of multilingual voice keywords.

Description

Multilingual voice keyword detection and model generation method and electronic equipment
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a multilingual speech keyword detection method, a keyword model generation method, an electronic device, and a computer-readable storage medium.
Background
One difficulty in the task of detecting multilingual speech keywords is that multiple languages may be mixed, for example within a single sentence. Language identification often requires audio of sufficient duration, such as 5-10 seconds, while a foreign-language fragment inside a sentence may be much shorter. For example, certain English names may last less than 1 second, which language identification cannot handle. In addition, the existing technical solution needs to train a language identification model and a keyword model in advance. Both models require a large amount of training audio to be prepared; the more languages are involved, the higher the cost of acquiring training audio, and for minority languages in particular, training data is very difficult to obtain.
How to improve the detection efficiency of the multilingual voice keywords is a technical problem to be solved by the application.
Disclosure of Invention
An object of the embodiments of the present application is to provide a multilingual speech keyword detection method, a keyword model generation method, an electronic device, and a computer-readable storage medium, so as to solve the problem of low detection efficiency of multilingual speech keywords.
In order to solve the above technical problem, the present specification is implemented as follows:
in a first aspect, a keyword model generation method is provided, including: acquiring a plurality of keyword texts corresponding to different languages; respectively converting the plurality of keyword texts corresponding to different languages into phoneme sequences corresponding to the languages; converting a phoneme sequence corresponding to the language into a phoneme sequence of the target language based on a mapping relation between phonemes of different languages and phonemes of the target language; and generating a keyword model corresponding to the plurality of keyword texts corresponding to different languages according to the converted phoneme sequence of the target language.
Optionally, the same phoneme in the phoneme sequence corresponding to the language corresponds to at least one phoneme in the phoneme sequence of the target language.
Optionally, when the same phoneme corresponds to a plurality of different phonemes in the phoneme sequence of the target language, the plurality of different phonemes respectively have corresponding weights, and the weights represent probabilities that the same phoneme is represented as each of the plurality of different phonemes.
Optionally, generating, according to the converted phoneme sequence of the target language, a keyword model corresponding to the plurality of keyword texts corresponding to different languages, including:
and generating a plurality of different keyword models corresponding to the same phoneme respectively according to the plurality of different phonemes in the target language so as to generate a plurality of keyword models corresponding to the keyword texts corresponding to different languages.
In a second aspect, a method for detecting a multilingual voice keyword is provided, including: receiving a voice to be detected; carrying out segmentation processing on the voice to be detected to obtain a plurality of audio segments; converting each audio segment into a corresponding audio feature; inputting the audio features into the keyword model according to the first aspect to calculate the keyword probabilities of the corresponding audio segments; and detecting the keywords in the audio segments according to the keyword probabilities.
Optionally, converting each audio segment into a corresponding audio feature includes: determining a number of audio frames of the audio clip; converting each audio frame of the audio segment into a set of audio feature values of corresponding dimensions; and determining an audio feature matrix corresponding to the audio segments according to the number of the audio frames and the dimension.
Optionally, inputting the audio features into the keyword model for calculation to obtain a keyword probability of a corresponding audio segment, including: forming a network comprising a plurality of node states based on phonemes of the target language keyword corresponding to the keyword model and the audio frames of the audio segments, wherein each phoneme and a corresponding audio frame form a node state; taking the audio features corresponding to each audio frame as parameters, and calculating the posterior probability of each node state by using a predetermined algorithm; and determining the maximum value of the posterior probabilities in the plurality of node states to serve as the keyword probability of the audio clip.
Optionally, when the phoneme of the target language keyword corresponding to the keyword model includes a plurality of different phonemes corresponding to the same phoneme of the language corresponding to the speech to be detected, calculating the posterior probability of each node state by using a predetermined algorithm, further including: and respectively carrying out weighted calculation on the posterior probabilities of the node states corresponding to the phonemes according to the weights of the different phonemes corresponding to the same phoneme so as to obtain the posterior probabilities of the node states.
In a third aspect, an electronic device is provided, comprising a memory and a processor electrically connected to the memory, the memory storing a computer program executable by the processor, the computer program, when executed by the processor, implementing the steps of the method according to the first or second aspect.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method according to the first or second aspect.
In the embodiments of the present application, a plurality of keyword texts corresponding to different languages are obtained; the keyword texts are respectively converted into phoneme sequences of their own languages; based on the mapping relation between phonemes of the different languages and phonemes of a target language, those phoneme sequences are converted into phoneme sequences of the target language; and keyword models corresponding to the keyword texts are generated from the converted phoneme sequences of the target language. In this way, keywords of multilingual speech can be accurately recognized, the training audio required for keyword model generation is reduced, and the efficiency of multilingual speech keyword detection is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flowchart of a keyword model generation method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a keyword model generation framework according to an embodiment of the present application.
FIG. 3 is a diagram illustrating an example of a keyword model according to an embodiment of the present application.
FIG. 4 is a flowchart illustrating a multilingual speech keyword detection method according to an embodiment of the present application.
Fig. 5 is a schematic diagram of keyword probability calculation according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. The reference numbers in the present application are only used for distinguishing the steps in the scheme, and are not used for limiting the execution sequence of the steps, and the specific execution sequence is subject to the description in the specification.
In order to solve the problems in the prior art, an embodiment of the present application provides a keyword model generation method, and fig. 1 is a schematic flow diagram of the keyword model generation method in the embodiment of the present application.
As shown in fig. 1, the method comprises the steps of:
step 102: acquiring a plurality of keyword texts corresponding to different languages;
step 104: respectively converting the keyword texts of the different languages into phoneme sequences of the respective languages;
step 106: converting the phoneme sequence corresponding to the language into a phoneme sequence of the target language based on the mapping relation between phonemes of different languages and phonemes of the target language;
step 108: and generating a keyword model corresponding to the plurality of keyword texts corresponding to different languages according to the converted phoneme sequence of the target language.
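To make the flow concrete, here is a minimal Python sketch of steps 102-108. All names, dictionary entries, and phoneme inventories are illustrative assumptions standing in for the multilingual pronunciation dictionary 204 and phoneme mapping list 206 described below, not data from the patent.

```python
# Toy stand-ins for the pronunciation dictionary and phoneme mapping list.
# "kazakh_keyword" stands for the Kazakh text, which is rendered as an
# image later in this description; its phoneme sequence is "avtorgha".
pronunciation_dict = {                      # step 104: text -> phonemes
    "attach": ["ax", "t", "ae", "ch"],      # assumed English phonemes
    "kazakh_keyword": ["a", "v", "t", "o", "r", "gh", "a"],
}

# Step 106: source phoneme -> weighted target-language (Chinese) phonemes.
phoneme_map = {
    "v": [("b", 0.7), ("p", 0.3)],          # weights from the Table 1 example
}

def to_target_sequence(text):
    """Convert a keyword text into a weighted target-language phoneme
    sequence; unmapped phonemes fall back to an identity mapping."""
    return [phoneme_map.get(p, [(p, 1.0)]) for p in pronunciation_dict[text]]

# Step 108: one keyword "model" per text; here a model is simply the
# weighted phoneme sequence an HMM keyword model would be built on.
keyword_models = {t: to_target_sequence(t) for t in pronunciation_dict}
```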
The keyword texts contain the words of the keywords to be recognized, from which the keyword models are generated; the keyword texts may be in a plurality of languages, such as Russian, English, Arabic, and the like. For example, an English keyword text may be "attach".
In step 104, each keyword text is converted into a phoneme sequence of its own language. For example, as shown in step 104 of fig. 2, the phonemes of the letters or characters in keywords of different languages can be looked up in the multilingual pronunciation dictionary 204.
For example, the Kazakh keyword [rendered as an image in the original] is converted into the corresponding Kazakh phoneme sequence "avtorgha", and the Chinese keyword "袭击" (assault) is converted into the corresponding Chinese phoneme sequence "xiji". A multilingual pronunciation dictionary prepared in advance is used in this phoneme conversion step.
In step 106, the phoneme sequences of the different languages are further converted into phonemes of a common target language. The target language is usually chosen from widely used languages, such as Chinese or English, which have a wide application range and a high frequency of use. For example, if Chinese is selected as the universal target language, the keyword phonemes of all languages are converted into Chinese phonemes in step 106. This step requires a phoneme mapping list, which contains the mapping relationships between the phonemes of every language in the task and the phonemes of the target language. As shown in fig. 2, the phoneme mapping list 206 is used to implement the phoneme conversion of step 106, converting the phoneme sequences of the different languages into phoneme sequences of the target language.
This mapping may be one-to-many because of pronunciation differences between languages. For example, the same English phoneme may correspond to multiple Chinese phonemes in different pronunciation contexts. Optionally, each phoneme in the phoneme sequence of a source language corresponds to at least one phoneme in the phoneme sequence of the target language. When the same phoneme corresponds to a plurality of different phonemes of the target language, each of those phonemes has a corresponding weight, the weight indicating the probability that the source phoneme is realized as that target phoneme.
An example of the phoneme mapping list is shown in Table 1, which gives the phoneme mapping relationship between Kazakh and Chinese.
For example, the Kazakh phonemes "v" and "gh" can each be represented by several Chinese phonemes because of differences between Kazakh and Chinese pronunciation. As shown in Table 1, the Kazakh phoneme "v" can be represented by the Chinese phonemes "b" and "p", and the Kazakh phoneme "gh" can be represented by the Chinese phonemes "f", "h", "i", and "a". When a phoneme of another language corresponds to a plurality of different target phonemes, each target phoneme has a corresponding weight, and the weights sum to 1. A weight represents the probability that the source phoneme is realized as the corresponding Chinese phoneme; the higher the weight, the more likely that realization. For example, the probability that the Kazakh phoneme "v" is represented by the Chinese phoneme "b" is 0.7, and the probability that it is represented by the Chinese phoneme "p" is 0.3; that is, "v" is more likely to be represented by "b". The weights can be obtained empirically or by statistical methods.
TABLE 1
[Table 1 is rendered as images in the original. Per the surrounding text, it maps Kazakh phonemes to weighted Chinese phonemes, e.g. "v" → "b" (0.7) / "p" (0.3), and "gh" → "f" / "h" / "i" / "a".]
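In code, the weighted one-to-many entries described above might be stored as follows. This is a sketch: only the 0.7/0.3 split for "v" is stated in the text, so the weights for "gh" are assumptions for illustration.

```python
# Kazakh -> Chinese entries of Table 1 as described in the text.
kazakh_to_chinese = {
    "v":  [("b", 0.70), ("p", 0.30)],                            # stated
    "gh": [("f", 0.25), ("h", 0.25), ("i", 0.25), ("a", 0.25)],  # assumed
}

# The weights of the alternatives for each source phoneme sum to 1.
for src, alternatives in kazakh_to_chinese.items():
    assert abs(sum(w for _, w in alternatives) - 1.0) < 1e-9, src
```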
In step 108, a corresponding keyword model is generated from the converted target-language phoneme sequence. Each keyword text yields a target-language phoneme sequence, from which a corresponding target-language keyword model is generated. The plurality of keyword texts thus generate a plurality of keyword models in the target language, which can be combined into a keyword model network for subsequent multilingual speech keyword detection.
Optionally, generating, according to the converted phoneme sequence of the target language, a keyword model corresponding to the plurality of keyword texts corresponding to different languages, including: and generating a plurality of different keyword models corresponding to the same phoneme respectively according to the different phonemes in the target language so as to generate a plurality of keyword models corresponding to the keyword texts corresponding to different languages.
According to the keyword model generation method, keywords of different languages are converted into phoneme sequences of a common target language, and the target keyword models are then constructed from these phoneme sequences. Because a unified, universal keyword model can be generated for any set of languages, the training audio required for keyword model generation is reduced and the audio training cost is lowered.
FIG. 3 shows an example of a keyword model. The Kazakh keyword text corresponding to the model shown in FIG. 3 [rendered as an image in the original] has the Kazakh phoneme sequence "avtorgha"; the model is generated after converting this sequence into the target-language Chinese phoneme sequence according to the mapping relationships and weights in Table 1. As can be seen from fig. 3, several phonemes have multiple correspondences.
Generating the keyword model requires a general acoustic model 208 of the target language that has been trained in advance; for example, if Chinese is the target language, a Chinese acoustic model needs to be trained. In one embodiment, the generic acoustic model 208 may be based on an HMM-DNN (hidden Markov model / deep neural network) architecture, or may be a traditional acoustic model based on an HMM-LSTM (hidden Markov model / long short-term memory network) architecture. Because the phonemes of the other languages have been replaced with Chinese phonemes, the generic acoustic model 208 can score them according to the corresponding Chinese pronunciations.
After the keyword model is generated, and again to solve the problems in the prior art, an embodiment of the present application provides a multilingual speech keyword detection method, as shown in fig. 4, comprising the following steps:
step 302: receiving a voice to be detected;
step 304: carrying out segmentation processing on the voice to be detected to obtain a plurality of audio segments;
step 306: converting each audio clip into a corresponding audio feature;
step 308: inputting the audio features into a keyword model for calculation to obtain keyword probability of the corresponding audio clip;
step 310: and detecting the keywords in the audio clip according to the keyword probability.
In step 302, the received audio to be detected may be a multi-language mixed audio, which is used as the input speech to be detected.
In step 304, after segmentation processing and endpoint detection, the silent parts of the audio are removed and not processed, and fixed-length audio segments are output. Fixing the unit of the audio segments to be decoded facilitates the subsequent detection processing. A fixed-length audio segment is, for example, 500 milliseconds, and one audio segment comprises a plurality of audio frames.
After step 304, signal pre-processing may be performed on the audio segments. Common signal processing includes noise suppression, echo cancellation, dereverberation, speech enhancement, and the like; the processed audio is cleaner, which benefits the subsequent detection processing.
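A minimal sketch of the fixed-length chunking in step 304 follows (the function name is hypothetical; endpoint detection and silence removal are assumed to have been applied already):

```python
import numpy as np

def segment_audio(samples: np.ndarray, sample_rate: int,
                  segment_ms: int = 500) -> list:
    """Cut a silence-removed waveform into fixed-length segments,
    e.g. 500 ms each; a trailing partial segment is dropped here."""
    seg_len = int(sample_rate * segment_ms / 1000)
    n_full = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]
```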
In step 306, the processed audio segment is passed through a feature extraction module that transforms the audio signal into audio features. Optionally, converting each audio segment into corresponding audio features includes: determining the number of audio frames of the audio segment; converting each audio frame of the audio segment into a set of audio feature values of a corresponding dimension; and determining the audio feature matrix of the audio segment from the number of audio frames and the dimension.
Taking a 500 ms audio segment as an example, if one audio frame is 50 ms, then one audio segment includes 10 frames. The application supports different kinds of common audio features, such as MFCC, FBank, and PLP. If the segment is converted into MFCC features, each frame corresponds to a set of 13-dimensional values, so the feature matrix corresponding to the audio segment is a 10 × 13 two-dimensional matrix.
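As a sketch of this step, the following uses librosa (an assumed choice; any MFCC extractor would do) to turn one 500 ms segment into a frames × 13 matrix. With non-overlapping 50 ms frames at 16 kHz this yields exactly the 10 × 13 matrix of the example.

```python
import numpy as np
import librosa  # assumed dependency for MFCC extraction

def segment_to_mfcc(segment: np.ndarray, sr: int = 16000) -> np.ndarray:
    """One audio segment -> (num_frames x 13) MFCC feature matrix."""
    hop = int(sr * 0.05)  # 50 ms per frame, as in the example
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13,
                                n_fft=hop, hop_length=hop, center=False)
    return mfcc.T  # e.g. a 500 ms segment gives a 10 x 13 matrix
```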
In step 308, the audio features are input, in matrix form, into the keyword model generated by the keyword model generation method of the embodiments above, to calculate the keyword probability of each audio segment. The keyword probability may be computed from the audio features of each audio segment by a predetermined algorithm, for example the conventional Viterbi algorithm.
Optionally, inputting the audio features into the keyword model for calculation to obtain a keyword probability of a corresponding audio segment, including: forming a network comprising a plurality of node states based on phonemes of the target language keyword corresponding to the keyword model and the audio frames of the audio segments, wherein each phoneme and a corresponding audio frame form a node state; taking the audio features corresponding to each audio frame as parameters, and calculating the posterior probability of each node state by using a predetermined algorithm; and determining the maximum value of the posterior probabilities in the plurality of node states as the keyword probability of the audio segment.
The following description uses the example of fig. 5. For simplicity, the keyword model in this example corresponds to only one keyword, "attach", i.e., a target-language keyword model generated from the English keyword text "attach". The keyword has 6 phonemes, and the input multilingual audio segment to be detected has 11 audio frames.
The 6 phonemes and the 11 audio frames form a 6 × 11 two-dimensional grid, and each phoneme paired with a corresponding audio frame forms a Hidden Markov Model (HMM) node state. In the Viterbi algorithm, the node states are first spread out along the time axis to form a state network. Then, taking the audio features of the audio frame corresponding to each node state as the observation, the Viterbi algorithm computes the posterior probability P(S|O) of each node state in the state network. The calculation of P(S|O) follows the standard HMM probability formulas and is not repeated here. Finally, backtracking is performed from the final node state (the node state at the lower right corner of fig. 5) to find the path with the maximum posterior probability; the posterior probability of that path is the keyword probability detected for the audio segment.
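A minimal NumPy sketch of this Viterbi pass over the phoneme × frame grid follows. The per-state emission scores are assumed to come from the generic acoustic model 208, and transition probabilities are treated as uniform for brevity, so this illustrates the search rather than the full P(S|O) computation.

```python
import numpy as np

def keyword_log_probability(emission_logp: np.ndarray) -> float:
    """Best-path score over a (num_phonemes x num_frames) grid.

    emission_logp[s, t] is log P(frame t | phoneme s). The HMM is
    left-to-right: at each frame the path either stays in the current
    phoneme or advances to the next one. Returns the score of the best
    path ending at the last phoneme in the last frame (the lower-right
    node state of fig. 5)."""
    n_ph, n_fr = emission_logp.shape        # e.g. 6 phonemes x 11 frames
    score = np.full((n_ph, n_fr), -np.inf)
    score[0, 0] = emission_logp[0, 0]
    for t in range(1, n_fr):
        for s in range(n_ph):
            stay = score[s, t - 1]
            advance = score[s - 1, t - 1] if s > 0 else -np.inf
            score[s, t] = emission_logp[s, t] + max(stay, advance)
    return float(score[-1, -1])
```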
As described above, the calculating of the posterior probability of each node state by using the predetermined algorithm when the phoneme of the target language keyword corresponding to the keyword model includes a plurality of different phonemes corresponding to the same phoneme of the language corresponding to the speech to be detected may further include: and respectively carrying out weighted calculation on the posterior probabilities of the node states corresponding to the phonemes according to the weights of the different phonemes corresponding to the same phoneme so as to obtain the posterior probabilities of the node states.
For example, FIG. 3 shows the keyword model generated from the Chinese phoneme sequence corresponding to the Kazakh keyword [rendered as an image in the original]. The phonemes "b"/"p" and "f"/"h"/"i"/"a" have multiple correspondences. Thus, when calculating the keyword probabilities in step 308, the scores of the affected node states must be weighted, using the weights from the phoneme mapping list. Expanding all combinations of the correspondences yields 8 keyword model variants; the posterior probability of each node state under each variant is multiplied by the variant's weight, and the 8 weighted results are summed and normalized to obtain the final posterior probability of the node state.
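A sketch of this expansion and weighting follows. The structure is an assumption: the patent weights per node state, which this simplifies to one weight per expanded keyword-model variant (the product of its per-phoneme weights).

```python
from itertools import product

def expand_variants(weighted_seq):
    """Expand a weighted phoneme sequence into concrete variants.
    For fig. 3, 2 alternatives for "v" times 4 for "gh" give the
    8 keyword-model variants mentioned in the text."""
    for combo in product(*weighted_seq):
        phonemes = [p for p, _ in combo]
        weight = 1.0
        for _, w in combo:
            weight *= w
        yield phonemes, weight

def combined_probability(variant_probs, variant_weights):
    """Weighted sum of the variants' posteriors, normalized by the
    total weight, as in the 8-way weighted sum described above."""
    total = sum(variant_weights)
    return sum(p * w for p, w in zip(variant_probs, variant_weights)) / total
```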
In step 310, optionally, detecting the keyword in the audio segment according to the keyword probability includes: comparing the keyword probability with a predetermined threshold; and when the keyword probability is greater than the predetermined threshold, determining that the audio segment includes the keyword; otherwise, determining that it does not.
If there are multiple keyword models, the audio segment obtains a keyword probability from each model, and the maximum of these probabilities is selected and compared with the predetermined threshold. If it is greater than the threshold, the keyword of the model with the maximum keyword probability is recognized in the audio segment; if the maximum value does not exceed the threshold, no keyword is recognized in the audio segment.
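A short sketch of this decision rule (names hypothetical):

```python
def detect_keyword(keyword_probs: dict, threshold: float):
    """Step 310: pick the best-scoring keyword model and report its
    keyword only if the probability exceeds the threshold."""
    best = max(keyword_probs, key=keyword_probs.get)
    return best if keyword_probs[best] > threshold else None
```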
The multilingual keyword detection method converts keywords of different languages into phoneme sequences of a common target language, constructs target keyword models from those sequences, and finally detects keywords with the target keyword models. Because a unified, universal keyword model can be generated for any set of languages, keywords can be accurately detected even in sentences mixing multiple languages.
In addition, because no language identification model and no per-language keyword models are needed, a large amount of training audio is not required, and no training data need be collected for minority languages to train a language identification model. This avoids the high cost of training audio and the difficulty of acquiring training data for minority languages in the prior art. Moreover, since no language identification model is trained, the problem that language identification cannot handle short audio is also avoided. Therefore, the multilingual speech keyword detection method can accurately identify multilingual speech keywords, reduce the training audio required for keyword model generation, and improve the efficiency of multilingual speech keyword detection.
Optionally, an electronic device is further provided in an embodiment of the present application, as shown in fig. 6, the electronic device 2000 includes a memory 2200 and a processor 2400 electrically connected to the memory 2200, where the memory 2200 stores a computer program that can be run by the processor 2400, and when the computer program is executed by the processor 2400, the processes of any one of the above embodiments of the keyword model generating method and the multi-language speech keyword detection method are implemented, and the same technical effect can be achieved, and details are not repeated here to avoid repetition.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of any one of the method embodiments, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A keyword model generation method is characterized by comprising the following steps:
acquiring a plurality of keyword texts corresponding to different languages;
respectively converting the keyword texts of the different languages into phoneme sequences of the respective languages;
converting the phoneme sequence corresponding to the language into a phoneme sequence of the target language based on the mapping relation between phonemes of different languages and phonemes of the target language;
generating a plurality of keyword models corresponding to the keyword texts of different languages according to the converted phoneme sequence of the target language;
wherein, the same phoneme in the phoneme sequence corresponding to the language corresponds to at least one phoneme in the phoneme sequence of the target language;
when the same phoneme corresponds to a plurality of different phonemes in the phoneme sequence of the target language, the plurality of different phonemes respectively have corresponding weights, and the weights represent the probability that the same phoneme is represented as each phoneme in the plurality of different phonemes.
2. The method of claim 1, wherein generating the keyword models corresponding to the keyword texts corresponding to the different languages according to the converted phoneme sequence of the target language comprises:
and generating a plurality of different keyword models corresponding to the same phoneme respectively according to the plurality of different phonemes in the target language so as to generate a plurality of keyword models corresponding to the keyword texts corresponding to different languages.
3. A multilingual speech keyword detection method is characterized by comprising the following steps:
receiving a voice to be detected;
carrying out segmentation processing on the voice to be detected to obtain a plurality of audio segments;
converting each audio clip into a corresponding audio feature;
inputting the audio features into the keyword model according to claim 1 or 2 for calculation to obtain keyword probabilities of corresponding audio segments;
and detecting the keywords in the audio clip according to the keyword probability.
4. The method of claim 3, wherein converting each audio segment to a corresponding audio feature comprises:
determining a number of audio frames of the audio segment;
converting each audio frame of the audio segment into a set of audio feature values of corresponding dimensions;
and determining an audio feature matrix corresponding to the audio segments according to the number of the audio frames and the dimension.
5. The method of claim 4, wherein inputting the audio features into the keyword model for computation to obtain keyword probabilities for corresponding audio segments comprises:
forming a network comprising a plurality of node states based on phonemes of the target language keyword corresponding to the keyword model and the audio frames of the audio segments, wherein each phoneme and a corresponding audio frame form a node state;
taking the audio features corresponding to each audio frame as parameters, and calculating the posterior probability of each node state by using a predetermined algorithm;
and determining the maximum value of the posterior probabilities in the plurality of node states as the keyword probability of the audio segment.
6. The method according to claim 5, wherein when the phoneme of the keyword in the target language corresponding to the keyword model includes a plurality of different phonemes corresponding to a same phoneme of the language corresponding to the speech to be detected, the posterior probability of each node state is calculated using a predetermined algorithm, further comprising:
and respectively carrying out weighted calculation on the posterior probabilities of the node states corresponding to the phonemes according to the weights of the different phonemes corresponding to the same phoneme so as to obtain the posterior probabilities of the node states.
7. An electronic device, comprising: a memory and a processor electrically connected to the memory, the memory storing a computer program executable on the processor, the computer program, when executed by the processor, implementing the steps of the method according to any one of claims 1 to 6.
8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202011026187.4A 2020-09-25 2020-09-25 Multilingual voice keyword detection and model generation method and electronic equipment Active CN112185346B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011026187.4A CN112185346B (en) 2020-09-25 2020-09-25 Multilingual voice keyword detection and model generation method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011026187.4A CN112185346B (en) 2020-09-25 2020-09-25 Multilingual voice keyword detection and model generation method and electronic equipment

Publications (2)

Publication Number Publication Date
CN112185346A CN112185346A (en) 2021-01-05
CN112185346B (en) 2022-11-11

Family

ID=73944038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011026187.4A Active CN112185346B (en) 2020-09-25 2020-09-25 Multilingual voice keyword detection and model generation method and electronic equipment

Country Status (1)

Country Link
CN (1) CN112185346B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100542757B1 (en) * 2003-10-02 2006-01-20 한국전자통신연구원 Automatic expansion Method and Device for Foreign language transliteration
CN107195296B (en) * 2016-03-15 2021-05-04 阿里巴巴集团控股有限公司 Voice recognition method, device, terminal and system
CN109616096B (en) * 2018-12-29 2022-01-04 北京如布科技有限公司 Construction method, device, server and medium of multilingual speech decoding graph

Also Published As

Publication number Publication date
CN112185346A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
US6453292B2 (en) Command boundary identifier for conversational natural language
US7421387B2 (en) Dynamic N-best algorithm to reduce recognition errors
US11037553B2 (en) Learning-type interactive device
US10460034B2 (en) Intention inference system and intention inference method
US6836760B1 (en) Use of semantic inference and context-free grammar with speech recognition system
US10490182B1 (en) Initializing and learning rate adjustment for rectifier linear unit based artificial neural networks
US9135237B2 (en) System and a method for generating semantically similar sentences for building a robust SLM
EP4018437B1 (en) Optimizing a keyword spotting system
CN113168828A (en) Session proxy pipeline trained based on synthetic data
JP5932869B2 (en) N-gram language model unsupervised learning method, learning apparatus, and learning program
CN112397056B (en) Voice evaluation method and computer storage medium
CN111862954A (en) Method and device for acquiring voice recognition model
CN111986675A (en) Voice conversation method, device and computer readable storage medium
CN113744722A (en) Off-line speech recognition matching device and method for limited sentence library
Ranjan et al. Isolated word recognition using HMM for Maithili dialect
US20050187767A1 (en) Dynamic N-best algorithm to reduce speech recognition errors
CN112309406A (en) Voiceprint registration method, voiceprint registration device and computer-readable storage medium
US20110224985A1 (en) Model adaptation device, method thereof, and program thereof
JP2000172294A (en) Method of speech recognition, device thereof, and program recording medium thereof
Huang et al. Unsupervised discriminative training with application to dialect classification
CN111640423A (en) Word boundary estimation method and device and electronic equipment
CN112185346B (en) Multilingual voice keyword detection and model generation method and electronic equipment
CN113658596A (en) Semantic identification method and semantic identification device
CN114203180A (en) Conference summary generation method and device, electronic equipment and storage medium
CN113096667A (en) Wrongly-written character recognition detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 1st Floor, Building 14, No. 27 Jiancai Middle Road, Haidian District, Beijing 100096

Applicant after: Beijing PERCENT Technology Group Co.,Ltd.

Address before: 101, 1/F, Building 14, No. 27 Jiancai Middle Road, Haidian District, Beijing 100096

Applicant before: BEIJING BAIFENDIAN INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.

GR01 Patent grant