CN112259077B - Speech recognition method, device, terminal and storage medium - Google Patents

Speech recognition method, device, terminal and storage medium

Info

Publication number
CN112259077B
Authority
CN
China
Prior art keywords
keyword
voice
path
game
probabilities
Prior art date
Legal status
Active
Application number
CN202011126765.1A
Other languages
Chinese (zh)
Other versions
CN112259077A (en)
Inventor
陈堆盛
丁涵宇
陈柱欣
林悦
黄伟贵
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202011126765.1A
Publication of CN112259077A
Application granted
Publication of CN112259077B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/088 Word spotting

Abstract

The embodiment of the invention discloses a speech recognition method, device, terminal and storage medium. The embodiment can acquire the game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, and each keyword path corresponds to one keyword content; extract features of the game voice to obtain voice features of the game voice; in the N keyword paths, determine the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; in the non-keyword path, determine the non-keyword probability corresponding to the game voice based on the voice features; and determine the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing recognition of the player's game voice. The decoding network provided by the embodiment of the invention can accurately and rapidly identify keywords and non-keywords, thereby improving speech recognition efficiency.

Description

Speech recognition method, device, terminal and storage medium
Technical Field
The present invention relates to the field of electronic games, and in particular, to a method, an apparatus, a terminal, and a storage medium for voice recognition.
Background
With the popularity of electronic games and the diversification of game play, game voice has become an interactive medium between players and games. When a player plays an electronic game, however, voice recognition must be highly real-time: for example, when the pronunciation of a keyword appears in the game voice stream, the game must quickly and accurately detect the keyword spoken by the player in order to realize low-latency human-machine interaction.
However, when current voice recognition methods are applied to electronic game scenarios, recognition efficiency is often low.
Disclosure of Invention
The embodiment of the invention provides a voice recognition method, a voice recognition device, a terminal and a storage medium, which can improve voice recognition efficiency.
The embodiment of the invention provides a voice recognition method, which comprises the following steps:
obtaining the game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to one keyword content, and N is a positive integer;
extracting features of the game voice to obtain voice features of the game voice;
in the N keyword paths, determining the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; and,
in the non-keyword path, determining the non-keyword probability corresponding to the game voice based on the voice features;
based on the keyword probabilities of the N keywords and the non-keyword probability, determining the voice content of the game voice, thereby realizing recognition of the player's game voice.
The embodiment of the invention also provides a speech recognition device, which comprises:
an acquisition unit, used for acquiring the game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to one keyword content, and N is a positive integer;
a feature unit, used for extracting features of the game voice to obtain voice features of the game voice;
a keyword unit, used for determining, in the N keyword paths, the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; and,
a non-keyword unit, used for determining, in the non-keyword path, the non-keyword probability corresponding to the game voice based on the voice features;
a recognition unit, used for determining the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing recognition of the player's game voice.
In some embodiments, the non-keyword path includes i non-key combined-phoneme paths, i is a positive integer, and the non-keyword unit includes:
a combined-phoneme subunit, configured to determine, based on the voice features of the game voice, the non-key combined-phoneme probability in each non-key combined-phoneme path, thereby obtaining the i non-key combined-phoneme probabilities of the i non-key combined-phoneme paths;
a non-keyword subunit, configured to determine the non-keyword probability based on the i non-key combined-phoneme probabilities of the i non-key combined-phoneme paths.
In some embodiments, the combined-phoneme subunit is configured to determine, based on the voice features of the game voice, the non-key single-phoneme probability in each non-key single-phoneme path, thereby obtaining the i*j non-key single-phoneme probabilities of the i non-key combined-phoneme paths;
in some embodiments, the non-keyword subunit is configured to determine the non-keyword probability based on the i*j non-key single-phoneme probabilities of the i non-key combined-phoneme paths.
In some embodiments, each non-key single-phoneme path includes k state models, k is a positive integer, the game voice includes a plurality of voice segments, and the voice features of the game voice include the voice features of those segments. The combined-phoneme subunit is configured to:
determine the voice segment corresponding to each state model;
based on the state models in the non-key single-phoneme path, calculate the state probability of the state corresponding to each voice segment according to that segment's voice features, thereby determining the k state probabilities corresponding to the k state models in the non-key single-phoneme path;
and determine the non-key single-phoneme probability corresponding to the non-key single-phoneme path based on the k state probabilities.
In some embodiments, the non-key combined-phoneme path further comprises a non-speech path, and the non-keyword subunit is further configured to:
determine, based on the voice features of the game voice, the non-speech probability corresponding to each non-speech path, thereby obtaining the i non-speech probabilities of the non-speech paths in the i non-key combined-phoneme paths;
determine the non-keyword probability based on the i*j non-key single-phoneme probabilities of the i non-key combined-phoneme paths and the i non-speech probabilities of the non-speech paths in the i non-key combined-phoneme paths.
In some embodiments, the keyword path includes x state models, and the keyword unit is configured to:
determining a voice fragment corresponding to the state model in the keyword path;
based on the state models in the keyword paths, calculating the state probabilities of the voice fragments corresponding to the state models in the keyword paths according to the voice characteristics of the voice fragments corresponding to the state models in the keyword paths, so as to determine x state probabilities corresponding to x state models in the keyword paths;
And determining the keyword probability corresponding to the keyword path based on the x state probabilities corresponding to the x state models in the keyword path, so as to determine the keyword probability of the N keywords corresponding to the game voice.
In some embodiments, the state model is composed of a hidden Markov model combined with either a Gaussian mixture model or a deep neural network model.
In some embodiments, the recognition unit is configured to:
compare the probability value of the non-keyword probability with the keyword probabilities of the N keywords, and determine the maximum-probability result;
when the maximum-probability result is the non-keyword probability, determine that the voice content of the game voice is non-keyword content;
when the maximum-probability result is one of the N keyword probabilities, determine the keyword path corresponding to that maximum-probability result, and determine the voice content of the game voice to be the keyword content corresponding to that keyword path.
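As a purely illustrative sketch of this maximum-probability decision (function and argument names are invented here, not taken from the patent):

```python
def decide_content(keyword_probs, non_keyword_prob):
    """Return the recognized keyword content, or None for non-keyword content.

    keyword_probs: dict mapping keyword content -> keyword-path probability
    non_keyword_prob: probability from the non-keyword path
    """
    best_kw = max(keyword_probs, key=keyword_probs.get)
    # If the non-keyword path wins the comparison, the voice is non-keyword content.
    if non_keyword_prob >= keyword_probs[best_kw]:
        return None
    # Otherwise the keyword path with the maximum probability determines the content.
    return best_kw
```

The keyword contents used in any call are whatever contents the keyword paths were built for.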
In some embodiments, the recognition unit is further configured to:
when the voice content of the game voice is the keyword content corresponding to a keyword path, determine the game interaction operation corresponding to that keyword content;
and, in the game, cause the game character operated by the player to execute the game interaction operation, thereby realizing voice interaction between the player and the game.
In some embodiments, the recognition unit is configured to perform a voice masking operation on the game voice when the voice content of the game voice is the keyword content corresponding to the keyword path.
The embodiment of the invention also provides a terminal, which comprises a processor and a memory, wherein the memory stores a plurality of instructions; the processor loads the instructions from the memory to perform the steps in any of the speech recognition methods provided by the embodiments of the present invention.
The embodiment of the invention also provides a computer readable storage medium, which stores a plurality of instructions, the instructions are suitable for being loaded by a processor to execute the steps in any voice recognition method provided by the embodiment of the invention.
The embodiment of the invention can acquire the game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to one keyword content, and N is a positive integer; extract features of the game voice to obtain voice features of the game voice; in the N keyword paths, determine the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; in the non-keyword path, determine the non-keyword probability corresponding to the game voice based on the voice features; and determine the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing recognition of the player's game voice.
In the present invention, a decoding network is provided, the decoding network has a non-keyword path capable of identifying all different non-keywords, and a plurality of keyword paths corresponding to different keywords; the non-keyword path can identify non-keywords formed by combining a plurality of different phonemes, so that the decoding network can accurately and quickly identify keywords and non-keywords. Thus, the voice recognition efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1a is a schematic view of a speech recognition method according to an embodiment of the present invention;
FIG. 1b is a flowchart illustrating a speech recognition method according to an embodiment of the present invention;
FIG. 1c is a schematic diagram of a decoding network structure of a speech recognition method according to an embodiment of the present invention;
FIG. 1d is a schematic illustration of a GMM-HMM of a speech recognition method provided by an embodiment of the present invention;
FIG. 1e is a schematic diagram of DNN-HMM of a speech recognition method according to an embodiment of the present invention;
FIG. 1f is a schematic diagram of an HMM development structure of a speech recognition method according to an embodiment of the present invention;
FIG. 1g is a schematic diagram of a decoding network structure of a speech recognition method according to an embodiment of the present invention;
FIG. 1h is a schematic diagram of a decoding network structure of a speech recognition method according to an embodiment of the present invention;
FIG. 2a is a schematic diagram of a path structure of a 3-gram phoneme language model of a speech recognition method according to an embodiment of the present invention;
FIG. 2b is a schematic diagram of a path structure of a 4-gram phoneme language model of a speech recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a first configuration of a speech recognition device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides a voice recognition method, a voice recognition device, a voice recognition terminal and a storage medium.
The speech recognition device can be integrated in an electronic device, and the electronic device can be a terminal, a server, or other equipment. The terminal can be a mobile phone, a tablet computer, a smart Bluetooth device, a notebook computer, a personal computer (PC), or the like; the server may be a single server or a server cluster composed of multiple servers.
In some embodiments, the speech recognition device may also be integrated in multiple electronic devices; for example, it may be integrated in multiple servers, and the speech recognition method of the present invention is implemented by those servers together.
In some embodiments, the server may also be implemented in the form of a terminal.
For example, referring to fig. 1a, the electronic device may be a notebook computer. The notebook computer may obtain the game voice of a player and a decoding network, where the decoding network includes a non-keyword path and N keyword paths connected in parallel to the non-keyword path, each keyword path corresponds to one keyword content, and N is a positive integer; extract features of the game voice to obtain its voice features; in the N keyword paths, determine the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; in the non-keyword path, determine the non-keyword probability corresponding to the game voice based on the voice features; and determine the voice content of the game voice based on the N keyword probabilities and the non-keyword probability, thereby realizing recognition of the player's game voice.
The following will describe in detail. The numbers of the following examples are not intended to limit the preferred order of the examples.
Artificial Intelligence (AI) is a technology that uses digital computers to simulate how humans perceive the environment, acquire knowledge, and use that knowledge, enabling machines to have functions similar to human perception, reasoning, and decision-making. Artificial intelligence technology mainly includes computer vision, speech processing, natural language processing, machine learning, deep learning, and other directions.
Among them, speech processing techniques include Automatic Speech Recognition (ASR), which uses a computer in place of the human ear to recognize and interpret target audio and process it further.
In this embodiment, a speech recognition method based on ASR is provided, as shown in fig. 1b, the specific flow of the speech recognition method may be as follows:
101. the method comprises the steps of obtaining game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to one keyword content, and N is a positive integer.
There are various ways to obtain the player's game voice: for example, recording the player's voice through a recording device or sound sensor (a smartphone, for instance, can record the player's audio through its microphone); or, for another example, obtaining the game voice recorded by a terminal by communicating with that terminal over a network.
In the process of speech recognition, certain specific words and phrases in an utterance, namely Keywords, need to be recognized, and the non-keywords other than the keywords are also recognized.
Note that the keywords and the non-keywords may each be composed of sentences, words, or the like, or may be a single word, syllable, or the like.
In Chinese, each word may be composed of several phonemes, the smallest phonetic units divided according to the natural properties of speech; for example, one Chinese syllable may correspond to the single phoneme a1, the syllable "ai" to the phoneme a2i, and the word "dai" (generation) to the phonemes d and a2i, where the digits mark tones.
The method for obtaining the decoding network includes various methods, for example, communication with a database through a network to obtain the decoding network stored in the database; for another example, training the preset network by a machine learning method, so as to obtain a decoding network, and so on.
The method for training the preset network by the machine learning method to obtain the decoding network will be described below, and the network structures of the decoding network and the preset network will be described first:
in this embodiment, referring to fig. 1a, the decoding network may include a non-keyword path, and N keyword paths connected in parallel to the non-keyword path, where each keyword path may correspond to a keyword content, and N is a positive integer.
The preset network and the decoding network have the same network structure, but the network parameters of the preset network are different from those of the decoding network, i.e. the decoding network can be trained by the preset network.
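As a purely illustrative sketch of this network structure (all class and field names below are invented for illustration, not taken from the patent):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KeywordPath:
    content: str                    # the keyword content this path recognizes
    state_models: list = field(default_factory=list)  # HMM state models (omitted here)


@dataclass
class DecodingNetwork:
    keyword_paths: List[KeywordPath]           # N keyword paths, in parallel
    non_keyword_path: Optional[object] = None  # the parallel non-keyword branch


# A toy network with two keyword paths, mirroring the example of FIG. 1c
net = DecodingNetwork(keyword_paths=[KeywordPath("attention alert"),
                                     KeywordPath("preparation")])
```

Training would then adjust the parameters inside the paths while leaving this structure unchanged, matching the statement that the preset network and the decoding network share a network structure.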
102. And extracting the characteristics of the game voice to obtain the voice characteristics of the game voice.
The speech features of the game speech may include Mel Frequency Cepstrum Coefficient (MFCC) features, Linear Prediction Coefficient (LPC) features, Line Spectral Frequency (LSF) features, and so on.
The MFCC is a cepstral parameter extracted in the Mel-scale frequency domain; the Mel scale describes the nonlinear character of the human ear's perception of frequency, according to the formula:
Mel=2595*lg(1+f/700)
Where f is the frequency in hertz (Hz).
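The conversion can be written directly as a one-line function (an illustrative sketch, not code from the patent):

```python
import math


def hz_to_mel(f):
    # Mel = 2595 * lg(1 + f / 700), with lg the base-10 logarithm and f in Hz
    return 2595.0 * math.log10(1.0 + f / 700.0)
```

For example, 0 Hz maps to 0 mel, and the mapping grows monotonically but sublinearly with frequency.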
For example, in some embodiments, the game voice may be pre-framed, so as to extract Mel-frequency cepstrum coefficient features of each frame of voice segment in the game voice, so the specific steps of step 102 are as follows:
1. and performing pre-emphasis processing on the game voice to obtain the game voice after the pre-emphasis processing.
First, the game voice may be passed through a high-pass filter to perform pre-emphasis, obtaining the pre-emphasized game voice; the filter's transfer function is:
H(z) = 1 - u·z^(-1)
wherein u is a weight between 0.9 and 1.
The purpose of pre-emphasis is to boost the high frequency part of the game speech, making the spectrum of the game speech flatter, so that the same signal-to-noise ratio can be used later to find the spectrum of the low and high frequency parts. Meanwhile, part of noise in game voice can be eliminated, and high-frequency formants are highlighted.
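The filter H(z) = 1 - u·z^(-1) corresponds to the time-domain difference y[n] = x[n] - u·x[n-1]; a minimal sketch (illustrative only, with a commonly used u of 0.97 as the default):

```python
def pre_emphasis(signal, u=0.97):
    """Apply the high-pass filter H(z) = 1 - u*z^-1 in the time domain:
    y[n] = x[n] - u * x[n-1] (the first sample is passed through unchanged)."""
    return [signal[0]] + [signal[n] - u * signal[n - 1]
                          for n in range(1, len(signal))]
```

A constant (low-frequency) signal is strongly attenuated while rapid sample-to-sample changes pass through, which is exactly the high-frequency boost described above.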
2. And carrying out sliding window framing treatment on the game voice subjected to pre-emphasis treatment to obtain a plurality of voice fragments.
In some embodiments, to avoid excessive variation between two adjacent frames, a sliding-window framing method may be used so that adjacent frames share an overlap region. Sliding-window framing means that a framing window of preset length slides in one direction along the audio, and each time it slides by a fixed step, the audio inside the window is collected as one voice segment.
For example, in some embodiments, a frame window of length 25 milliseconds may be employed to collect speech segments in frame steps of 10 milliseconds.
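The 25 ms window with a 10 ms step can be sketched as follows (an illustration, not the patent's code; the 16 kHz sample rate is an assumption for the example):

```python
def frame_signal(samples, sample_rate=16000, win_ms=25, step_ms=10):
    """Split audio into overlapping frames with a sliding window."""
    win = int(sample_rate * win_ms / 1000)    # 400 samples at 16 kHz
    step = int(sample_rate * step_ms / 1000)  # 160 samples at 16 kHz
    return [samples[start:start + win]
            for start in range(0, len(samples) - win + 1, step)]
```

Adjacent frames then overlap by 15 ms, which is the overlap region described above.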
3. And performing fast Fourier transform on the voice fragments to obtain frequency domain information of the voice fragments.
To overcome the problem that the transformation of the audio signal in the time domain is difficult to embody the audio characteristics, in some embodiments the audio may be time-frequency transformed so that it is transformed into an energy distribution in the frequency domain.
In some embodiments, in order to make the time-frequency transformation more efficient and faster, further improve the efficiency of speech recognition, a fast fourier transform may be performed on the speech segments, so as to obtain the frequency domain information of the speech segments, i.e. the energy distribution of the audio frequency on the frequency spectrum.
The discrete Fourier transform computed by the FFT is as follows:
X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), k = 0, 1, …, N-1
where X(k) is the output frequency-domain signal, x(n) is the time-domain signal of the input audio, and N is the number of points of the Fourier transform.
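For illustration only, the transform can be evaluated directly from its definition; a real system would use an FFT library routine rather than this O(N²) loop:

```python
import cmath


def dft(frame):
    """Direct evaluation of X(k) = sum_n x(n) * exp(-j*2*pi*k*n/N).
    An FFT computes the same result in O(N log N); this form is only
    meant to make the formula above concrete."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]
```

A unit impulse yields a flat spectrum, and a constant frame concentrates all energy in X(0), matching the definition.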
4. And carrying out Mel frequency cepstrum coefficient feature extraction on the frequency domain information to obtain the MFCC voice feature of the voice fragment.
First, the frequency domain information may be passed through a set of Mel triangular filter banks, which smooth the spectrum, eliminate the effect of harmonics, and highlight the formants of the speech, so that differences in pitch do not affect the resulting features.
The Mel triangular filter bank may include M triangular filters, and the frequency response of the m-th triangular filter is defined as:
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m),
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1),
H_m(k) = 0 otherwise,
where f(m) denotes the center frequency of the m-th filter.
The logarithmic energy s(m) of each filter output is then calculated as:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), 1 ≤ m ≤ M,
where the number of filters M can take a value of 22-26.
Finally, the MFCC coefficients C(n) are obtained from the log energies via the Discrete Cosine Transform (DCT):
C(n) = Σ_{m=1}^{M} s(m)·cos( πn(m - 0.5) / M ), n = 1, 2, …, L,
where L is the coefficient order of the MFCC, typically taken as 12-16.
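Steps 3 and 4 together can be sketched as follows (an illustrative reconstruction under common MFCC conventions, not code from the patent; the function names and the placement of the filter edges on the mel scale are assumptions):

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc_from_power_spectrum(power, sample_rate, num_filters=24, num_ceps=13):
    """power: one-sided power spectrum, len(power) bins covering 0..sample_rate/2."""
    nbins = len(power)
    lo, hi = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    # M + 2 boundary frequencies, evenly spaced on the mel scale
    edges = [mel_to_hz(lo + i * (hi - lo) / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = [b * (sample_rate / 2.0) / (nbins - 1) for b in range(nbins)]
    log_e = []
    for m in range(1, num_filters + 1):
        f_l, f_c, f_r = edges[m - 1], edges[m], edges[m + 1]
        e = 0.0
        for b, f in enumerate(bin_hz):
            if f_l < f <= f_c:          # rising edge of the triangle
                w = (f - f_l) / (f_c - f_l)
            elif f_c < f < f_r:         # falling edge of the triangle
                w = (f_r - f) / (f_r - f_c)
            else:
                w = 0.0
            e += w * power[b]
        log_e.append(math.log(e + 1e-12))  # s(m)
    # DCT of the log filter-bank energies -> cepstral coefficients C(n)
    M = num_filters
    return [sum(log_e[m] * math.cos(math.pi * n * (m + 0.5) / M)
                for m in range(M))
            for n in range(num_ceps)]
```

The M and L values in the formulas above correspond to `num_filters` and `num_ceps` here.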
103. And determining keyword probabilities of N keywords corresponding to the game voice based on the voice characteristics in the N keyword paths.
Wherein each keyword path corresponds to one keyword content, for example, referring to fig. 1c, the keyword corresponding to the keyword path 1 is "attention alert", the keyword corresponding to the keyword path 2 is "preparation", and so on.
In some embodiments, a keyword path may include a plurality of state models, each state model being composed of a hidden Markov model (Hidden Markov Model, HMM) combined with either a Gaussian mixture model (GMM) or a deep neural network model (DNN).
The HMM is a Markov-chain model with hidden, unknown parameters: its states are not directly observable, but can be inferred from a sequence of observation vectors, each observation vector being generated from a state according to that state's probability density distribution.
Wherein the GMM is a model that quantifies the audio features using gaussian probability density functions to derive the acoustic prior probability (prior probability) of the audio.
Wherein DNN is a model that maps audio features to acoustic prior probabilities (prior probability) through deep learning.
For example, in some embodiments, as shown in FIG. 1d, the state model is a GMM-HMM model; in some embodiments, as shown in FIG. 1e, the state model is a DNN-HMM model.
A keyword path may include multiple state models. The state models are used to predict the states of audio segments, and several states together represent one phoneme, so the phonemes contained in the audio segments can be determined.
Thus, the pronunciation of each character can be represented by HMMs with different parameters. For example, referring to fig. 1f, the phonemes of the character for "note" comprise the phoneme zh and the phoneme u; the phoneme zh can be represented by states 1, 2 and 3, and the phoneme u by states 4, 5 and 6, so the character is composed of the HMM states of the two phonemes, where the 3 HMM states of phoneme zh correspond to states 1, 2, 3 and the 3 HMM states of phoneme u correspond to states 4, 5, 6.
Thus, referring to FIG. 1g, where the speech corresponding to keyword path 1 is "attention alert," the keyword path includes a state model of phonemes zh, u, y, i, j, ing, j, ie.
The state in each HMM model can be set manually according to actual requirements.
Each state of the HMM corresponds to an observation, which may be a real number or a vector, and in some embodiments is an MFCC.
After the MFCC feature sequence of the audio segment is input into the HMM models, the probability that each HMM generates that MFCC feature sequence can be computed with the forward algorithm, and the maximum such probability is taken as the keyword probability of the keyword path.
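The forward algorithm mentioned above can be sketched as follows (a generic textbook formulation, not the patent's implementation; the argument layout is a hypothetical choice):

```python
def forward(obs_probs, trans, init):
    """Forward algorithm: P(observation sequence | HMM).

    obs_probs[t][s]: likelihood of the frame at time t under state s
                     (in practice produced by a GMM or DNN acoustic model)
    trans[s][s2]:    state transition probability from s to s2
    init[s]:         initial state distribution
    """
    S = len(init)
    # alpha[s] = probability of the observations so far AND being in state s
    alpha = [init[s] * obs_probs[0][s] for s in range(S)]
    for t in range(1, len(obs_probs)):
        alpha = [sum(alpha[s] * trans[s][s2] for s in range(S)) * obs_probs[t][s2]
                 for s2 in range(S)]
    return sum(alpha)
```

Running this once per keyword path and taking the maximum result yields the keyword probabilities described above; a production decoder would work in log-space to avoid underflow.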
Thus, in some embodiments, x state models may be included in the keyword paths, and the step of determining the keyword probabilities of the N keywords corresponding to the game voice based on the voice features in the N keyword paths may include the steps of:
determining a voice fragment corresponding to the state model in the keyword path;
based on the state models in the keyword paths, calculating the state probabilities of the voice fragments corresponding to the state models in the keyword paths according to the voice characteristics of the voice fragments corresponding to the state models in the keyword paths, so as to determine x state probabilities corresponding to x state models in the keyword paths;
And determining the keyword probability corresponding to the keyword path based on the x state probabilities corresponding to the x state models in the keyword path, so as to determine the keyword probability of the N keywords corresponding to the game voice.
Referring to FIG. 1d, the HMM can determine the next state based on its self-loop probability a_{s_i,s_i} and the state transition probability a_{s_i,s_i+1}. The observation probability of each state can be obtained from the GMM, and the state transition probabilities of the HMM can be obtained from the adjacent states.
104. In the non-keyword path, the non-keyword probability corresponding to the game voice is determined based on the voice features.
Referring to fig. 1g, the non-keyword path may include i non-key combined-phoneme paths, where i is a positive integer, each non-key combined-phoneme path corresponds to one non-keyword combined phoneme, and a combined phoneme is formed by combining multiple identical or different phonemes. For example, for the meaningless voice "ao", the scheme can determine that it is a non-keyword by identifying it with the corresponding non-key combined-phoneme path in the non-keyword path.
Step 104 may include the steps of:
based on the voice characteristics of the game voice, determining the non-keyword combined-phoneme probability in each non-keyword combined-phoneme path, so as to obtain the i non-keyword combined-phoneme probabilities of the i non-keyword combined-phoneme paths;
determining the non-keyword probability based on the i non-keyword combined-phoneme probabilities of the i non-keyword combined-phoneme paths.
Each non-keyword combined-phoneme path may include j non-keyword single-phoneme paths connected in parallel, where j is a positive integer and each non-keyword single-phoneme path corresponds to one non-keyword phoneme. For example, the nonsensical voice "ao" may be composed of the phonemes a and o, and the non-keyword path corresponding to "ao" may be formed by connecting in parallel the single-phoneme path corresponding to a and the single-phoneme path corresponding to o.
In particular, the step of "determining the non-keyword combined-phoneme probabilities in each non-keyword combined-phoneme path based on the speech characteristics of the game speech, thereby obtaining i non-keyword combined-phoneme probabilities for the i non-keyword combined-phoneme paths" may comprise the steps of:
based on the voice characteristics of the game voice, determining the non-keyword single-phoneme probabilities in each non-keyword single-phoneme path, so as to obtain i×j non-keyword single-phoneme probabilities of the i non-keyword combined-phoneme paths;
thus, the step of "determining the non-keyword probability based on the i non-keyword combined-phoneme probabilities of the i non-keyword combined-phoneme paths" may comprise the steps of:
determining the non-keyword probability based on the i×j non-keyword single-phoneme probabilities of the i non-keyword combined-phoneme paths.
For example, referring to FIG. 1h, the non-keyword path is composed of non-keyword combined-phoneme paths a, o, e …, eng, where the combined-phoneme path a is in turn composed of the non-keyword single-phoneme paths aa, ao, ae …, aeng.
In some embodiments, referring to fig. 1h, each non-keyword combined-phoneme path further includes a back-off path, which can be used to repeatedly detect multiple nonsensical non-keyword phonemes; this structure can effectively cover all non-keywords. For example, for the speech "aaaa", the single-phoneme path aa of the combined-phoneme path a can be traversed once, and then traversed again through the back-off path to complete the recognition.
In some embodiments, each non-keyword combined-phoneme path may also include a non-speech path, which may be used to identify non-speech portions of the player's voice. For example, the speech "a_" is made up of the phoneme a and a non-speech portion "_", and is identified through the non-speech path of the non-keyword combined-phoneme path a.
Thus, in some embodiments, the non-keyword combined-phoneme path further includes a non-speech path, and the step of "determining the non-keyword probability based on the i×j non-keyword single-phoneme probabilities of the i non-keyword combined-phoneme paths" further includes:
based on the voice characteristics of the game voice, determining the non-speech probability corresponding to each non-speech path, so as to obtain the i non-speech probabilities of the non-speech paths in the i non-keyword combined-phoneme paths;
determining the non-keyword probability based on the i×j non-keyword single-phoneme probabilities of the i non-keyword combined-phoneme paths and the i non-speech probabilities of the non-speech paths in the i non-keyword combined-phoneme paths.
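The combination of the i×j single-phoneme scores and the i non-speech scores can be sketched as below. The text does not fix the combination rule, so this sketch assumes the best-scoring branch wins inside each combined-phoneme path and the best path gives the final non-keyword score; that rule is an assumption for illustration.

```python
def non_keyword_log_prob(monophone_logps, nonspeech_logps):
    """Score of the non-keyword path.

    monophone_logps -- i lists of j log probabilities, one list per
                       combined-phoneme path (its j single-phoneme branches)
    nonspeech_logps -- i log probabilities, one non-speech branch per path
    Assumption: max over branches within a path, then max over paths.
    """
    per_path = [max(max(branches), ns)
                for branches, ns in zip(monophone_logps, nonspeech_logps)]
    return max(per_path)
```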
Specifically, similar to the keyword paths, in some embodiments, k state models may be included in a non-keyword single-phoneme path, k being a positive integer; the game speech includes a plurality of speech segments, the speech features of the game speech include the speech features of the speech segments, and each non-keyword single-phoneme path includes a corresponding non-keyword single-phoneme model. Determining the non-keyword single-phoneme probability corresponding to each non-keyword single-phoneme path based on the speech features of the game speech may include:
determining the speech segment corresponding to each state model;
based on the state models in the non-keyword single-phoneme path, calculating the state probability corresponding to each speech segment according to the speech features of that segment, thereby determining the k state probabilities corresponding to the k state models in the non-keyword single-phoneme path;
and determining the non-keyword single-phoneme probability corresponding to the non-keyword single-phoneme path based on the k state probabilities.
105. Based on the keyword probabilities and the non-keyword probabilities of the N keywords, the voice content of the game voice is determined, so that the game voice recognition of the player is realized.
In some embodiments, step 105 may include the steps of:
comparing the probability value of the non-keyword probability with the N keyword probabilities, and determining the maximum probability result;
when the maximum probability result is the non-keyword probability, determining that the voice content of the game voice is the non-keyword content;
when the maximum probability result is one keyword probability of the N keyword probabilities, determining a keyword path corresponding to the maximum probability result, and determining the voice content of the game voice as the keyword content corresponding to the keyword path.
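The decision rule of step 105 can be sketched as follows; the dictionary-based interface is an illustrative assumption, not the patented implementation.

```python
def decide(keyword_logps, non_keyword_logp):
    """Compare the N keyword probabilities with the non-keyword probability.

    keyword_logps -- dict mapping keyword content to its path's log probability
    Returns the keyword content with the maximum probability, or None when the
    non-keyword path wins (i.e. the speech is non-keyword content).
    """
    best_kw = max(keyword_logps, key=keyword_logps.get)
    if non_keyword_logp >= keyword_logps[best_kw]:
        return None
    return best_kw
```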
The above introduces the keyword detection step in speech recognition. The scheme can detect the game speech stream in real time, for example by adopting a sliding-window detection method.
For example, in some embodiments, a detection window with a length of 1.5 seconds may be used to detect speech from the speech stream. If no keyword is detected within the 1.5-second detection window, the detection window is moved back over the speech stream by 0.5 seconds to continue the next detection; if a keyword is detected within the 1.5-second detection window, the detection window is moved back over the speech stream by 1.5 seconds.
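The sliding-window schedule just described can be sketched as below. The `detect` callback stands in for the decoding-network comparison of steps 103-105 and is a hypothetical interface.

```python
def sliding_window_detect(stream_len, detect, win=1.5, hit_step=1.5, miss_step=0.5):
    """Slide a `win`-second detection window over a `stream_len`-second stream.

    detect(start, end) -> bool reports whether a keyword was found in the
    window. On a miss the window advances by miss_step (0.5 s); on a hit it
    advances by hit_step (1.5 s) so the same keyword is not scored twice.
    Returns the start times of the windows in which a keyword was found.
    """
    t, hits = 0.0, []
    while t + win <= stream_len:
        if detect(t, t + win):
            hits.append(t)
            t += hit_step
        else:
            t += miss_step
    return hits
```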
After the voice content is identified, man-machine interaction can be performed based on the voice content, so the step of determining the voice content of the game voice so as to realize the game voice identification of the player comprises the following steps:
when the voice content of the game voice is the keyword content corresponding to the keyword path, determining game interaction operation corresponding to the keyword content;
in the game, the game roles operated by the player execute the game interaction operation, so that the voice interaction between the player and the game is realized.
Wherein, when the voice content is identified, keyword masking may be performed, so the step of determining the voice content of the game voice, thereby implementing game voice recognition for the player may include the steps of:
and when the voice content of the game voice is the keyword content corresponding to the keyword path, performing a voice shielding operation on the game voice.

As can be seen from the above, the embodiment of the present invention can obtain the game voice of the player and the decoding network, where the decoding network includes a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to one keyword content, and N is a positive integer; extract features of the game voice to obtain voice features of the game voice; in the N keyword paths, determine the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; in the non-keyword path, determine the non-keyword probability corresponding to the game voice based on the voice features; and determine the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, so that game voice recognition for the player is realized.
Thus in the present invention, a decoding network is provided having a non-keyword path that can absorb all of the different non-keywords, and a plurality of keyword paths corresponding to the different keywords; the non-keyword path can absorb non-keywords formed by combining a plurality of different phonemes, so that the decoding network can accurately and rapidly identify the keywords and the non-keywords. Thus, the voice recognition efficiency is improved.
The method described in the above embodiments will be described in further detail below.
In this embodiment, a method according to an embodiment of the present invention will be described in detail by taking voice control of a game as an example.
When the scheme is applied to a game application, in order to balance the complexity and accuracy of voice recognition according to the actual requirements of the game, the scheme adopts the HMM-GMM model for game scenes requiring low recognition computation and moderate accuracy, and adopts the HMM-DNN model for game scenes requiring high recognition accuracy.
In order to identify all keywords and non-keywords in player voices, referring to fig. 1h, the scheme proposes a decoding network comprising 2 keyword paths and 1 non-keyword path: the 1st keyword path corresponds to the keyword "attention alert", the 2nd keyword path corresponds to the keyword "preparation", and the non-keyword path comprises the non-keyword combined-phoneme paths a, o, e … and eng. The combined-phoneme path a comprises the single-phoneme paths aa, ao, ae … and aeng; the combined-phoneme path o comprises the single-phoneme paths oa, oo, oe … and oeng; …; the combined-phoneme path eng comprises the single-phoneme paths enga, engo, enge … and engeng.
It should be noted that in some embodiments, the combined phonemes may be two-by-two combined phonemes, three-by-three combined phonemes, four-by-four combined phonemes, and so on. When phonemes are combined three by three, the non-keyword path can comprise non-keyword three-by-three combined-phoneme paths, each of which can comprise a plurality of non-keyword two-by-two combined-phoneme paths, each of which in turn can comprise a plurality of non-keyword single-phoneme paths.
Specifically, in terms of the N-gram language model: when the non-keyword path includes only non-keyword single-phoneme paths, it may be regarded as a 1-gram language model; when it includes two-by-two combined phonemes, a 2-gram language model; when it includes three-by-three combined phonemes, a 3-gram language model; when it includes four-by-four combined phonemes, a 4-gram language model; and so on.
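The enumeration of combined-phoneme paths of a given order can be sketched as below; the tiny two-phoneme inventory is illustrative only (a real phoneme set such as a, o, e …, eng would be much larger).

```python
from itertools import product

def combined_phoneme_paths(phones, order):
    """Enumerate non-keyword combined-phoneme paths of a given order:
    order 1 -> single phonemes (1-gram), order 2 -> two-by-two combinations
    (2-gram), order 3 -> three-by-three (3-gram), and so on."""
    return ["".join(combo) for combo in product(phones, repeat=order)]
```

The path count grows as len(phones)**order, which is why the complexity discussion below favours low-order combinations.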
Wherein, N-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese, it is called the Chinese Language Model (CLM). The Chinese language model can realize automatic conversion to Chinese characters by utilizing collocation information between adjacent words in the context.
Hereinafter, the combined-phoneme paths are described as an N-gram language model:
for example, referring to FIG. 2a, the non-keyword path includes non-keyword three-by-three combined-phoneme paths a, o, e …, eng; the three-by-three combined-phoneme path a comprises the two-by-two phoneme paths aa, ao, ae …, and aeng; the two-by-two phoneme path ao may include the single-phoneme paths aoa, aoo, aoe …, aoeng, and so on.
Similarly, for example, referring to FIG. 2b, the non-keyword path includes non-keyword four-by-four combined-phoneme paths a, o, e …, eng; the four-by-four combined-phoneme path a comprises the three-phoneme paths aa, ao, ae …, and aeng; the three-phoneme path ao may include the two-by-two phoneme paths aoa, aoo, aoe …, aoeng; the two-by-two phoneme path aoa may include the single-phoneme paths aoaa, aoao, aoae …, aoaeng, and so on.
Because four-by-four combined-phoneme paths are more complex, in this scheme, in order to reduce the complexity of the non-keyword path and improve recognition efficiency, the non-keyword path may include only two-by-two combined-phoneme paths.
Note that the non-keyword paths shown in fig. 2a and 2b may further include a fallback path; for example, each time a non-keyword single-phoneme path is traversed, the fallback path can return to the starting point of the non-keyword path, so that non-keyword phoneme detection can be repeated multiple times.
In some embodiments, after the non-keyword probabilities are obtained, the phoneme combinations with lower probability can be discarded, so that the search scale is reduced when comparing the N keyword probabilities with the non-keyword probability, and the complexity of the non-keyword path is reduced.
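The pruning idea above can be sketched as follows. The keep-ratio criterion is an assumed heuristic (the text only says lower-probability combinations can be discarded); a fixed threshold or beam would work equally well.

```python
def prune_non_keyword_paths(path_logps, keep_ratio=0.5):
    """Discard lower-probability phoneme combinations, keeping the best
    keep_ratio fraction of paths (at least one), to shrink the search
    before the keyword/non-keyword comparison. keep_ratio is an assumption."""
    ranked = sorted(path_logps, key=path_logps.get, reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return {p: path_logps[p] for p in ranked[:keep]}
```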
In some embodiments, the method and the device can realize the purposes of detecting whether the voice data stream of the player in the game contains a certain part of keywords in real time and feeding back the detection result, thereby enriching the application scene of the voice in the game, improving the interaction efficiency and the interestingness of the game and improving the immersion of the player in the game.
In some embodiments, the scheme can realize voice supervision and realize the function of monitoring partial contraband words in real time (such as dirty words filtering and the like), thereby purifying the game environment.
As can be seen from the above, the present invention provides a decoding network for speech recognition based on phonemes, the decoding network having a non-keyword path capable of absorbing all different non-keywords, and a plurality of keyword paths corresponding to the different keywords; the non-keyword path can absorb non-keywords formed by combining various phonemes in pairs, so that the decoding network can accurately and rapidly identify keywords and non-keywords, and meanwhile, the complexity of voice recognition is reduced. Thus, the voice recognition efficiency is improved.
In order to better implement the method, the embodiment of the invention also provides a language identification device which can be integrated in electronic equipment, wherein the electronic equipment can be a terminal, a server and the like. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices; the server may be a single server or a server cluster composed of a plurality of servers.
For example, in this embodiment, a method of the embodiment of the present invention will be described in detail by taking a specific integration of a language identification device in XX as an example.
For example, as shown in fig. 3, the language recognition apparatus may include an acquisition unit 301, a feature unit 302, a keyword unit 303, a non-keyword unit 304, and a recognition unit 305, as follows:
and (one) an acquisition unit 301.
The obtaining unit 301 may be configured to obtain game voice of a player and a decoding network, where the decoding network may include a non-keyword path, and N keyword paths connected in parallel to the non-keyword path, where each keyword path corresponds to a keyword content, and N is a positive integer.
(two) feature cell 302.
The feature unit 302 may be configured to perform feature extraction on the game voice to obtain a voice feature of the game voice.
And (III) a keyword unit 303.
The keyword unit 303 may be configured to determine keyword probabilities of N keywords corresponding to the game voice based on the voice features in the N keyword paths.
In some embodiments, the keyword path may include x state models, and the keyword unit 303 may be configured to:
determining a voice fragment corresponding to the state model in the keyword path;
based on the state models in the keyword paths, calculating the state probabilities of the voice fragments corresponding to the state models in the keyword paths according to the voice characteristics of the voice fragments corresponding to the state models in the keyword paths, so as to determine x state probabilities corresponding to x state models in the keyword paths;
and determining the keyword probability corresponding to the keyword path based on the x state probabilities corresponding to the x state models in the keyword path, so as to determine the keyword probability of the N keywords corresponding to the game voice.
In some embodiments, the state model is composed of a hidden Markov model and a Gaussian mixture model or a deep neural network model.
(IV) non-keyword unit 304.
The non-keyword unit 304 may be configured to determine the non-keyword probability corresponding to the game voice based on the speech features in the non-keyword path.
In some embodiments, the non-keyword paths may include i non-keyword combined phoneme paths, where i is a positive integer, and the non-keyword unit 304 may include a combined phoneme subunit and a non-keyword subunit, as follows:
(1) And a combined-phoneme subunit operable to determine non-critical combined-phoneme probabilities in each non-critical combined-phoneme path based on the speech characteristics of the game speech, thereby obtaining i non-critical combined-phoneme probabilities for the i non-critical combined-phoneme paths.
(2) A non-keyword subunit operable to determine non-keyword probabilities based on the i non-keyword combined phoneme probabilities of the i non-keyword combined phoneme paths.
In some embodiments, the combined-phoneme subunit may be configured to determine the non-keyword single-phoneme probabilities in each non-keyword single-phoneme path based on the speech characteristics of the game speech, thereby obtaining i×j non-keyword single-phoneme probabilities for the i non-keyword combined-phoneme paths;
in some embodiments, the non-keyword subunit may be configured to determine the non-keyword probability based on the i×j non-keyword single-phoneme probabilities of the i non-keyword combined-phoneme paths.
In some embodiments, k state models may be included in the non-critical single-phoneme paths, k is a positive integer, the game speech may include a plurality of speech segments, the speech features of the game speech may include the speech features of the speech segments, each non-critical single-phoneme path may include a corresponding non-critical single-phoneme model, and the combined-phoneme subunit may be configured to:
Determining a voice fragment corresponding to the state model;
based on the state model in the non-key single-phoneme path, calculating the state probability of the state corresponding to the voice fragment according to the voice characteristics of the voice fragment, thereby determining k state probabilities corresponding to k state sub-models in the non-key single-phoneme path;
and determining the non-keyword single-phoneme probability corresponding to the non-keyword single-phoneme path based on the k state probabilities.
In some embodiments, the non-critical combined phoneme path may further comprise a non-speech path, and the non-critical word sub-unit may be further configured to:
based on the voice characteristics of game voice, determining the non-voice probability corresponding to each non-voice path, so as to obtain i non-voice probabilities of the non-voice paths in the i non-key combined phoneme paths;
the non-keyword probability is determined based on the i×j non-keyword single-phoneme probabilities of the i non-keyword combined-phoneme paths and the i non-speech probabilities of the non-speech paths in the i non-keyword combined-phoneme paths.
And (fifth) an identification unit 305.
The recognition unit 305 may be configured to determine the voice content of the game voice based on the keyword probabilities and the non-keyword probabilities of the N keywords, thereby implementing game voice recognition for the player.
In some embodiments, the identification unit 305 may be configured to:
comparing the probability value of the non-keyword probability with the N keyword probabilities, and determining the maximum probability result;
when the maximum probability result is the non-keyword probability, determining that the voice content of the game voice is the non-keyword content;
when the maximum probability result is one keyword probability of the N keyword probabilities, determining a keyword path corresponding to the maximum probability result, and determining the voice content of the game voice as the keyword content corresponding to the keyword path.
In some embodiments, the identification unit 305 may be configured to:
when the voice content of the game voice is the keyword content corresponding to the keyword path, determining game interaction operation corresponding to the keyword content;
in the game, the game roles operated by the player execute the game interaction operation, so that the voice interaction between the player and the game is realized.
In some embodiments, the recognition unit 305 may be configured to perform a voice masking operation on the game voice when the voice content of the game voice is the keyword content corresponding to the keyword path.
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
As can be seen from the above, the language recognition device in this embodiment is configured to obtain a game voice of a player and a decoding network, where the decoding network includes a non-keyword path and N keyword paths connected in parallel to the non-keyword path, each keyword path corresponds to a keyword content, and N is a positive integer; extracting the characteristics of the game voice by the characteristic unit to obtain the voice characteristics of the game voice; determining keyword probabilities of N keywords corresponding to game voices based on voice characteristics in N keyword paths by a keyword unit; and determining, by the non-keyword unit, a non-keyword probability corresponding to the game language based on the speech feature in the non-keyword path; the recognition unit determines the voice content of the game voice based on the keyword probabilities and the non-keyword probabilities of the N keywords, so that the game voice recognition of the player is realized.
Therefore, the embodiment of the invention can improve the voice recognition efficiency.
Correspondingly, the embodiment of the application also provides a computer device, which can be a terminal or a server, wherein the terminal can be a terminal device such as a smart phone, a tablet computer, a notebook computer, a touch screen, a game console, a personal computer, a personal digital assistant (Personal Digital Assistant, PDA) and the like.
As shown in fig. 4, fig. 4 is a schematic structural diagram of a computer device provided in an embodiment of the present application, where the computer device 400 includes a processor 401 having one or more processing cores, a memory 402 having one or more computer readable storage media, and a computer program stored in the memory 402 and executable on the processor. The processor 401 is electrically connected to the memory 402. It will be appreciated by those skilled in the art that the computer device structure shown in the figures is not limiting of the computer device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
Processor 401 is a control center of computer device 400 and connects the various portions of the entire computer device 400 using various interfaces and lines to perform various functions of computer device 400 and process data by running or loading software programs and/or modules stored in memory 402 and invoking data stored in memory 402, thereby performing overall monitoring of computer device 400.
In the embodiment of the present application, the processor 401 in the computer device 400 loads the instructions corresponding to the processes of one or more application programs into the memory 402 according to the following steps, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions:
The method comprises the steps of obtaining game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to one keyword content, and N is a positive integer;
extracting features of the game voice to obtain voice features of the game voice;
in the N keyword paths, determining keyword probabilities of N keywords corresponding to game voices based on voice features; the method comprises the steps of,
in the non-keyword path, determining the non-keyword probability corresponding to the game voice based on the voice features;
based on the keyword probabilities and the non-keyword probabilities of the N keywords, the voice content of the game voice is determined, so that the game voice recognition of the player is realized.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Optionally, as shown in fig. 4, the computer device 400 further includes: a touch display 403, a radio frequency circuit 404, an audio circuit 405, an input unit 406, and a power supply 407. The processor 401 is electrically connected to the touch display 403, the radio frequency circuit 404, the audio circuit 405, the input unit 406, and the power supply 407, respectively. Those skilled in the art will appreciate that the computer device structure shown in FIG. 4 is not limiting of the computer device and may include more or fewer components than shown, or may be combined with certain components, or a different arrangement of components.
The touch display 403 may be used to display a graphical user interface and receive operation instructions generated by a user acting on the graphical user interface. The touch display screen 403 may include a display panel and a touch panel. Wherein the display panel may be used to display information entered by a user or provided to a user as well as various graphical user interfaces of a computer device, which may be composed of graphics, text, icons, video, and any combination thereof. Alternatively, the display panel may be configured in the form of a liquid crystal display (LCD, liquid Crystal Display), an Organic Light-Emitting Diode (OLED), or the like. The touch panel may be used to collect touch operations on or near the user (such as operations on or near the touch panel by the user using any suitable object or accessory such as a finger, stylus, etc.), and generate corresponding operation instructions, and the operation instructions execute corresponding programs. Alternatively, the touch panel may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, and sends the touch point coordinates to the processor 401, and can receive and execute commands sent from the processor 401. The touch panel may overlay the display panel, and upon detection of a touch operation thereon or thereabout, the touch panel is passed to the processor 401 to determine the type of touch event, and the processor 401 then provides a corresponding visual output on the display panel in accordance with the type of touch event. 
In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 403 to implement the input and output functions. In some embodiments, however, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions. That is, the touch display 403 may also implement an input function as part of the input unit 406.
In the embodiment of the application, the processor 401 executes the game application program to generate a graphical user interface on the touch display screen 403, where the virtual scene on the graphical user interface includes at least one skill control area, and the skill control area includes at least one skill control. The touch display 403 is used for presenting a graphical user interface and receiving an operation instruction generated by a user acting on the graphical user interface.
The radio frequency circuitry 404 may be used to transceive radio frequency signals to establish wireless communications with a network device or other computer device via wireless communications.
The audio circuitry 405 may be used to provide an audio interface between a user and a computer device through speakers, microphones, and so on. The audio circuit 405 may transmit the received electrical signal after audio data conversion to a speaker, where the electrical signal is converted into a sound signal for output; on the other hand, the microphone converts the collected sound signals into electrical signals, which are received by the audio circuit 405 and converted into audio data, which are processed by the audio data output processor 401 and sent via the radio frequency circuit 404 to, for example, another computer device, or which are output to the memory 402 for further processing. The audio circuit 405 may also include an ear bud jack to provide communication of the peripheral ear bud with the computer device.
The input unit 406 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 407 is used to power the various components of the computer device 400. Alternatively, the power supply 407 may be logically connected to the processor 401 through a power management system, so as to implement functions of managing charging, discharging, and power consumption management through the power management system. The power supply 407 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown in fig. 4, the computer device 400 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., and will not be described herein.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
As can be seen from the above, the computer device provided in the embodiment can improve the voice recognition efficiency.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium storing a plurality of computer programs that can be loaded by a processor to perform the steps in any of the speech recognition methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:
obtaining a game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to one keyword content, and N is a positive integer;
extracting features of the game voice to obtain voice features of the game voice;
in the N keyword paths, determining keyword probabilities of N keywords corresponding to the game voice based on the voice features;
in the non-keyword path, determining a non-keyword probability corresponding to the game voice based on the voice features; and
determining the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing game voice recognition of the player.
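The parallel decoding described by the steps above amounts to scoring the N keyword paths and the single filler (non-keyword) path on the same voice features and keeping the best-scoring path. A minimal Python sketch (illustrative only, not part of the patent text; `score_keyword_path` and `score_filler_path` are hypothetical stand-ins for the acoustic-model path scoring):

```python
# Sketch of the parallel-path decision: N keyword paths compete
# with one non-keyword (filler) path; the highest score wins.

def decode(speech_features, keyword_paths, score_keyword_path, score_filler_path):
    """Return the recognized keyword, or None for non-keyword content."""
    # Score each of the N keyword paths on the extracted features.
    keyword_scores = {kw: score_keyword_path(speech_features, kw)
                      for kw in keyword_paths}
    # Score the single non-keyword path connected in parallel.
    filler_score = score_filler_path(speech_features)
    best_kw = max(keyword_scores, key=keyword_scores.get)
    if filler_score >= keyword_scores[best_kw]:
        return None      # non-keyword content wins
    return best_kw       # keyword content wins

# Illustrative use with dummy log-probability scorers:
dummy_scores = {"attack": -12.0, "retreat": -30.0}
result = decode([0.1, 0.2], ["attack", "retreat"],
                lambda feats, kw: dummy_scores[kw],
                lambda feats: -20.0)
```

With these dummy scores the keyword path for "attack" outscores the filler path, so the keyword content is returned; a louder filler score would instead yield non-keyword content.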
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Since the computer programs stored in the storage medium can perform the steps in any speech recognition method provided in the embodiments of the present application, they can achieve the beneficial effects of any such method, as detailed in the previous embodiments and not repeated here.
The speech recognition methods, apparatuses, storage media, and computer devices provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the methods and core ideas of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and the scope of application in accordance with the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (13)

1. A method of speech recognition, comprising:
acquiring a game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to one keyword content, N is a positive integer, the non-keyword path comprises a non-keyword combined phoneme path, and the non-keyword combined phoneme path comprises a non-keyword single-phoneme path and a rollback path;
extracting features of the game voice to obtain voice features of the game voice;
in the N keyword paths, determining keyword probabilities of N keywords corresponding to the game voice based on the voice features;
in the non-keyword path, identifying the voice features through the non-keyword single-phoneme path, determining a non-keyword probability corresponding to the currently identified game voice, and, after returning to the starting point of the non-keyword path through the rollback path, identifying the voice features through the non-keyword single-phoneme path again until the non-keyword probability detection of the game voice is completed; and
determining the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing game voice recognition of the player.
2. The speech recognition method of claim 1, wherein the non-keyword path comprises i non-keyword combined phoneme paths, i being a positive integer, and the identifying the voice features through the non-keyword single-phoneme path in the non-keyword path and determining the non-keyword probability corresponding to the currently identified game voice comprises:
determining a non-keyword single-phoneme probability through the non-keyword single-phoneme path corresponding to each non-keyword combined phoneme path based on the voice features of the game voice, thereby obtaining i non-keyword single-phoneme probabilities for the i non-keyword combined phoneme paths; and
determining the non-keyword probability based on the i non-keyword single-phoneme probabilities of the i non-keyword combined phoneme paths.
3. The speech recognition method of claim 2, wherein each non-keyword combined phoneme path comprises j non-keyword single-phoneme paths connected in parallel, j being a positive integer, and the determining a non-keyword single-phoneme probability through the non-keyword single-phoneme path corresponding to each non-keyword combined phoneme path based on the voice features of the game voice, thereby obtaining i non-keyword single-phoneme probabilities for the i non-keyword combined phoneme paths, comprises:
determining a non-keyword single-phoneme probability in each non-keyword single-phoneme path based on the voice features of the game voice, thereby obtaining i×j non-keyword single-phoneme probabilities for the i non-keyword combined phoneme paths; and
the determining the non-keyword probability based on the i non-keyword single-phoneme probabilities of the i non-keyword combined phoneme paths comprises:
determining the non-keyword probability based on the i×j non-keyword single-phoneme probabilities of the i non-keyword combined phoneme paths.
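How the i×j single-phoneme scores collapse into one non-keyword probability is not fixed by the claim; a common choice for parallel filler phonemes is to keep the best-scoring phoneme on each combined path and then the best combined path overall. A minimal sketch under that assumption (illustrative Python, not part of the claims):

```python
# Sketch: combine i*j non-keyword single-phoneme log-probabilities
# into one non-keyword score. Assumption (not stated in the claim):
# the j parallel phonemes on each combined path compete, and so do
# the i combined paths, so both levels take a max.

def non_keyword_score(monophone_scores):
    """monophone_scores: list of i lists, each holding j log-probs."""
    per_path_best = [max(path) for path in monophone_scores]  # best of j per path
    return max(per_path_best)                                 # best of i paths

score = non_keyword_score([[-5.0, -2.0, -9.0],
                           [-4.0, -6.0]])
```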
4. The speech recognition method of claim 3, wherein each non-keyword single-phoneme path comprises k state models, k being a positive integer, the game voice comprises a plurality of voice segments, the voice features of the game voice comprise the voice features of the voice segments, each non-keyword single-phoneme path corresponds to a non-keyword single-phoneme model, and the determining a non-keyword single-phoneme probability through the non-keyword single-phoneme path corresponding to each non-keyword combined phoneme path based on the voice features of the game voice comprises:
determining the voice segment corresponding to each state model;
for each non-keyword single-phoneme path, calculating, based on the state models in the non-keyword single-phoneme path, the state probability of the state corresponding to each voice segment according to the voice features of the voice segment, thereby determining k state probabilities corresponding to the k state models in the non-keyword single-phoneme path; and
determining the non-keyword single-phoneme probability corresponding to the non-keyword single-phoneme path based on the k state probabilities.
5. The speech recognition method of claim 3, wherein each non-keyword combined phoneme path further comprises a non-speech path, and the determining the non-keyword probability based on the i×j non-keyword single-phoneme probabilities of the i non-keyword combined phoneme paths further comprises:
determining a non-speech probability corresponding to each non-speech path based on the voice features of the game voice, thereby obtaining i non-speech probabilities for the non-speech paths in the i non-keyword combined phoneme paths; and
determining the non-keyword probability based on the i×j non-keyword single-phoneme probabilities of the i non-keyword combined phoneme paths and the i non-speech probabilities of the non-speech paths in the i non-keyword combined phoneme paths.
6. The speech recognition method of claim 1, wherein each keyword path comprises x state models, and the determining, in the N keyword paths, keyword probabilities of N keywords corresponding to the game voice based on the voice features comprises:
determining the voice segment corresponding to each state model in the keyword path;
calculating, based on the state models in the keyword path, the state probability of the voice segment corresponding to each state model according to the voice features of that voice segment, thereby determining x state probabilities corresponding to the x state models in the keyword path; and
determining the keyword probability corresponding to the keyword path based on the x state probabilities corresponding to the x state models in the keyword path, thereby determining the keyword probabilities of the N keywords corresponding to the game voice.
7. The speech recognition method of claim 4 or 6, wherein the state model is composed of a hidden Markov model combined with a Gaussian mixture model or a deep neural network model.
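In the HMM-with-GMM variant of claim 7, each HMM state assigns an observed feature vector an emission probability from a Gaussian mixture. A minimal sketch of a diagonal-covariance GMM log-likelihood for one state (illustrative Python; the weights, means, and variances are made-up values, not trained model parameters):

```python
import math

# Sketch of a GMM emission score for one HMM state, assuming
# diagonal covariances. Parameters below are illustrative only.

def gmm_log_likelihood(x, weights, means, variances):
    """Log p(x | state) under a diagonal-covariance Gaussian mixture."""
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        # Log-determinant and Mahalanobis distance for one component.
        log_det = sum(math.log(2 * math.pi * v) for v in var)
        maha = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
        total += w * math.exp(-0.5 * (log_det + maha))
    return math.log(total)

ll = gmm_log_likelihood([0.0, 0.0],
                        weights=[0.5, 0.5],
                        means=[[0.0, 0.0], [1.0, 1.0]],
                        variances=[[1.0, 1.0], [1.0, 1.0]])
```

Scores of this form, accumulated over a path's k (or x) states, give the per-path state probabilities that the claims combine into phoneme and keyword probabilities.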
8. The speech recognition method of claim 1, wherein the determining the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability comprises:
comparing the probability values of the non-keyword probability and the keyword probabilities of the N keywords to determine a maximum probability result;
when the maximum probability result is the non-keyword probability, determining that the voice content of the game voice is non-keyword content; and
when the maximum probability result is one of the N keyword probabilities, determining the keyword path corresponding to the maximum probability result, and determining the voice content of the game voice to be the keyword content corresponding to that keyword path.
9. The speech recognition method of claim 8, wherein the determining the voice content of the game voice, thereby realizing game voice recognition of the player, comprises:
when the voice content of the game voice is the keyword content corresponding to the keyword path, determining a game interaction operation corresponding to the keyword content; and
causing the game character operated by the player to execute the game interaction operation in the game, thereby realizing voice interaction between the player and the game.
10. The speech recognition method of claim 8, wherein the determining the voice content of the game voice, thereby realizing game voice recognition of the player, comprises:
when the voice content of the game voice is the keyword content corresponding to the keyword path, performing a voice shielding operation on the game voice.
11. A speech recognition apparatus, comprising:
an acquisition unit, configured to acquire a game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to one keyword content, N is a positive integer, the non-keyword path comprises a non-keyword combined phoneme path, and the non-keyword combined phoneme path comprises a non-keyword single-phoneme path and a rollback path;
a feature unit, configured to extract features of the game voice to obtain voice features of the game voice;
a keyword unit, configured to determine, in the N keyword paths, keyword probabilities of N keywords corresponding to the game voice based on the voice features;
a non-keyword unit, configured to identify the voice features through the non-keyword single-phoneme path in the non-keyword path, determine a non-keyword probability corresponding to the currently identified game voice, return to the starting point of the non-keyword path through the rollback path, and identify the voice features through the non-keyword single-phoneme path again until the non-keyword probability detection of the game voice is completed; and
a recognition unit, configured to determine the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing game voice recognition of the player.
12. A terminal comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps of the speech recognition method according to any one of claims 1 to 10.
13. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the speech recognition method according to any one of claims 1 to 10.
CN202011126765.1A 2020-10-20 2020-10-20 Speech recognition method, device, terminal and storage medium Active CN112259077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011126765.1A CN112259077B (en) 2020-10-20 2020-10-20 Speech recognition method, device, terminal and storage medium


Publications (2)

Publication Number Publication Date
CN112259077A CN112259077A (en) 2021-01-22
CN112259077B true CN112259077B (en) 2024-04-09

Family

ID=74245149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011126765.1A Active CN112259077B (en) 2020-10-20 2020-10-20 Speech recognition method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN112259077B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096642A (en) * 2021-03-31 2021-07-09 南京地平线机器人技术有限公司 Speech recognition method and device, computer readable storage medium, electronic device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4977599A (en) * 1985-05-29 1990-12-11 International Business Machines Corporation Speech recognition employing a set of Markov models that includes Markov models representing transitions to and from silence
US5199077A (en) * 1991-09-19 1993-03-30 Xerox Corporation Wordspotting for voice editing and indexing
JP2009116075A (en) * 2007-11-07 2009-05-28 Xanavi Informatics Corp Speech recognition device
CN102194454A (en) * 2010-03-05 2011-09-21 富士通株式会社 Equipment and method for detecting key word in continuous speech
KR20130017542A (en) * 2011-08-11 2013-02-20 주식회사 씨에스 Real time voice recignition method for rejection ration and for reducing misconception
CN103943107A (en) * 2014-04-03 2014-07-23 北京大学深圳研究生院 Audio/video keyword identification method based on decision-making level fusion
CN103971678A (en) * 2013-01-29 2014-08-06 腾讯科技(深圳)有限公司 Method and device for detecting keywords
CN107230475A (en) * 2017-05-27 2017-10-03 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN108564941A (en) * 2018-03-22 2018-09-21 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium
CN109887498A (en) * 2019-03-11 2019-06-14 西安电子科技大学 Highway mouth term of courtesy methods of marking
CN111710337A (en) * 2020-06-16 2020-09-25 睿云联(厦门)网络通讯技术有限公司 Voice data processing method and device, computer readable medium and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4357867B2 (en) * 2003-04-25 2009-11-04 パイオニア株式会社 Voice recognition apparatus, voice recognition method, voice recognition program, and recording medium recording the same
JP4497834B2 (en) * 2003-04-28 2010-07-07 パイオニア株式会社 Speech recognition apparatus, speech recognition method, speech recognition program, and information recording medium


Also Published As

Publication number Publication date
CN112259077A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
US11475881B2 (en) Deep multi-channel acoustic modeling
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
Zeng et al. Effective combination of DenseNet and BiLSTM for keyword spotting
CN107195296B (en) Voice recognition method, device, terminal and system
Juang et al. Automatic recognition and understanding of spoken language-a first step toward natural human-machine communication
CN110853618A (en) Language identification method, model training method, device and equipment
WO2018227781A1 (en) Voice recognition method, apparatus, computer device, and storage medium
Sinith et al. Emotion recognition from audio signals using Support Vector Machine
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
CN111312245B (en) Voice response method, device and storage medium
CN110853617B (en) Model training method, language identification method, device and equipment
CN112259106A (en) Voiceprint recognition method and device, storage medium and computer equipment
CN110600014B (en) Model training method and device, storage medium and electronic equipment
CN110970036B (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
CN102945673A (en) Continuous speech recognition method with speech command range changed dynamically
CN113393828A (en) Training method of voice synthesis model, and voice synthesis method and device
WO2023029615A1 (en) Wake-on-voice method and apparatus, device, storage medium, and program product
CN110268471A (en) The method and apparatus of ASR with embedded noise reduction
Nandi et al. Parametric representation of excitation source information for language identification
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
CN113129867A (en) Training method of voice recognition model, voice recognition method, device and equipment
CN112259077B (en) Speech recognition method, device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant