CN112259077A - Voice recognition method, device, terminal and storage medium - Google Patents

Voice recognition method, device, terminal and storage medium

Info

Publication number
CN112259077A
CN112259077A (application CN202011126765.1A)
Authority
CN
China
Prior art keywords
keyword
voice
game
path
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011126765.1A
Other languages
Chinese (zh)
Other versions
CN112259077B (en)
Inventor
陈堆盛
丁涵宇
陈柱欣
林悦
黄伟贵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202011126765.1A
Publication of CN112259077A
Application granted
Publication of CN112259077B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/088 Word spotting

Abstract

The embodiment of the invention discloses a voice recognition method, a voice recognition device, a terminal and a storage medium. The embodiment of the invention can acquire the game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, and each keyword path corresponds to a keyword content; extract features from the game voice to obtain its voice features; determine, in the N keyword paths, the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; determine, in the non-keyword path, the non-keyword probability corresponding to the game voice based on the voice features; and determine the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing game voice recognition for the player. The decoding network provided by the embodiment of the invention can accurately and quickly identify both keywords and non-keywords, thereby improving speech recognition efficiency.

Description

Voice recognition method, device, terminal and storage medium
Technical Field
The invention relates to the field of electronic games, in particular to a voice recognition method, a voice recognition device, a terminal and a storage medium.
Background
With the popularization of electronic games and the diversification of game play, game voice has become an interaction medium between players and games. When playing an electronic game, a player places high demands on the real-time performance of speech recognition: for example, when the pronunciation corresponding to a keyword appears in the game voice stream, the game must quickly and accurately detect the keyword spoken by the player, so as to realize low-delay man-machine interaction.
However, when current speech recognition methods are applied to electronic game scenarios, recognition efficiency is often low.
Disclosure of Invention
The embodiment of the invention provides a voice recognition method, a voice recognition device, a terminal and a storage medium, which can improve the voice recognition efficiency.
The embodiment of the invention provides a voice recognition method, which comprises the following steps:
obtaining game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected with the non-keyword path in parallel, each keyword path corresponds to keyword content, and N is a positive integer;
extracting the characteristics of the game voice to obtain the voice characteristics of the game voice;
determining, in the N keyword paths, the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; and
in the non-keyword path, determining the non-keyword probability corresponding to the game voice based on the voice features;
and determining the voice content of the game voice based on the keyword probability and the non-keyword probability of the N keywords, thereby realizing game voice recognition of the player.
An embodiment of the present invention further provides a speech recognition device, including:
the acquisition unit, used for acquiring the game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to a keyword content, and N is a positive integer;
the characteristic unit is used for extracting the characteristics of the game voice to obtain the voice characteristics of the game voice;
the keyword unit is used for determining, in the N keyword paths, the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; and
the non-keyword unit is used for determining, in the non-keyword path, the non-keyword probability corresponding to the game voice based on the voice features;
and the recognition unit is used for determining the voice content of the game voice based on the keyword probability and the non-keyword probability of the N keywords, so that the game voice recognition of the player is realized.
In some embodiments, the non-keyword path includes i non-key combined phoneme paths, where i is a positive integer, and the non-keyword unit includes:
the combined phoneme subunit is used for determining the non-key combined phoneme probability in each non-key combined phoneme path based on the voice characteristics of the game voice so as to obtain i non-key combined phoneme probabilities of the i non-key combined phoneme paths;
and the non-keyword subunit is used for determining the probability of the non-keyword based on the i non-key combined phoneme probabilities of the i non-key combined phoneme paths.
In some embodiments, the combined phoneme subunit is configured to determine a non-key monophonic probability in each non-key monophonic path based on the speech features of the game speech, thereby obtaining i × j non-key monophonic probabilities for i non-key combined phoneme paths;
in some embodiments, the non-keyword sub-unit is configured to determine the non-keyword probability based on i x j non-key single-phoneme probabilities of the i non-key combined-phoneme paths.
In some embodiments, each non-key single-phone path includes k state models, where k is a positive integer; the game voice includes a plurality of voice segments, the voice features of the game voice include the voice features of those segments, and each non-key single-phone path includes a corresponding non-key single-phone model. The combined phoneme subunit is configured to:
determining a voice segment corresponding to the state model;
based on the state models in the non-key single-phone path, calculating the state probability of the state corresponding to each voice segment according to the voice features of that segment, thereby determining the k state probabilities corresponding to the k state models in the non-key single-phone path;
and determining the non-key single-phone probability corresponding to the non-key single-phone path based on the k state probabilities.
In some embodiments, the non-key combined phoneme path further includes a non-speech path, and the non-keyword subunit is further configured to:
determining the non-voice probability corresponding to each non-voice path based on the voice characteristics of the game voice, thereby obtaining i non-voice probabilities of the non-voice paths in the i non-key combined phoneme paths;
and determining the probability of the non-key words based on i x j non-key single phone probabilities of i non-key combined phone paths and i non-voice probabilities of non-voice paths in the i non-key combined phone paths.
In some embodiments, the keyword path includes x state models, and the keyword unit is configured to:
determining a voice segment corresponding to the state model in the keyword path;
based on the state model in the keyword path, calculating the state probability of the voice segment corresponding to the state model in the keyword path according to the voice characteristics of the voice segment corresponding to the state model in the keyword path, thereby determining x state probabilities corresponding to x state models in the keyword path;
and determining the keyword probability corresponding to the keyword path based on the x state probabilities corresponding to the x state models in the keyword path, thereby determining the keyword probability of the N keywords corresponding to the game voice.
In some embodiments, the state model is composed of a Hidden Markov Model together with either a Gaussian Mixture Model or a Deep Neural Network model.
In some embodiments, the recognition unit is configured to:
comparing the probability values of the non-keyword probability and the keyword probabilities of the N keywords to determine the maximum probability result;
when the maximum probability result is the non-keyword probability, determining the voice content of the game voice as the non-keyword content;
and when the maximum probability result is one keyword probability in the N keyword probabilities, determining a keyword path corresponding to the maximum probability result, and determining the voice content of the game voice as the keyword content corresponding to the keyword path.
In some embodiments, the recognition unit is configured to:
when the voice content of the game voice is the keyword content corresponding to the keyword path, determining game interaction operation corresponding to the keyword content;
the game character operated by the player is enabled to execute game interaction operation in the game, and therefore voice interaction between the player and the game is achieved.
In some embodiments, the recognition unit is configured to perform a voice masking operation on the game voice when the voice content of the game voice is the keyword content corresponding to the keyword path.
The embodiment of the invention also provides a terminal, which comprises a processor and a memory, wherein the memory stores a plurality of instructions; the processor loads the instructions from the memory to perform the steps of any of the speech recognition methods provided by the embodiments of the present invention.
The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to perform steps in any one of the voice recognition methods provided by the embodiments of the present invention.
The embodiment of the invention can acquire the game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to a keyword content, and N is a positive integer; extract features from the game voice to obtain its voice features; determine, in the N keyword paths, the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; determine, in the non-keyword path, the non-keyword probability corresponding to the game voice based on the voice features; and determine the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing game voice recognition for the player.
In the present invention, a decoding network is provided, which has a non-keyword path that can identify all different non-keywords, and a plurality of keyword paths corresponding to different keywords; the non-keyword path can identify non-keywords formed by combining a plurality of different phonemes, so that the decoding network can accurately and quickly identify the keywords and the non-keywords. Therefore, the voice recognition efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1a is a schematic view of a speech recognition method according to an embodiment of the present invention;
FIG. 1b is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 1c is a schematic diagram of a decoding network structure of a speech recognition method according to an embodiment of the present invention;
FIG. 1d is a schematic diagram of a GMM-HMM of a speech recognition method provided by an embodiment of the present invention;
FIG. 1e is a schematic diagram of a DNN-HMM of a speech recognition method provided by an embodiment of the present invention;
FIG. 1f is a schematic diagram of an HMM expansion structure of a speech recognition method according to an embodiment of the present invention;
FIG. 1g is a schematic diagram of a decoding network structure of a speech recognition method according to an embodiment of the present invention;
fig. 1h is a schematic diagram of a decoding network structure of a speech recognition method according to an embodiment of the present invention;
FIG. 2a is a schematic diagram of a path structure of a 3-gram phoneme language model of a speech recognition method according to an embodiment of the present invention;
FIG. 2b is a schematic diagram of a path structure of a 4-gram phoneme language model of a speech recognition method according to an embodiment of the present invention;
FIG. 3 is a first structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a voice recognition method, a voice recognition device, a terminal and a storage medium.
The speech recognition device may be specifically integrated in an electronic device, and the electronic device may be a terminal, a server, or the like. The terminal can be a mobile phone, a tablet computer, a smart Bluetooth device, a notebook computer, a Personal Computer (PC), or the like; the server may be a single server or a server cluster composed of a plurality of servers.
In some embodiments, the speech recognition apparatus may also be integrated into a plurality of electronic devices, for example, the speech recognition apparatus may be integrated into a plurality of servers, and the speech recognition method of the present invention is implemented by the plurality of servers.
In some embodiments, the server may also be implemented in the form of a terminal.
For example, referring to fig. 1a, the electronic device may be a notebook computer. The notebook computer may obtain the game voice of a player and a decoding network, the decoding network comprising a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponding to a keyword content, N being a positive integer; extract features from the game voice to obtain its voice features; determine, in the N keyword paths, the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; determine, in the non-keyword path, the non-keyword probability corresponding to the game voice based on the voice features; and determine the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing game voice recognition for the player.
The following are detailed below. The numbers in the following examples are not intended to limit the order of preference of the examples.
Artificial Intelligence (AI) is a technology that uses digital computers to simulate how humans perceive the environment, acquire knowledge, and use that knowledge, enabling machines to perform functions similar to human perception, reasoning, and decision making. Artificial intelligence technology mainly comprises computer vision, speech processing, natural language processing, machine learning, deep learning, and the like.
Speech processing technology includes Automatic Speech Recognition (ASR), a technology in which a computer, instead of the human ear, performs operations such as recognition and monitoring on target audio for further processing.
In this embodiment, a speech recognition method based on ASR is provided, as shown in fig. 1b, a specific flow of the speech recognition method may be as follows:
101. the method comprises the steps of obtaining game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected with the non-keyword path in parallel, each keyword path corresponds to keyword content, and N is a positive integer.
There are various ways to acquire the game voice of a player, for example, by recording the player's voice with a recording device or a sound sensor: a smartphone may record the player's audio through its microphone; as another example, the game voice recorded by a terminal may be obtained through network communication with that terminal; and so on.
The game voice is an audio signal composed of a plurality of sentences spoken by the player. In the speech recognition process, some specific words in these sentences need to be recognized as Keywords, while the remaining speech needs to be recognized as non-keywords.
It should be noted that the keywords and non-keywords may be composed of sentences, words, etc., or may be a single word, syllable, etc.
In Chinese, each character may be composed of several phonemes (Phoneme), the smallest phonetic units divided according to the natural attributes of speech; for example, the phoneme of the Chinese character '啊' (a) is a1, the phonemes of '爱' (ai) are a2 and i, and the phonemes of '代' (dai) are d, a2 and i, and so on.
The method for acquiring the decoding network is various, for example, the decoding network stored in the database is acquired through communication between the network and the database; for another example, the preset network is trained by a machine learning method, so as to obtain a decoding network, and so on.
The method for training the preset network through the machine learning method to obtain the decoding network will be described below, and the network structures of the decoding network and the preset network are described first:
in this embodiment, referring to fig. 1a, the decoding network may include a non-keyword path and N keyword paths connected in parallel to the non-keyword path, where each keyword path may correspond to a keyword content, and N is a positive integer.
The preset network and the decoding network have the same network structure but different network parameters; that is, the decoding network can be obtained by training the preset network.
102. And extracting the characteristics of the game voice to obtain the voice characteristics of the game voice.
The voice feature of the game voice may include a Mel Frequency Cepstrum Coefficient (MFCC) feature, a Linear Prediction Coefficient (LPC) feature, a Line Spectrum Frequency (LSF) feature, and the like.
The MFCC is a cepstrum parameter extracted in the Mel-scale frequency domain; it describes the nonlinear characteristics of human auditory perception of frequency, and its formula is as follows:
Mel=2595*lg(1+f/700)
where f is frequency in hertz (Hz).
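A minimal sketch of this mapping (the function names below are chosen for this sketch and do not come from the patent):

```python
import math

def hz_to_mel(f: float) -> float:
    """Mel = 2595 * lg(1 + f / 700), per the formula above."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse mapping, obtained by solving the formula above for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# hz_to_mel(1000.0) is approximately 1000: 1 kHz maps to roughly 1000 Mel.
```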
For example, in some embodiments, the game speech may be subjected to frame segmentation in advance, so as to extract and obtain Mel-frequency cepstrum coefficient features of each frame of speech segment in the game speech, so the specific steps of step 102 are as follows:
1. and carrying out pre-emphasis processing on the game voice to obtain the game voice after the pre-emphasis processing.
First, the game speech may be passed through a high-pass filter to obtain the pre-emphasized game speech, where the transfer function H(z) of the filter is:
H(z) = 1 - u·z^(-1)
where u is a weight between 0.9 and 1.0.
The purpose of pre-emphasis is to boost the high-frequency part of the game speech and flatten its spectrum, so that the spectrum can later be computed over both the low-frequency and high-frequency bands with the same signal-to-noise ratio. It can also suppress part of the noise in the game speech and highlight the high-frequency formants.
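In the time domain this filter amounts to y[n] = x[n] - u·x[n-1]. A minimal sketch, where u = 0.97 is an assumed (commonly used) value within the 0.9-1.0 range stated above:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, u: float = 0.97) -> np.ndarray:
    """Apply the high-pass filter H(z) = 1 - u*z^(-1): y[n] = x[n] - u*x[n-1].
    Boosts the high-frequency part of the speech and flattens its spectrum."""
    return np.append(signal[0], signal[1:] - u * signal[:-1])
```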
2. And carrying out sliding window frame division processing on the game voice subjected to the pre-emphasis processing to obtain a plurality of voice fragments.
In some embodiments, in order to avoid too large a variation between two adjacent frames, a sliding-window framing method may be adopted so that there is an overlapping region between adjacent frames. Sliding-window framing means that a framing window of preset length slides in a single direction over the audio, and each time the window slides a fixed step, the audio inside it is collected as a voice segment.
For example, in some embodiments, a speech segment may be acquired in 10 ms framing steps using a 25 ms framing window.
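A sketch of this framing step with the 25 ms window and 10 ms step mentioned above; it assumes the signal is at least one window long:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 win_ms: float = 25.0, step_ms: float = 10.0) -> np.ndarray:
    """Slice the signal into overlapping frames (25 ms window, 10 ms step),
    so adjacent frames share a 15 ms overlap region."""
    win_len = int(sample_rate * win_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_frames = 1 + (len(signal) - win_len) // step
    return np.stack([signal[i * step:i * step + win_len]
                     for i in range(n_frames)])
```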
3. And carrying out fast Fourier transform on the voice segments to obtain frequency domain information of the voice segments.
Since the characteristics of an audio signal are difficult to observe from its time-domain waveform, in some embodiments the audio may be time-frequency transformed so as to convert it into an energy distribution in the frequency domain.
In some embodiments, in order to make the time-frequency transformation more efficient and faster, and further improve the efficiency of speech recognition, a fast fourier transform may be performed on the speech segment, so as to obtain frequency domain information of the speech segment, i.e., the energy distribution of the audio frequency on the frequency spectrum.
The formula for the fast fourier transform is as follows:
X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N),  0 ≤ k ≤ N-1
where X(k) is the output frequency-domain signal, x(n) is the input time-domain audio signal, and N is the number of points of the Fourier transform.
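A sketch of this step using NumPy's FFT. Applying a Hamming window to each frame before the transform is an assumed preprocessing detail, not something the patent states:

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Compute X(k) per frame and return |X(k)|^2 / N, the energy
    distribution of the audio on the frequency spectrum."""
    windowed = frames * np.hamming(frames.shape[1])   # assumed windowing step
    spectrum = np.fft.rfft(windowed, n=n_fft)         # X(k), k = 0 .. n_fft/2
    return (np.abs(spectrum) ** 2) / n_fft
```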
4. And performing Mel frequency cepstrum coefficient feature extraction on the frequency domain information to obtain the MFCC voice features of the voice fragments.
First, the frequency domain information may be passed through a set of Mel triangular filter banks. The filter bank smooths the spectrum and removes the effects of harmonics, highlighting the formants of the speech, so that differences in pitch do not affect the extracted features.
The Mel triangular filter bank may include M triangular filters, and the frequency response Hm(k) of the m-th triangular filter is defined as:
Hm(k) = 0,  for k < f(m-1)
Hm(k) = 2(k - f(m-1)) / [(f(m+1) - f(m-1))(f(m) - f(m-1))],  for f(m-1) ≤ k ≤ f(m)
Hm(k) = 2(f(m+1) - k) / [(f(m+1) - f(m-1))(f(m+1) - f(m))],  for f(m) ≤ k ≤ f(m+1)
Hm(k) = 0,  for k > f(m+1)
wherein f(m) denotes the center frequency of the m-th triangular filter, and the filters satisfy:
Σ_{m=0}^{M-1} Hm(k) = 1
the logarithmic energy s (m) of each filter bank output is then calculated as:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·Hm(k) ),  0 ≤ m ≤ M
wherein the number of filters M typically takes a value of 22-26.
Finally, the MFCC coefficient c (n) is obtained by Discrete Cosine Transform (DCT):
C(n) = Σ_{m=0}^{M-1} s(m)·cos( πn(m - 0.5)/M ),  n = 1, 2, …, L
where L is the coefficient order of the MFCC, typically 12-16.
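A sketch of these last two steps, assuming a precomputed Mel filter bank matrix built from the triangular responses Hm(k) above (its construction is omitted here); mel_fbank and the other names are illustrative:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_power(power: np.ndarray, mel_fbank: np.ndarray,
                    L: int = 13) -> np.ndarray:
    """power:     (frames, n_fft//2 + 1) power spectrum |X(k)|^2
    mel_fbank: (M, n_fft//2 + 1) triangular filter bank, M = 22-26 filters
    Returns the first L MFCC coefficients C(n) per frame (L = 12-16)."""
    s = np.log(power @ mel_fbank.T + 1e-10)    # log energies s(m) per frame
    return dct(s, type=2, axis=1, norm='ortho')[:, :L]   # DCT -> C(n)
```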
103. And determining the keyword probability of the N keywords corresponding to the game voice based on the voice characteristics in the N keyword paths.
Each keyword path corresponds to a keyword content, for example, referring to fig. 1c, the keyword corresponding to the keyword path 1 is "caution alert", the keyword corresponding to the keyword path 2 is "preparation", and the like.
In some embodiments, a plurality of state models may be included in the keyword path, the state models being composed of Hidden Markov Models (HMMs) together with Gaussian Mixture Models (GMMs) or Deep Neural Networks (DNNs).
The HMM is a model describing a Markov chain with hidden, unknown parameters. Its states cannot be observed directly, but can be inferred from a sequence of observation vectors, each observation vector being generated from a state according to a corresponding probability density distribution.
The GMM is a model that characterizes audio features with Gaussian probability density functions, thereby obtaining acoustic probabilities for the audio.
The DNN is a model that maps audio features to acoustic probabilities through deep learning.
For example, in some embodiments, as shown in FIG. 1d, the state model is a GMM-HMM model; in some embodiments, as shown in FIG. 1e, the state model is a DNN-HMM model.
The keyword path may include a plurality of state models, which may be used to predict the states of the audio segments. Since each phoneme may be represented by a plurality of states, the state models can represent a phoneme by predicting the states of a plurality of audio segments, so as to determine the phonemes contained in the audio segments.
Thus, the pronunciation of each character can be represented by HMM models with different parameters. For example, referring to fig. 1f, the phonemes of the character '注' (zhu) include the phoneme zh and the phoneme u, where the phoneme zh can be represented by states 1, 2, 3, and the phoneme u can be represented by states 4, 5, 6; the character '注' can therefore be composed of the HMM models of the phonemes zh and u, with the 3 HMM states of the phoneme zh corresponding to states 1, 2, 3, and the 3 HMM states of the phoneme u corresponding to states 4, 5, 6.
Thus, referring to FIG. 1g, the speech corresponding to keyword path 1 is "caution alert", and the keyword path includes the state models of the phonemes zh, u, y, i, j, ing, j, ie.
Wherein, the state in each HMM model can be set artificially according to the actual requirement.
Each state of the HMM corresponds to an observation, which may be a real number or a vector; in some embodiments, the observation is an MFCC feature.
After the MFCC feature sequence of the audio segment is input into the HMM models, a forward algorithm can be used to compute, for each HMM model, the probability that it generates the MFCC feature sequence, and the maximum such probability is taken as the keyword probability of the keyword path.
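A minimal sketch of the forward algorithm on one keyword path's HMM, computed in log space for numerical stability. The left-to-right start state and all names here are assumptions of this sketch, not details fixed by the patent:

```python
import numpy as np
from scipy.special import logsumexp

def forward_log_prob(log_trans: np.ndarray, log_obs: np.ndarray) -> float:
    """log_trans: (S, S) log transition probabilities (self-loops and
               left-to-right transitions between the path's states).
    log_obs:   (T, S) log probability of each frame's MFCC feature under
               each state's observation model (GMM or DNN).
    Returns the log probability that this HMM generated the sequence."""
    T, S = log_obs.shape
    alpha = np.full(S, -np.inf)
    alpha[0] = log_obs[0, 0]      # assume the path starts in its first state
    for t in range(1, T):
        # alpha'[s] = obs[t, s] + logsumexp over predecessors s' of
        #             (alpha[s'] + trans[s', s])
        alpha = log_obs[t] + logsumexp(alpha[:, None] + log_trans, axis=0)
    return float(logsumexp(alpha))

# The keyword probability of a path is then the maximum of these scores
# over the HMM models on that path.
```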
Thus, in some embodiments, x state models may be included in the keyword path, and the step "determining, in N keyword paths, a keyword probability of N keywords corresponding to the game speech based on the speech features" may include the steps of:
determining a voice segment corresponding to the state model in the keyword path;
based on the state model in the keyword path, calculating the state probability of the voice segment corresponding to the state model in the keyword path according to the voice characteristics of the voice segment corresponding to the state model in the keyword path, thereby determining x state probabilities corresponding to x state models in the keyword path;
and determining the keyword probability corresponding to the keyword path based on the x state probabilities corresponding to the x state models in the keyword path, thereby determining the keyword probability of the N keywords corresponding to the game voice.
Wherein, referring to FIG. 1d, the HMM can determine the next state based on the self-loop probability a(si, si) and the state transition probability a(si, si+1). The self-state probability can be obtained by the GMM, and the state transition probabilities of the HMM can be obtained from the adjacent self states.
104. In the non-keyword path, the non-keyword probability corresponding to the game voice is determined based on the voice features.
Referring to fig. 1g, the non-keyword path may include i non-key combined phoneme paths, where i is a positive integer. Each non-key combined phoneme path corresponds to a combined phoneme of a non-keyword, a combined phoneme being formed by combining multiple different phonemes or the same phoneme. For example, the meaningless speech "ao" may be determined to be a non-keyword and is recognized by the non-key combined phoneme path in the non-keyword path corresponding to the non-keyword "ao".
Step 104 may therefore include the following steps:
determining non-key combination phoneme probabilities in each non-key combination phoneme path based on the voice characteristics of the game voice, thereby obtaining i non-key combination phoneme probabilities of the i non-key combination phoneme paths;
the non-keyword probability is determined based on the i non-key combination phoneme probabilities of the i non-key combination phoneme paths.
The non-key combined phoneme path may include j non-key single-phone paths connected in parallel, where j is a positive integer and each non-key single-phone path corresponds to a non-key phoneme. For example, the meaningless speech "ao" may be composed of the phonemes a and o, and the non-keyword path corresponding to "ao" may be formed by connecting the non-key single-phone path corresponding to a and the non-key single-phone path corresponding to o in parallel.
Specifically, the step of determining the non-key combination phoneme probability in each non-key combination phoneme path based on the voice characteristics of the game voice so as to obtain i non-key combination phoneme probabilities of i non-key combination phoneme paths may include the following steps:
determining non-key single-phone probabilities in each non-key single-phone path based on the voice features of the game voice, thereby obtaining i x j non-key single-phone probabilities of i non-key combined phone paths;
thus, the step "determining the non-keyword probability based on i non-key combined phoneme probabilities of i non-key combined phoneme paths" may comprise the steps of:
the non-key word probabilities are determined based on i x j non-key single-phone probabilities of the i non-key combination phone paths.
For example, referring to fig. 1h, the non-keyword path consists of the non-key combined phoneme paths a, o, e …, eng, wherein the non-key combined phoneme path a in turn consists of non-key single-phone paths a, o, e …, eng.
In some embodiments, referring to fig. 1h, each non-key combined phoneme path further includes a fallback path, which may be used to repeatedly detect multiple meaningless non-key phonemes; this structure can effectively exhaust all non-keywords. For example, the speech "aaaa" may pass through the non-key single-phone path aa of the non-key combined phoneme path a, then return through the fallback path and pass through the non-key single-phone path aa of the non-key combined phoneme path a again, so as to be recognized.
In some embodiments, each non-key combined phoneme path may also include a non-speech path, which may be used to identify the non-speech portions of the player's speech. For example, the speech "a_" consists of the phoneme a and a non-speech part "_", and is recognized by the non-speech path of the non-key combined phoneme path a.
Therefore, in some embodiments, the non-key combination phoneme paths further include a non-speech path, and the determining the non-keyword probability based on the i × j non-key single-phoneme probabilities of the i non-key combination phoneme paths further includes:
determining the non-voice probability corresponding to each non-voice path based on the voice characteristics of the game voice, thereby obtaining i non-voice probabilities of the non-voice paths in the i non-key combined phoneme paths;
and determining the probability of the non-key words based on i x j non-key single phone probabilities of i non-key combined phone paths and i non-voice probabilities of non-voice paths in the i non-key combined phone paths.
Specifically, similar to the keyword path, in some embodiments each non-key single-phone path may include k state models, where k is a positive integer; the game voice includes a plurality of voice segments, the voice features of the game voice include the voice features of those segments, and each non-key single-phone path includes a corresponding non-key single-phone model. Determining the non-key single-phone probability corresponding to each non-key single-phone path based on the voice features of the game voice includes:
determining a voice segment corresponding to the state model;
based on the state models in the non-key single-phone path, calculating the state probability of the state corresponding to each voice segment according to the voice features of that segment, thereby determining the k state probabilities corresponding to the k state models in the non-key single-phone path;
and determining the non-key single-phone probability corresponding to the non-key single-phone path based on the k state probabilities.
105. And determining the voice content of the game voice based on the keyword probability and the non-keyword probability of the N keywords, thereby realizing game voice recognition of the player.
In some embodiments, step 105 may include the steps of:
comparing the probability values of the non-keyword probability and the keyword probabilities of the N keywords to determine the maximum probability result;
when the maximum probability result is the non-keyword probability, determining the voice content of the game voice as the non-keyword content;
and when the maximum probability result is one keyword probability in the N keyword probabilities, determining a keyword path corresponding to the maximum probability result, and determining the voice content of the game voice as the keyword content corresponding to the keyword path.
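A sketch of this decision rule; the probability values and keyword texts below are illustrative:

```python
def decode(non_kw_prob: float, kw_probs: dict) -> str:
    """Compare the non-keyword probability with the N keyword probabilities
    and return the recognized voice content."""
    best_kw, best_p = max(kw_probs.items(), key=lambda kv: kv[1])
    return best_kw if best_p > non_kw_prob else "<non-keyword>"

# decode(-42.0, {"caution alert": -37.5, "preparation": -55.1})
# -> "caution alert"
```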
The above describes the steps of detecting keywords in speech recognition. The scheme can detect the game voice stream in real time, and the detection may adopt a sliding-window detection method.
For example, in some embodiments, a detection window with a length of 1.5 seconds may be used to perform voice detection on the voice stream, and if no keyword is detected in the detection window of 1.5 seconds, the detection window is shifted backward by 0.5 seconds on the voice stream to continue the next detection; if a keyword is detected within the 1.5 second detection window, the detection window is shifted back 1.5 seconds on the voice stream.
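A sketch of this sliding detection loop over a sampled voice stream; the 16 kHz sample rate and the recognize callback (which would run the decoding network on one window) are assumptions of this sketch:

```python
def detect_stream(stream, recognize, win_s=1.5, step_s=0.5, sr=16000):
    """Slide a 1.5 s detection window over the voice stream: shift by 0.5 s
    when nothing is found, or by the full 1.5 s past a detected keyword."""
    win = int(win_s * sr)
    hits, pos = [], 0
    while pos + win <= len(stream):
        result = recognize(stream[pos:pos + win])
        if result != "<non-keyword>":
            hits.append((pos / sr, result))   # (time in seconds, keyword)
            pos += win                        # skip past the detected keyword
        else:
            pos += int(step_s * sr)
    return hits
```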
After the voice content is recognized, man-machine interaction can be carried out based on the voice content, so that the step of determining the voice content of the game voice so as to realize game voice recognition on a player comprises the following steps of:
when the voice content of the game voice is the keyword content corresponding to the keyword path, determining game interaction operation corresponding to the keyword content;
the game character operated by the player is enabled to execute game interaction operation in the game, and therefore voice interaction between the player and the game is achieved.
After the voice content is recognized, keyword shielding can be performed, so that the step of determining the voice content of the game voice so as to realize game voice recognition on the player can comprise the following steps:
and when the voice content of the game voice is the keyword content corresponding to the keyword path, performing voice shielding operation on the game voice. As can be seen from the above, the embodiment of the present invention can acquire a game voice of a player and a decoding network, where the decoding network includes a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to a keyword content, and N is a positive integer; extracting the characteristics of the game voice to obtain the voice characteristics of the game voice; determining keyword probabilities of N keywords corresponding to the game voice based on the voice characteristics in the N keyword paths; determining the probability of the non-keyword corresponding to the game language based on the voice characteristics in the non-keyword path; and determining the voice content of the game voice based on the keyword probability and the non-keyword probability of the N keywords, thereby realizing game voice recognition of the player.
Thus, in the present invention, a decoding network is provided, which has a non-keyword path that can absorb all different non-keywords, and a plurality of keyword paths corresponding to different keywords; the non-keyword path can absorb non-keywords formed by combining various different phonemes, so that the decoding network can accurately and quickly identify the keywords and the non-keywords. Therefore, the voice recognition efficiency is improved.
The method described in the above embodiments is further described in detail below.
In this embodiment, the method of the embodiment of the present invention will be described in detail by taking the voice operation of the game as an example.
When the scheme is applied to a game application, in order to balance the complexity and the accuracy of speech recognition according to the actual requirements of the game, the scheme adopts a GMM-HMM model for game scenarios that require a low recognition computation load and moderate accuracy, and a DNN-HMM model for game scenarios that require high recognition accuracy.
In order to recognize all keywords and non-keywords in the player's voice, referring to fig. 1h, the present solution proposes a decoding network that includes 2 keyword paths and 1 non-keyword path, where the first keyword path corresponds to the keyword "caution alert", the second keyword path corresponds to the keyword "preparation", and the non-keyword path includes the non-key combined phoneme paths a, o, e …, eng. The non-key combined phoneme path a comprises the non-key single-phone paths aa, ao, ae …, aeng; the non-key combined phoneme path o comprises the non-key single-phone paths oa, oo, oe …, oeng; …; and the non-key combined phoneme path eng comprises the non-key single-phone paths enga, engo, enge …, engeng.
It should be noted that in some embodiments, a combined phoneme may combine phonemes two by two, three by three, four by four, and so on. When phonemes are combined three by three, the non-keyword path may include non-key three-by-three combined phoneme paths, each non-key three-by-three combined phoneme path may include a plurality of non-key two-by-two combined phoneme paths, and each non-key two-by-two combined phoneme path may include a plurality of non-key single-phone paths.
Specifically, in terms of N-gram language models: when the non-keyword path includes only non-key single phonemes, it may be a 1-gram language model; when it includes two-by-two combined phonemes, it may be a 2-gram language model; when it includes three-by-three combined phonemes, a 3-gram language model; when it includes four-by-four combined phonemes, a 4-gram language model; and so on.
The N-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese, it is called the Chinese Language Model (CLM). The Chinese language model uses the collocation information between adjacent words in the context to realize automatic conversion to Chinese characters.
In the following examples, the combined phoneme path may be an N-gram language model:
for example, referring to fig. 2a, the non-keyword path includes non-key three-combined phoneme paths a, o, e …, eng; wherein, the non-key three-three combined phoneme path a comprises non-key two-two phoneme paths aa, ao, ae … and aeng; the non-key two-by-two phoneme path ao may include a non-key monophonic path aoa, aoo, aoe …, aoeng …, and so on.
Similarly, for example, referring to fig. 2b, non-key four-component phoneme paths a, o, e …, eng are included in the non-key word path; wherein, the non-key four-four combined phoneme path a comprises non-key three-three phoneme paths aa, ao, ae … and aeng; the non-critical tri-triphone path ao may include non-critical pairwise phoneme paths aoa, aoo, aoe …, aoeng …; the non-key two-by-two phoneme paths aoa may include non-key monophonic paths aoaa, aoao, aoae …, aoae ng … …, and so on.
Because the complexity of non-key four-by-four combined phoneme paths is high, in this scheme, in order to reduce the complexity of the non-keyword path and improve recognition efficiency, the non-keyword path includes only two-by-two combined phoneme paths.
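A sketch of how the 2-gram non-keyword structure could be enumerated; the six-phoneme inventory below is illustrative (a real system would use the full Mandarin phoneme set), and the fallback and non-speech paths described above are only noted, not modeled:

```python
# Illustrative subset of the phoneme inventory; "..." in the patent's figures
# stands for the remaining phonemes.
PHONEMES = ["a", "o", "e", "i", "u", "eng"]

def build_pairwise_paths(phonemes):
    """Enumerate the 2-gram non-keyword path: one combined phoneme path per
    first phoneme, each containing one single-phone path per second phoneme
    (plus, per the patent, a fallback path and a non-speech path)."""
    return {p1: [p1 + p2 for p2 in phonemes] for p1 in phonemes}

# build_pairwise_paths(PHONEMES)["a"] -> ["aa", "ao", "ae", ..., "aeng"]
```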
It should be noted that the non-keyword paths shown in fig. 2a and fig. 2b may also include a fallback path; for example, each time a non-key single-phone path is traversed, the decoding may return to the starting point of the non-keyword path through the fallback path, so that non-key phoneme detection can be performed repeatedly.
In some embodiments, after the non-keyword probabilities are obtained, the non-key phoneme combinations with lower probabilities may be discarded, so that the search scale is reduced when the keyword probabilities of the N keywords are compared with the non-keyword probability, thereby reducing the complexity of the non-keyword path.
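A sketch of such pruning; keeping a fixed top-K of the phoneme combinations is an assumption here, since the patent only says that low-probability combinations may be discarded:

```python
import heapq

def prune_paths(path_scores: dict, keep: int = 10) -> dict:
    """Keep only the 'keep' highest-scoring non-key phoneme combinations so
    the final keyword / non-keyword comparison searches a smaller set."""
    top = heapq.nlargest(keep, path_scores.items(), key=lambda kv: kv[1])
    return dict(top)
```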
In some embodiments, the method can achieve the purpose of detecting whether the voice data stream of the player in the game contains a certain part of keywords in real time and feeding back the detection result, thereby enriching the application scenes of the voice in the game, improving the interaction efficiency and the interestingness of the game and improving the game immersion of the player.
In some embodiments, the scheme can realize voice supervision, and realize the function of monitoring partial forbidden words in real time (such as dirty word filtering) so as to purify the game environment.
From the above, the present invention provides a decoding network for speech recognition based on phonemes, the decoding network having a non-keyword path capable of absorbing all different non-keywords, and a plurality of keyword paths corresponding to different keywords; the non-keyword path can absorb non-keywords formed by combining multiple phonemes in pairs, so that the decoding network can accurately and quickly identify the keywords and the non-keywords, and meanwhile, the complexity of voice recognition is reduced. Therefore, the voice recognition efficiency is improved.
In order to better implement the above method, an embodiment of the present invention further provides a speech recognition device, which may be specifically integrated in an electronic device; the electronic device may be a terminal, a server, or the like. The terminal can be a mobile phone, a tablet computer, a smart Bluetooth device, a notebook computer, a personal computer, or the like; the server may be a single server or a server cluster composed of a plurality of servers.
For example, in this embodiment, the method of the embodiment of the present invention will be described in detail by taking the case where the speech recognition device is specifically integrated in XX as an example.
For example, as shown in fig. 3, the speech recognition device may include an acquisition unit 301, a feature unit 302, a keyword unit 303, a non-keyword unit 304, and a recognition unit 305, as follows:
an acquisition unit 301.
The obtaining unit 301 may be configured to obtain a game voice of a player and a decoding network, where the decoding network may include a non-keyword path and N keyword paths connected in parallel to the non-keyword path, each keyword path corresponding to a keyword content, and N is a positive integer.
(2) Feature unit 302.
The feature unit 302 may be configured to perform feature extraction on the game speech to obtain a speech feature of the game speech.
(3) Keyword unit 303.
Keyword unit 303 may be configured to determine keyword probabilities of N keywords corresponding to the game voice based on the voice characteristics in the N keyword paths.
In some embodiments, x state models may be included in the keyword path, and the keyword unit 303 may be configured to:
determining a voice segment corresponding to the state model in the keyword path;
based on the state model in the keyword path, calculating the state probability of the voice segment corresponding to the state model in the keyword path according to the voice characteristics of the voice segment corresponding to the state model in the keyword path, thereby determining x state probabilities corresponding to x state models in the keyword path;
and determining the keyword probability corresponding to the keyword path based on the x state probabilities corresponding to the x state models in the keyword path, thereby determining the keyword probability of the N keywords corresponding to the game voice.
In some embodiments, the state model is composed of a Hidden Markov Model together with either a Gaussian Mixture Model or a Deep Neural Network model.
(4) Non-keyword unit 304.
The non-keyword unit 304 may be configured to determine, in the non-keyword path, the non-keyword probability corresponding to the game voice based on the voice features.
In some embodiments, the non-keyword path may include i non-key combined phoneme paths, where i is a positive integer, and the non-keyword unit 304 may include a combined phoneme subunit and a non-keyword subunit, as follows:
(1) the combined phoneme subunit may be configured to determine, based on a voice feature of the game voice, a non-key combined phoneme probability in each non-key combined phoneme path, so as to obtain i non-key combined phoneme probabilities of the i non-key combined phoneme paths.
(2) The non-keyword subunit may be configured to determine the non-keyword probability based on the i non-key combined phoneme probabilities of the i non-key combined phoneme paths.
In some embodiments, the combined phoneme subunit may be configured to determine the non-key single-phone probabilities in each non-key single-phone path based on the voice features of the game voice, thereby obtaining i x j non-key single-phone probabilities for the i non-key combined phoneme paths;
in some embodiments, the non-keyword sub-unit may be configured to determine the non-keyword probability based on i x j non-key single-phoneme probabilities of the i non-key combined-phoneme paths.
In some embodiments, each non-key single-phone path may include k state models, where k is a positive integer; the game voice may include a plurality of voice segments, the voice features of the game voice may include the voice features of those segments, and each non-key single-phone path may include a corresponding non-key single-phone model. The combined phoneme subunit may be configured to:
determining a voice segment corresponding to the state model;
based on the state models in the non-key single-phone path, calculating the state probability of the state corresponding to each voice segment according to the voice features of that segment, thereby determining the k state probabilities corresponding to the k state models in the non-key single-phone path;
and determining the non-key single-phone probability corresponding to the non-key single-phone path based on the k state probabilities.
In some embodiments, the non-key combination phoneme path may further include a non-speech path, and the non-keyword subunit may be further configured to:
determining the non-voice probability corresponding to each non-voice path based on the voice characteristics of the game voice, thereby obtaining i non-voice probabilities of the non-voice paths in the i non-key combined phoneme paths;
and determining the probability of the non-key words based on i x j non-key single phone probabilities of i non-key combined phone paths and i non-voice probabilities of non-voice paths in the i non-key combined phone paths.
(5) Recognition unit 305.
The recognition unit 305 may be configured to determine the voice content of the game voice based on the keyword probabilities and the non-keyword probabilities of the N keywords, thereby implementing game voice recognition for the player.
In some embodiments, the recognition unit 305 may be configured to:
comparing the probability values of the non-keyword probability and the keyword probabilities of the N keywords to determine the maximum probability result;
when the maximum probability result is the non-keyword probability, determining the voice content of the game voice as the non-keyword content;
and when the maximum probability result is one keyword probability in the N keyword probabilities, determining a keyword path corresponding to the maximum probability result, and determining the voice content of the game voice as the keyword content corresponding to the keyword path.
In some embodiments, the recognition unit 305 may be configured to:
when the voice content of the game voice is the keyword content corresponding to the keyword path, determining game interaction operation corresponding to the keyword content;
the game character operated by the player is enabled to execute game interaction operation in the game, and therefore voice interaction between the player and the game is achieved.
In some embodiments, the recognition unit 305 may be configured to perform a voice masking operation on the game voice when the voice content of the game voice is the keyword content corresponding to the keyword path.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, the speech recognition device of this embodiment includes an acquisition unit, configured to acquire the game voice of a player and a decoding network, where the decoding network includes a non-keyword path and N keyword paths connected in parallel with the non-keyword path, each keyword path corresponds to a keyword content, and N is a positive integer; a feature unit extracts features from the game voice to obtain its voice features; a keyword unit determines, in the N keyword paths, the keyword probabilities of the N keywords corresponding to the game voice based on the voice features; a non-keyword unit determines, in the non-keyword path, the non-keyword probability corresponding to the game voice based on the voice features; and a recognition unit determines the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing game voice recognition for the player.
Therefore, the embodiment of the invention can improve the voice recognition efficiency.
Correspondingly, the embodiment of the present application further provides a computer device, where the computer device may be a terminal or a server, and the terminal may be a terminal device such as a smart phone, a tablet computer, a notebook computer, a touch screen, a game machine, a Personal computer, and a Personal Digital Assistant (PDA).
As shown in fig. 4, fig. 4 is a schematic structural diagram of a computer device 400 according to an embodiment of the present application, where the computer device 400 includes a processor 401 having one or more processing cores, a memory 402 having one or more computer-readable storage media, and a computer program stored in the memory 402 and running on the processor. The processor 401 is electrically connected to the memory 402. Those skilled in the art will appreciate that the computer device configurations illustrated in the figures are not meant to be limiting of computer devices and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.
The processor 401 is a control center of the computer device 400, connects the respective parts of the entire computer device 400 using various interfaces and lines, performs various functions of the computer device 400 and processes data by running or loading software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the computer device 400 as a whole.
In this embodiment of the present application, the processor 401 in the computer device 400 loads instructions corresponding to the processes of one or more application programs into the memory 402, and runs the application programs stored in the memory 402, thereby implementing the following functions:
obtaining game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected with the non-keyword path in parallel, each keyword path corresponds to keyword content, and N is a positive integer;
performing feature extraction on the game voice to obtain voice features of the game voice;
determining, in the N keyword paths, keyword probabilities of N keywords corresponding to the game voice based on the voice features; and
determining, in the non-keyword path, a non-keyword probability corresponding to the game voice based on the voice features;
and determining the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing game voice recognition for the player.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
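Among these steps, the feature-extraction step is not tied to a particular feature type by this application; MFCC features are a common choice for such a front end. A minimal sketch using the librosa library (an assumed dependency, not named in this application):

```python
import librosa

def extract_features(wav_path: str):
    """Load the game voice and return a (frames, 13) MFCC feature matrix."""
    samples, sr = librosa.load(wav_path, sr=16000)            # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13)  # (13, frames)
    return mfcc.T                                             # one row per frame
```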
Optionally, as shown in fig. 4, the computer device 400 further includes: a touch display screen 403, a radio frequency circuit 404, an audio circuit 405, an input unit 406, and a power supply 407. The processor 401 is electrically connected to the touch display screen 403, the radio frequency circuit 404, the audio circuit 405, the input unit 406, and the power supply 407. Those skilled in the art will appreciate that the computer device structure shown in fig. 4 does not constitute a limitation on computer devices, and a computer device may include more or fewer components than those illustrated, combine certain components, or arrange components differently.
The touch display screen 403 may be used to display a graphical user interface and to receive operation instructions generated by a user acting on the graphical user interface. The touch display screen 403 may include a display panel and a touch panel. The display panel may be used to display information entered by or provided to the user, as well as the various graphical user interfaces of the computer device, which may be made up of graphics, text, icons, video, and any combination thereof. Optionally, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. The touch panel may be used to collect touch operations of the user on or near it (for example, operations performed by the user on or near the touch panel using a finger, a stylus, or any other suitable object or accessory) and to generate corresponding operation instructions that trigger the corresponding programs. Optionally, the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the direction of the user's touch, detects the signal produced by the touch operation, and passes the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 401, and it can also receive and execute commands sent by the processor 401. The touch panel may overlay the display panel; when the touch panel detects a touch operation on or near it, it transmits the operation to the processor 401 to determine the type of touch event, after which the processor 401 provides the corresponding visual output on the display panel according to that type. In this embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 403 to realize the input and output functions; in some embodiments, however, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions respectively. That is, the touch display screen 403 may also serve as a part of the input unit 406 to implement an input function.
In the embodiment of the present application, a game application is executed by the processor 401 to generate a graphical user interface on the touch display screen 403, where a virtual scene on the graphical user interface includes at least one skill control area, and the skill control area includes at least one skill control. The touch display screen 403 is used for presenting a graphical user interface and receiving an operation instruction generated by a user acting on the graphical user interface.
The radio frequency circuit 404 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with a network device or another computer device and exchange signals with that device.
The audio circuit 405 may be used to provide an audio interface between the user and the computer device through a speaker and a microphone. On one hand, the audio circuit 405 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts a collected sound signal into an electrical signal, which the audio circuit 405 receives and converts into audio data; the audio data is then processed by the processor 401 and sent, for example via the radio frequency circuit 404, to another computer device, or output to the memory 402 for further processing. The audio circuit 405 may also include an earphone jack to allow a peripheral headset to communicate with the computer device.
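On the software side, the microphone-to-audio-data leg of this path might look like the capture routine below; the sounddevice library is an assumption chosen for illustration, since the application describes only the hardware circuit:

```python
import sounddevice as sd

def record_game_voice(seconds: float = 2.0, sample_rate: int = 16000):
    """Record mono microphone audio into a float32 NumPy buffer."""
    buf = sd.rec(int(seconds * sample_rate), samplerate=sample_rate,
                 channels=1, dtype="float32")
    sd.wait()          # block until the recording completes
    return buf[:, 0]   # drop the channel axis
```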
The input unit 406 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 407 is used to power the various components of the computer device 400. Optionally, the power supply 407 may be logically connected to the processor 401 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system. The power supply 407 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, or any other such component.
Although not shown in fig. 4, the computer device 400 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described in detail herein.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
From the above, the computer device provided by the embodiment can improve the speech recognition efficiency.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions, or by relevant hardware controlled by instructions, where the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium, in which a plurality of computer programs are stored, and the computer programs can be loaded by a processor to execute the steps in any one of the speech recognition methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:
obtaining game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected with the non-keyword path in parallel, each keyword path corresponds to keyword content, and N is a positive integer;
performing feature extraction on the game voice to obtain voice features of the game voice;
determining, in the N keyword paths, keyword probabilities of N keywords corresponding to the game voice based on the voice features; and
determining, in the non-keyword path, a non-keyword probability corresponding to the game voice based on the voice features;
and determining the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing game voice recognition for the player.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the computer program stored in the storage medium can execute the steps in any of the speech recognition methods provided in the embodiments of the present application, the beneficial effects that can be achieved by any of the speech recognition methods provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described again here.
The speech recognition method, apparatus, storage medium, and computer device provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (13)

1. A speech recognition method, comprising:
obtaining game voice of a player and a decoding network, wherein the decoding network comprises a non-keyword path and N keyword paths connected with the non-keyword path in parallel, each keyword path corresponds to a keyword content, and N is a positive integer;
performing feature extraction on the game voice to obtain voice features of the game voice;
determining, in the N keyword paths, keyword probabilities of N keywords corresponding to the game voice based on the voice features; and
determining, in the non-keyword path, a non-keyword probability corresponding to the game voice based on the voice features;
and determining the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing game voice recognition for the player.
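Purely as an illustration of the topology recited in claim 1, the decoding network can be held in a small data structure: one non-keyword path in parallel with N keyword paths, each tagged with its keyword content. All field names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class KeywordPath:
    content: str       # the keyword content this path recognizes
    model: Any = None  # acoustic model of the path (see claims 4, 6 and 7)

@dataclass
class DecodingNetwork:
    non_keyword_path: Any = None                   # filler model (claims 2-5)
    keyword_paths: List[KeywordPath] = field(default_factory=list)

net = DecodingNetwork(keyword_paths=[KeywordPath("attack"), KeywordPath("retreat")])
print(len(net.keyword_paths))  # N = 2 keyword paths in parallel
```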
2. The speech recognition method of claim 1, wherein the non-keyword path comprises i non-key combination phoneme paths, i being a positive integer, and wherein determining, in the non-keyword path, the non-keyword probability corresponding to the game voice based on the voice features comprises:
determining a non-key combination phoneme probability in each non-key combination phoneme path based on the voice features of the game voice, thereby obtaining i non-key combination phoneme probabilities for the i non-key combination phoneme paths;
and determining the non-keyword probability based on the i non-key combination phoneme probabilities of the i non-key combination phoneme paths.
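Claim 2 leaves open how the i combination phoneme probabilities are reduced to a single non-keyword probability; taking the best-scoring path, Viterbi-style, is one plausible reading, sketched here in log-probability space:

```python
from typing import Sequence

def non_keyword_log_prob(path_log_probs: Sequence[float]) -> float:
    """Reduce i non-key combination phoneme path scores to one
    non-keyword score (here: the best path wins)."""
    return max(path_log_probs)

print(non_keyword_log_prob([-42.0, -37.5, -51.2]))  # -> -37.5
```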
3. The speech recognition method of claim 2, wherein each non-key combination phoneme path comprises j non-key monophone paths connected in parallel, j being a positive integer, and wherein determining a non-key combination phoneme probability in each non-key combination phoneme path based on the voice features of the game voice, thereby obtaining i non-key combination phoneme probabilities for the i non-key combination phoneme paths, comprises:
determining a non-key monophone probability in each non-key monophone path based on the voice features of the game voice, thereby obtaining i × j non-key monophone probabilities for the i non-key combination phoneme paths;
and wherein determining the non-keyword probability based on the i non-key combination phoneme probabilities of the i non-key combination phoneme paths comprises:
determining the non-keyword probability based on the i × j non-key monophone probabilities of the i non-key combination phoneme paths.
4. The speech recognition method of claim 3, wherein each non-key monophone path comprises k state models, k being a positive integer; the game voice comprises a plurality of voice segments, and the voice features of the game voice comprise voice features of the voice segments; each non-key monophone path comprises a corresponding non-key monophone model; and determining a non-key monophone probability in each non-key monophone path based on the voice features of the game voice comprises:
determining the voice segment corresponding to each state model;
calculating, based on each state model in the non-key monophone path, the state probability of the corresponding voice segment according to the voice features of that segment, thereby determining k state probabilities corresponding to the k state models in the non-key monophone path;
and determining the non-key monophone probability corresponding to the non-key monophone path based on the k state probabilities.
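Under one reading of claim 4, the k state probabilities combine multiplicatively over the voice segments aligned to them, i.e. additively in log space. A sketch with stand-in state scorers and a given alignment (the alignment procedure itself is not prescribed here):

```python
from typing import Callable, Sequence

import numpy as np

StateScorer = Callable[[np.ndarray], float]  # segment features -> log-probability

def monophone_log_prob(segments: Sequence[np.ndarray],
                       state_models: Sequence[StateScorer]) -> float:
    """Score one non-key monophone path: one voice segment per state model,
    with the k state log-probabilities summed into the path score."""
    assert len(segments) == len(state_models), "k segments for k state models"
    return float(sum(model(seg) for model, seg in zip(state_models, segments)))

segs = [np.zeros((10, 13)) for _ in range(3)]  # k = 3 aligned segments
models = [lambda s: -5.0, lambda s: -4.0, lambda s: -6.5]
print(monophone_log_prob(segs, models))        # -> -15.5
```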
5. The speech recognition method of claim 3, wherein each non-key combination phoneme path further comprises a non-voice path, and wherein determining the non-keyword probability based on the i × j non-key monophone probabilities of the i non-key combination phoneme paths further comprises:
determining the non-voice probability corresponding to each non-voice path based on the voice features of the game voice, thereby obtaining i non-voice probabilities for the non-voice paths in the i non-key combination phoneme paths;
and determining the non-keyword probability based on the i × j non-key monophone probabilities of the i non-key combination phoneme paths and the i non-voice probabilities of the non-voice paths therein.
6. The speech recognition method of claim 1, wherein each keyword path comprises x state models, x being a positive integer, and wherein determining, in the N keyword paths, keyword probabilities of N keywords corresponding to the game voice based on the voice features comprises:
determining the voice segment corresponding to each state model in the keyword path;
calculating, based on each state model in the keyword path, the state probability of the corresponding voice segment according to the voice features of that segment, thereby determining x state probabilities corresponding to the x state models in the keyword path;
and determining the keyword probability corresponding to the keyword path based on the x state probabilities, thereby determining the keyword probabilities of the N keywords corresponding to the game voice.
7. The speech recognition method of claim 4 or 6, wherein each state model is composed of a hidden Markov model combined with a Gaussian mixture model or a deep neural network model.
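For the Gaussian mixture variant of claim 7, the emission log-probability of a single feature frame under a diagonal-covariance mixture can be written out directly. This is textbook GMM scoring, shown only to make the state models concrete:

```python
import numpy as np

def gmm_log_prob(x: np.ndarray, weights: np.ndarray,
                 means: np.ndarray, variances: np.ndarray) -> float:
    """log p(x) under a diagonal-covariance Gaussian mixture model.
    x: (D,), weights: (M,), means and variances: (M, D)."""
    d = x.shape[0]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    comp = np.log(weights) + log_norm + log_exp  # per-component log-densities
    m = comp.max()
    return float(m + np.log(np.sum(np.exp(comp - m))))  # stable log-sum-exp
```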
8. The speech recognition method of claim 1, wherein determining the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability comprises:
comparing the probability values of the non-keyword probability and the keyword probabilities of the N keywords to determine a maximum probability result;
when the maximum probability result is the non-keyword probability, determining that the voice content of the game voice is non-keyword content;
and when the maximum probability result is one of the N keyword probabilities, determining the keyword path corresponding to the maximum probability result, and determining the voice content of the game voice to be the keyword content corresponding to that keyword path.
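The decision rule of claim 8 reduces to an argmax over N + 1 probability values; a direct transcription with hypothetical labels:

```python
from typing import Dict

def decide(keyword_probs: Dict[str, float], non_keyword_prob: float) -> str:
    """Return the content label of the maximum probability result."""
    best_kw = max(keyword_probs, key=keyword_probs.get)
    if non_keyword_prob >= keyword_probs[best_kw]:
        return "<non-keyword>"  # voice content is non-keyword content
    return best_kw              # keyword content of the winning keyword path

print(decide({"attack": -30.0, "retreat": -35.0}, -32.0))  # -> attack
```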
9. The speech recognition method of claim 8, wherein determining the voice content of the game voice, thereby realizing game voice recognition for the player, comprises:
when the voice content of the game voice is the keyword content corresponding to the keyword path, determining game interaction operation corresponding to the keyword content;
and enabling the game character operated by the player to execute the game interaction operation in the game, thereby realizing the voice interaction between the player and the game.
10. The speech recognition method of claim 8, wherein determining the voice content of the game voice, thereby realizing game voice recognition for the player, comprises:
when the voice content of the game voice is the keyword content corresponding to the keyword path, performing a voice masking operation on the game voice.
11. A speech recognition apparatus, comprising:
the game device comprises an acquisition unit, a decoding unit and a processing unit, wherein the acquisition unit is used for acquiring game voice of a player and a decoding network, the decoding network comprises a non-keyword path and N keyword paths connected with the non-keyword path in parallel, each keyword path corresponds to a keyword content, and N is a positive integer;
a feature unit, configured to perform feature extraction on the game voice to obtain voice features of the game voice;
a keyword unit, configured to determine, in the N keyword paths, keyword probabilities of N keywords corresponding to the game voice based on the voice features; and
a non-keyword unit, configured to determine, in the non-keyword path, a non-keyword probability corresponding to the game voice based on the voice features;
and a recognition unit, configured to determine the voice content of the game voice based on the keyword probabilities of the N keywords and the non-keyword probability, thereby realizing game voice recognition for the player.
12. A terminal comprising a processor and a memory, said memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps of the speech recognition method according to any one of claims 1 to 10.
13. A computer-readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the speech recognition method according to any one of claims 1 to 10.
CN202011126765.1A 2020-10-20 2020-10-20 Speech recognition method, device, terminal and storage medium Active CN112259077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011126765.1A CN112259077B (en) 2020-10-20 2020-10-20 Speech recognition method, device, terminal and storage medium


Publications (2)

Publication Number Publication Date
CN112259077A true CN112259077A (en) 2021-01-22
CN112259077B CN112259077B (en) 2024-04-09

Family

ID=74245149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011126765.1A Active CN112259077B (en) 2020-10-20 2020-10-20 Speech recognition method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN112259077B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096642A (en) * 2021-03-31 2021-07-09 南京地平线机器人技术有限公司 Speech recognition method and device, computer readable storage medium, electronic device

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4977599A (en) * 1985-05-29 1990-12-11 International Business Machines Corporation Speech recognition employing a set of Markov models that includes Markov models representing transitions to and from silence
US5199077A (en) * 1991-09-19 1993-03-30 Xerox Corporation Wordspotting for voice editing and indexing
US20040215454A1 (en) * 2003-04-25 2004-10-28 Hajime Kobayashi Speech recognition apparatus, speech recognition method, and recording medium on which speech recognition program is computer-readable recorded
US20040215458A1 (en) * 2003-04-28 2004-10-28 Hajime Kobayashi Voice recognition apparatus, voice recognition method and program for voice recognition
JP2009116075A (en) * 2007-11-07 2009-05-28 Xanavi Informatics Corp Speech recognition device
CN102194454A (en) * 2010-03-05 2011-09-21 富士通株式会社 Equipment and method for detecting key word in continuous speech
KR20130017542A (en) * 2011-08-11 2013-02-20 주식회사 씨에스 Real time voice recognition method for rejection ratio and for reducing misconception
CN103943107A (en) * 2014-04-03 2014-07-23 北京大学深圳研究生院 Audio/video keyword identification method based on decision-making level fusion
CN103971678A (en) * 2013-01-29 2014-08-06 腾讯科技(深圳)有限公司 Method and device for detecting keywords
CN107230475A (en) * 2017-05-27 2017-10-03 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server
CN108564941A (en) * 2018-03-22 2018-09-21 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium
CN109887498A (en) * 2019-03-11 2019-06-14 西安电子科技大学 Highway mouth term of courtesy methods of marking
CN111710337A (en) * 2020-06-16 2020-09-25 睿云联(厦门)网络通讯技术有限公司 Voice data processing method and device, computer readable medium and electronic equipment

Also Published As

Publication number Publication date
CN112259077B (en) 2024-04-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant