CN114333768A - Voice detection method, device, equipment and storage medium

Voice detection method, device, equipment and storage medium

Info

Publication number: CN114333768A
Application number: CN202111128507.1A
Authority: CN (China)
Prior art keywords: phoneme, voice, sequence, frame, speech
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 朱传聪, 孙思宁
Current assignee: Tencent Technology (Shenzhen) Co., Ltd.
Original assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.; priority to CN202111128507.1A; publication of CN114333768A.


Abstract

The application relates to the field of computer processing technology and provides a voice detection method, apparatus, device, readable storage medium and program product that can accurately determine the position of a keyword in speech while maintaining keyword detection efficiency. The method comprises the following steps: acquiring the candidate phoneme likelihoods corresponding to each speech frame in the speech to be detected; filtering out speech frames with phoneme blanks based on the candidate phoneme likelihoods corresponding to the speech frames, and obtaining a recognized phoneme sequence corresponding to the speech to be detected based on the candidate phoneme likelihoods corresponding to the speech frames in which phonemes exist; determining an extracted speech frame sequence corresponding to a target detection word in the speech to be detected based on the position, within the speech frame sequence of the speech to be detected, of the speech frame sequence corresponding to the phoneme sequence of the target detection word; acquiring the phoneme likelihood distribution corresponding to each extracted speech frame in the extracted speech frame sequence; and obtaining a voice detection result corresponding to the speech to be detected based on the phoneme likelihood distributions corresponding to the extracted speech frames.

Description

Voice detection method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer processing technologies, and in particular, to a method and an apparatus for voice detection, a computer device, a computer-readable storage medium, and a computer program product.
Background
With the advent of the intelligent age, artificial intelligence technology is widely used in various fields, such as image recognition and speech recognition. One of the technical branches of speech recognition based on artificial intelligence technology is keyword detection (Keyword Spotting). Keyword detection can be applied to voice wake-up systems, allowing a user to control a device without manual operation and making device control intelligent.
When voice needs to be detected, feature extraction can be performed on the voice, and a voice detection result is obtained based on the extracted features.
Disclosure of Invention
In view of the above, it is necessary to provide a voice detection method, apparatus, computer device, computer readable storage medium and computer program product for solving the above technical problems.
A method of speech detection, the method comprising: performing feature extraction on each speech frame in the speech to be detected to obtain the feature vectors corresponding to the speech frames, and ordering the feature vectors according to the order of the corresponding speech frames to obtain a feature vector sequence; performing phoneme recognition based on the feature vector sequence to obtain the candidate phoneme likelihoods corresponding to the speech frames; filtering out speech frames with phoneme blanks based on the candidate phoneme likelihoods corresponding to the speech frames, and obtaining a recognized phoneme sequence corresponding to the speech to be detected based on the candidate phoneme likelihoods corresponding to the speech frames in which phonemes exist; using the phoneme sequence corresponding to a target detection word selected from the recognized phoneme sequence as a detection word phoneme sequence, and determining an extracted speech frame sequence corresponding to the target detection word in the speech to be detected based on the position of the speech frame sequence corresponding to the detection word phoneme sequence within the speech frame sequence of the speech to be detected; acquiring the phoneme likelihood distribution corresponding to each extracted speech frame in the extracted speech frame sequence, the phoneme likelihood distribution corresponding to an extracted speech frame being obtained by performing speech frame phoneme detection separately on the feature vector corresponding to that extracted speech frame; and obtaining a speech detection result corresponding to the speech to be detected based on the phoneme likelihood distributions corresponding to the extracted speech frames.
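For concreteness, the following is a minimal Python sketch of this pipeline. The model objects (feature_extractor, sequence_recognizer, frame_phoneme_detector) and the helper functions (decode_non_frame_by_frame, find_keyword_span, extend_window, decode_frame_by_frame) are hypothetical names introduced here for illustration only; this is a sketch of the claimed flow, not a definitive implementation.

```python
import numpy as np

def detect_keyword(frames, feature_extractor, sequence_recognizer,
                   frame_phoneme_detector, keyword_phonemes, blank_threshold=0.5):
    # Step 1: per-frame feature extraction, kept in speech-frame order.
    features = np.stack([feature_extractor(f) for f in frames])

    # Step 2: phoneme recognition over the feature vector sequence;
    # one row of candidate phoneme likelihoods per frame, last column = blank.
    likelihoods = sequence_recognizer(features)            # shape (T, P + 1)

    # Step 3: filter out phoneme-blank frames and decode the rest into
    # the recognized phoneme sequence (non-frame-by-frame decoding).
    non_blank_idx = np.where(likelihoods[:, -1] < blank_threshold)[0]
    phoneme_seq = decode_non_frame_by_frame(likelihoods[non_blank_idx, :-1])

    # Step 4: locate the keyword's speech frames, then extend the span
    # forward and backward to form the extracted speech frame sequence.
    span = find_keyword_span(phoneme_seq, keyword_phonemes, non_blank_idx)
    if span is None:
        return None                                        # keyword absent
    start, end = extend_window(span, total_frames=len(frames))

    # Step 5: frame-wise phoneme detection on the extracted frames only.
    posteriors = frame_phoneme_detector(features[start:end + 1])

    # Step 6: frame-by-frame decoding search gives the exact position.
    return decode_frame_by_frame(posteriors, keyword_phonemes, offset=start)
```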
A speech detection apparatus, the apparatus comprising: a feature vector processing module, configured to perform feature extraction on each speech frame in the speech to be detected to obtain the feature vector corresponding to each speech frame, and to order the feature vectors according to the order of the corresponding speech frames to obtain a feature vector sequence; a candidate phoneme likelihood obtaining module, configured to perform phoneme recognition based on the feature vector sequence to obtain the candidate phoneme likelihoods corresponding to the speech frames; a phoneme sequence obtaining module, configured to filter out speech frames with phoneme blanks based on the candidate phoneme likelihoods corresponding to the speech frames, and to obtain a recognized phoneme sequence corresponding to the speech to be detected based on the candidate phoneme likelihoods corresponding to the speech frames in which phonemes exist; a speech frame extraction module, configured to use the phoneme sequence corresponding to a target detection word selected from the recognized phoneme sequence as a detection word phoneme sequence, and to determine, based on the position of the speech frame sequence corresponding to the detection word phoneme sequence within the speech frame sequence of the speech to be detected, an extracted speech frame sequence corresponding to the target detection word in the speech to be detected; a phoneme likelihood distribution obtaining module, configured to obtain the phoneme likelihood distribution corresponding to each extracted speech frame in the extracted speech frame sequence, the phoneme likelihood distribution corresponding to an extracted speech frame being obtained by performing speech frame phoneme detection separately on the feature vector corresponding to that extracted speech frame; and a speech detection result obtaining module, configured to obtain a speech detection result corresponding to the speech to be detected based on the phoneme likelihood distributions corresponding to the extracted speech frames.
In some embodiments, the speech frame extraction module is further configured to determine the starting position and the ending position, in the speech to be detected, of the speech frame sequence corresponding to the detection word phoneme sequence; take a position forward of the starting position as a first extraction position and a position backward of the ending position as a second extraction position; and take the speech frame sequence between the first extraction position and the second extraction position in the speech to be detected as the extracted speech frame sequence corresponding to the target detection word.
In some embodiments, the speech frame extraction module is further configured to take the number of speech frames with phoneme blanks in the speech to be detected as the blank frame count; determine a speech frame extension count based on the blank frame count, the speech frame extension count being positively correlated with the blank frame count; and take the forward position whose distance from the starting position equals the speech frame extension count as the first extraction position, and the backward position whose distance from the ending position equals the speech frame extension count as the second extraction position.
In some embodiments, the candidate phoneme likelihoods include a likelihood corresponding to the phoneme blank; the phoneme sequence obtaining module is further configured to take speech frames whose likelihood corresponding to the phoneme blank is greater than a likelihood threshold as the speech frames with phoneme blanks; filter out the phoneme blank speech frames to obtain decoded speech frames, and obtain a speech decoding network formed based on the candidate phoneme likelihoods corresponding to the decoded speech frames; and perform decoding based on the speech decoding network to obtain a target decoding path, arranging the phonemes traversed by the target decoding path in path order to obtain the recognized phoneme sequence corresponding to the speech to be detected.
In some embodiments, the feature vector processing module is further configured to input each speech frame in the speech to be detected into a feature extraction sub-model of a trained speech detection model for feature extraction, so as to obtain the feature vector corresponding to each speech frame; the trained speech detection model comprises the feature extraction sub-model, a speech sequence recognition sub-model and a speech frame phoneme detection sub-model, the speech sequence recognition sub-model and the speech frame phoneme detection sub-model each being connected to the feature extraction sub-model. The candidate phoneme likelihood obtaining module is further configured to input the feature vector sequence into the speech sequence recognition sub-model for phoneme recognition, so as to obtain the candidate phoneme likelihoods corresponding to the speech frames. The phoneme likelihood distribution obtaining module is further configured to extract the feature vectors corresponding to the extracted speech frames from the feature vector sequence, input them into the speech frame phoneme detection sub-model for phoneme recognition, and take the likelihood distribution of each phoneme class output by the speech frame phoneme detection sub-model for each extracted speech frame as the phoneme likelihood distribution corresponding to that extracted speech frame.
In some embodiments, the apparatus further comprises a speech detection model training module, configured to obtain training speech; input the training speech into a feature extraction sub-model to be trained for feature extraction to obtain the training feature vectors corresponding to the training speech frames, and order the training feature vectors according to the order of the corresponding speech frames to obtain a training vector sequence; input the training vector sequence into a speech sequence recognition sub-model to be trained for phoneme recognition to obtain a phoneme sequence recognition result, and obtain a first model loss value based on a first difference between the phoneme sequence recognition result and the standard recognition result corresponding to the training speech, the first model loss value being positively correlated with the first difference; input each training feature vector into a speech frame phoneme detection sub-model to be trained for phoneme recognition to obtain the corresponding speech frame phoneme detection result, and obtain a second model loss value based on a second difference between the speech frame phoneme detection result and the standard recognition result corresponding to the training speech, the second model loss value being positively correlated with the second difference; derive a target model loss value based on the first model loss value and the second model loss value; and adjust the parameters of the feature extraction sub-model to be trained, the speech frame phoneme detection sub-model to be trained and the speech sequence recognition sub-model to be trained based on the target model loss value, the three sub-models after parameter adjustment forming the trained speech detection model.
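This multi-task setup can be sketched as follows in PyTorch. The choice of a CTC loss for the sequence recognition sub-model, a cross-entropy loss for the frame-level detection sub-model, and a weighted sum as the target model loss are assumptions for illustration; the patent only requires each loss value to be positively correlated with its corresponding difference.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0)    # sequence loss; label 0 reserved for the blank
ce_loss = nn.CrossEntropyLoss()   # frame-level loss with no blank label

def target_model_loss(seq_log_probs, seq_targets, input_lens, target_lens,
                      frame_logits, frame_targets, alpha=0.5):
    # First model loss: difference between the phoneme sequence recognition
    # result and the standard (sequence-level) recognition result.
    loss1 = ctc_loss(seq_log_probs, seq_targets, input_lens, target_lens)
    # Second model loss: difference between the per-frame phoneme detection
    # results and the frame-level standard recognition result.
    loss2 = ce_loss(frame_logits, frame_targets)
    # Target model loss combines both; back-propagating it updates the
    # feature extraction, sequence recognition and frame detection sub-models.
    return alpha * loss1 + (1 - alpha) * loss2
```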
In some embodiments, the candidate phoneme likelihood obtaining module is further configured to sequentially take the current feature vector from the feature vector sequence according to the order of the feature vector sequence; acquire the phoneme representation vector corresponding to the feature vector preceding the current feature vector, the phoneme representation vector corresponding to the preceding feature vector being the representation vector of the phoneme obtained by inputting the preceding feature vector into the speech sequence recognition sub-model for phoneme recognition; and input the phoneme representation vector corresponding to the preceding feature vector together with the current feature vector into the speech sequence recognition sub-model for phoneme recognition, so as to obtain the candidate phoneme likelihoods corresponding to the current speech frame corresponding to the current feature vector.
In some embodiments, the apparatus further includes a phoneme representation vector obtaining module, configured to input the preceding feature vector into the speech sequence recognition sub-model for phoneme recognition to obtain the candidate phoneme likelihoods corresponding to the preceding feature vector, the candidate phoneme likelihoods comprising a likelihood for each candidate phoneme in the candidate phoneme set; select the candidate phoneme with the maximum likelihood from the candidate phoneme set as the target phoneme; and take the phoneme representation vector corresponding to the target phoneme as the phoneme representation vector corresponding to the preceding feature vector.
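A minimal sketch of this autoregressive step, assuming a hypothetical recognizer callable and a hypothetical phoneme_embeddings lookup table mapping a phoneme id to its representation vector (both names are illustrative, not part of the claims):

```python
import numpy as np

def recognize_sequence(features, recognizer, phoneme_embeddings, blank_id):
    # phoneme_embeddings: table mapping phoneme id -> representation vector.
    prev_repr = phoneme_embeddings[blank_id]    # initial "previous phoneme"
    all_likelihoods = []
    for feat in features:                       # in feature-sequence order
        # The current frame's candidate phoneme likelihoods depend on the
        # current feature vector and the previous phoneme's representation.
        likelihoods = recognizer(feat, prev_repr)
        all_likelihoods.append(likelihoods)
        # Target phoneme = candidate with maximum likelihood; its
        # representation vector conditions the next step.
        prev_repr = phoneme_embeddings[int(np.argmax(likelihoods))]
    return np.stack(all_likelihoods)
```

The loop continues until every feature vector in the sequence has undergone phoneme recognition, which is the behavior the continuation module below describes.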
In some embodiments, the apparatus further includes a phoneme recognition continuation module, configured to, when feature vectors in the feature vector sequence have not yet undergone phoneme recognition, return to the step of sequentially taking the current feature vector from the feature vector sequence according to the order of the feature vector sequence, so as to perform phoneme recognition on the remaining feature vectors until all feature vectors in the feature vector sequence have undergone phoneme recognition.
In some embodiments, the speech detection result obtaining module is further configured to obtain the detection word position of the target detection word in the speech to be detected based on the phoneme likelihood distributions corresponding to the extracted speech frames; acquire the speech at positions backward of the detection word position in the speech to be detected as instruction detection speech; and perform voice instruction detection on the instruction detection speech to obtain the target voice instruction corresponding to the instruction detection speech, so that voice control can be performed based on the target voice instruction.
A computer device comprising a memory storing a computer program and a processor that implements the above method when executing the computer program.
A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the above method.
A computer program product comprising a computer program which, when executed by a processor, implements the above-described method.
The voice detection method, apparatus, computer device, computer-readable storage medium and computer program product perform feature extraction on each speech frame in the speech to be detected to obtain the feature vectors corresponding to the speech frames, and order the feature vectors according to the order of the corresponding speech frames to obtain a feature vector sequence; perform phoneme recognition based on the feature vector sequence to obtain the candidate phoneme likelihoods corresponding to the speech frames; filter out speech frames with phoneme blanks based on the candidate phoneme likelihoods corresponding to the speech frames, and obtain a recognized phoneme sequence corresponding to the speech to be detected based on the candidate phoneme likelihoods corresponding to the speech frames in which phonemes exist; use the phoneme sequence corresponding to a target detection word selected from the recognized phoneme sequence as a detection word phoneme sequence, and determine the extracted speech frame sequence corresponding to the target detection word in the speech to be detected based on the position of the speech frame sequence corresponding to the detection word phoneme sequence within the speech frame sequence of the speech to be detected; acquire the phoneme likelihood distribution corresponding to each extracted speech frame in the extracted speech frame sequence, each such distribution being obtained by performing speech frame phoneme detection separately on the feature vector corresponding to the extracted speech frame; and obtain a speech detection result corresponding to the speech to be detected based on the phoneme likelihood distributions corresponding to the extracted speech frames. Because the phoneme blank speech frames are filtered out based on the candidate phoneme likelihoods corresponding to the speech frames, and the recognized phoneme sequence is obtained based on the candidate phoneme likelihoods corresponding to the speech frames in which phonemes exist, the efficiency of obtaining the recognized phoneme sequence is improved, and so is the efficiency of obtaining the phoneme sequence corresponding to the target detection word from the recognized phoneme sequence. Then, the extracted speech frame sequence corresponding to the target detection word is determined based on the position of the speech frame sequence corresponding to the detection word phoneme sequence within the speech frame sequence of the speech to be detected, so that subsequent phoneme detection need not be performed on all speech frames, which improves efficiency. Moreover, since frame-by-frame phoneme detection is performed on the extracted speech frame sequence, the speech detection result obtained from the phoneme likelihood distribution of each extracted speech frame accurately reflects the position of the keyword in the speech, improving accuracy.
Drawings
FIG. 1(a) is a schematic diagram of an application scenario of a speech detection method in some embodiments;
FIG. 1(b) is a schematic flow chart of a speech detection method in some embodiments;
FIG. 2 is a schematic diagram of a speech decoding network for non-frame-by-frame decoding search in some embodiments;
FIG. 3 is a schematic diagram of a speech decoding network for frame-by-frame decoding search in some embodiments;
FIG. 4 is a flow diagram illustrating a method for speech detection in some embodiments;
FIG. 5 is a schematic diagram of a model architecture in some embodiments;
FIG. 6 is a schematic diagram of a model architecture in some embodiments;
FIG. 7 is a flow diagram illustrating a method for speech detection in some embodiments;
FIG. 8 is a diagram illustrating phoneme state transitions corresponding to keywords in some embodiments;
FIG. 9 is a diagram illustrating state transitions of speech frames in some embodiments;
FIG. 10 is a schematic diagram of model forward inference and model deployment in some embodiments;
FIG. 11 is a block diagram of a speech detection device in some embodiments;
FIG. 12 is a diagram of the internal structure of a computer device in some embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Reference in the present application to "embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments can be included in at least some embodiments of the present application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The terms involved in the solutions provided in this application are introduced below:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving and intelligent transportation.
The key technologies of Speech Technology are automatic speech recognition, speech synthesis and voiceprint recognition. Enabling computers to listen, see, speak and feel is a development direction of future human-computer interaction, and voice is regarded as one of the most promising modes of human-computer interaction.
Keyword detection (KWS) belongs to a branch of speech recognition technology. The keyword detection task mainly includes identifying whether a keyword exists in the speech and confirming the exact position of the keyword in the speech, where the position includes the start position or the end position of the keyword in the speech. The keyword can be a wake-up word: after a user utters the speech corresponding to the wake-up word, the speech recognition system is awakened and begins to detect the speech the user utters after the wake-up word, so as to determine the operation instruction issued by the user. Illustratively, if the wake-up word is "Xiaoming", then after the user utters the speech corresponding to "Xiaoming", the speech recognition system is awakened and begins to detect whether the user issues a voice operation instruction such as "turn on the air conditioner". The keyword can also be a voice operation instruction itself, in which case the user can speak the voice operation instruction directly without first speaking a wake-up word.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as Deep Neural Networks (DNN), belief networks, reinforcement learning, transfer learning, inductive learning and formula learning. Deep neural networks include the Feedforward Sequential Memory Network (FSMN) and the Convolutional Neural Network (CNN).
In the deep learning process, a Cross Entropy Loss Function (CE Loss Function) can be used for model training. Deep Learning can be divided into single-Task Learning and Multi-Task Learning (MTL) according to the number of tasks (tasks) that the model needs to execute; single-task learning is mainly to learn one task at a time, and multi-task learning is to put multiple related tasks together to learn, and to learn multiple tasks simultaneously.
The training of a speech recognition model includes acoustic model training, which can be "sequence-to-sequence" model training: the model is trained using only a training speech frame sequence as input and a training speech sequence as output, without frame-aligning the training speech in advance (i.e., without labeling each training speech frame with a phoneme).
Hidden Markov Models (HMMs) belong to Markov chains whose states cannot be observed directly but can be observed through a sequence of observation vectors; each observation vector is generated, with a corresponding probability density distribution, by a state in the state sequence. Thus, the hidden Markov model is a dual stochastic process: a hidden Markov chain with a certain number of states, and a set of explicit stochastic functions. In a hidden Markov model, the state is not directly visible, but the output, which depends on the state, is visible. Each state has a probability distribution over the possible output tokens, so the sequence of labels generated by an HMM provides information about the underlying sequence of states. Note that "hidden" refers to the sequence of states the model passes through, not the parameters of the model. Hidden Markov models are known for their application to temporal pattern recognition, such as speech, handwriting and gesture recognition, part-of-speech tagging, musical scores, partial discharges and bioinformatics.
The scheme provided by the embodiment of the application relates to a voice recognition technology based on artificial intelligence, and can be applied to driving scenes, intelligent home scenes, robot control scenes and other scenes. If the computing power of the equipment such as the vehicle, the intelligent home equipment, the robot and the like is strong, the voice detection method can be directly executed by the equipment such as the vehicle, the intelligent home equipment, the robot and the like; if the computing power of the vehicle, the smart home device, and the robot is weak, the vehicle, the smart home device, or the robot may send the collected voice to the server, so that the server executes the voice detection method, and the server feeds back a result obtained after executing the voice detection method to the device such as the vehicle, the smart home device, or the robot, as shown in fig. 1 (a). In the following description of the embodiments, a device that performs a voice detection method is referred to as a computer device, and the computer device may be a vehicle, a smart home device, a robot, a server, and the like.
The "sequence-to-sequence" model concerns the result of inputting a sequence to an output sequence, i.e. the "sequence-to-sequence" model concerns whether the predicted sequence output by the model itself is identical to the true sequence. In speech recognition, for a speech frame for which it is difficult to identify the corresponding phoneme, the "sequence-to-sequence" model assigns a phoneme blank label to the speech frame, but does not forcibly assign a phoneme-like label to the speech frame; based on the property that a "sequence-to-sequence" model can assign a phoneme blank label to a speech frame for which it is difficult to determine the corresponding phoneme, the "sequence-to-sequence" model can be used to predict whether a target detection word (i.e., a keyword) is present in the speech. As shown in fig. 1(b), after acquiring the speech uttered by the user, the computer device may perform framing processing on the speech to obtain N0To N1000Inputting the voice frame sequence into a sequence-to-sequence model; the phoneme existence prediction is carried out on the speech frame sequence by using a sequence-to-sequence model to obtain a sequence-to-sequence "The likelihood that the phoneme output by the model exists in each speech frame; if the possible degree of the phoneme existing in the speech frame is less than the preset value, the possible degree of the phoneme existing in the speech frame is considered to be lower, and the speech frame is a speech frame with phoneme blanks; and if the possible degree of the phoneme existing in the speech frame is more than or equal to the preset value, considering that the possible degree of the phoneme existing in the speech frame is higher, and regarding the speech frame as a speech frame with non-empty phonemes.
After performing phoneme existence prediction on each speech frame of the speech frame sequence using the "sequence-to-sequence" model, the likelihood that a phoneme exists in each speech frame is obtained, and the phoneme blank speech frames and phoneme non-empty speech frames are determined. As shown in fig. 1(b), the phoneme blank speech frames include N0, N1, N2 and N1000, etc., and the phoneme non-empty speech frames include N3, N4 and N999, etc.
Then, the computer device may filter out the phoneme blank speech frames and retain the phoneme non-empty speech frames to obtain a phoneme non-empty speech frame sequence, and perform a decoding search based on the likelihood distributions of the various phoneme classes over the phoneme non-empty speech frames, e.g., over N3, N4, N995, N996, N998 and N999. Since the decoding search at this stage targets phoneme non-empty speech frames and excludes phoneme blank speech frames, it is referred to as a non-frame-by-frame decoding search.
The computer device can determine whether the target detection word exists in the speech based on the non-frame-by-frame decoding search. For example, the target detection word "turn on the air conditioner" corresponds to the phoneme sequence "d a k ai k ong t iao"; if the non-frame-by-frame decoding search determines that the phoneme sequence "d a k ai k ong t iao" exists in the phoneme non-empty speech frame sequence of the speech, the speech can be considered to contain the target detection word "turn on the air conditioner".
After determining that the target detection word exists in the speech, the sequence of phoneme non-empty speech frames in which its phoneme sequence exists can be determined. Since the non-frame-by-frame decoding search can only determine the phonemes corresponding to the phoneme non-empty speech frames, and cannot determine the phonemes corresponding to the phoneme blank speech frames, only the approximate position of the target detection word in the speech can be determined (which can be understood as the approximate time at which the target detection word occurs in the speech); it cannot be determined whether phonemes of the target detection word also exist in the phoneme blank speech frames adjacent to the phoneme non-empty speech frames in which its phonemes exist, i.e., the exact position of the target detection word in the speech cannot be determined. Therefore, the computer device may determine the approximate time at which the target detection word occurs based on the sequence of phoneme non-empty speech frames in which its phoneme sequence exists, and regard the speech frames at consecutive times adjacent to that approximate time as speech frames in which phonemes of the target detection word may exist.
Illustratively, if the phoneme non-empty speech frame sequence in which the phoneme sequence "d a k ai k ong t iao" exists is the sequence of phoneme non-empty speech frames between N100 and N150, it can be determined that the approximate time at which the target detection word occurs in the speech is between N100 and N150. Therefore, the consecutive speech frames located before N100 may contain phonemes of the target detection word, e.g., each speech frame between N80 and N100; likewise, the consecutive speech frames located after N150 may contain phonemes of the target detection word, e.g., each speech frame between N150 and N180. Thus, the sequence of speech frames determined as possibly containing phonemes of the target detection word runs from N80 to N180; as shown in fig. 1(b), between N80 and N180, some speech frames are phoneme blank speech frames and some are phoneme non-empty speech frames.
After determining the sequence of speech frames in which phonemes of the target detection word may exist, the computer device obtains the likelihood distribution of each phoneme class over each speech frame in the sequence (which may be referred to as the phoneme likelihood distribution of each speech frame). For example, for N100, the likelihoods that the phonemes "a", "o", "e" and "d" exist in N100 can be obtained, forming a likelihood distribution, where the likelihood that the phoneme "a" exists in N100 characterizes the probability that the actual phoneme corresponding to N100 is "a".
In some embodiments, a pre-constructed phoneme class detection model may be utilized to determine the likelihood of the presence of classes of phonemes in each speech frame. The labels used in training the phoneme type detection model may include labels of various phonemes but not a label representing a phoneme blank, so that the phoneme type detection model may be used to forcibly predict phonemes corresponding to each speech frame, and a speech frame may not be predicted as a phoneme blank. Note that, the labels used in training the "sequence-to-sequence" model include labels representing phoneme blanks, and for a speech frame for which it is difficult to predict a corresponding phoneme, the "sequence-to-sequence" model marks a phoneme blank without forcibly assigning a label for a certain type of phoneme to the speech frame, so as to reduce the model loss value.
The computer device obtains the phoneme probability distribution of each speech frame in the speech frame sequence in which the phoneme of the target detection word may exist, then performs decoding search based on the phoneme probability distribution of each speech frame, and determines the phoneme corresponding to each speech frame in the speech frame sequence in which the phoneme of the target detection word may exist. It should be noted that the decoding search at this stage is performed for each continuous speech frame, and therefore, the decoding search at this stage is referred to as a frame-by-frame decoding search.
The above process of frame-by-frame decoding search is described with reference to fig. 2: based on the phoneme likelihood distribution of each speech frame in the sequence between N80 and N180 and the phoneme transition likelihoods between adjacent speech frames (the transition likelihoods can be given by a language model), a speech decoding network (also called a decoding graph) as shown in fig. 2 can be constructed. A decoding search is then performed on the speech decoding network to obtain the maximum probability path, and the phonemes located on the maximum probability path are taken as the phonemes corresponding to the respective speech frames. For example, if the first phoneme of the maximum probability path is "d", the phoneme "d" may be regarded as the phoneme corresponding to N80; likewise, similar processing may be performed for the other phonemes of the maximum probability path to determine the phoneme corresponding to each speech frame between N80 and N180, thereby accurately determining the position of the target detection word in the speech and realizing forced alignment of the speech.
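A minimal Viterbi-style sketch of this frame-by-frame decoding search, assuming per-frame phoneme log-likelihoods and a phoneme transition matrix from a language model as inputs (names and shapes are illustrative; the patent's decoding graph may differ in detail):

```python
import numpy as np

def viterbi_decode(frame_log_likelihoods, log_transitions):
    """frame_log_likelihoods: (T, P) log-likelihood of each phoneme per frame.
    log_transitions: (P, P) log-likelihood of moving between phonemes
    across adjacent frames. Returns the maximum-probability phoneme path."""
    T, P = frame_log_likelihoods.shape
    score = frame_log_likelihoods[0].copy()
    back = np.zeros((T, P), dtype=int)
    for t in range(1, T):
        # best predecessor for each phoneme state at frame t
        cand = score[:, None] + log_transitions            # (P, P)
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(P)] + frame_log_likelihoods[t]
    # trace back the maximum probability path; path[t] is frame t's phoneme
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```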
It should be noted that the process of the non-frame-by-frame decoding search is similar to that of the frame-by-frame decoding search; the main difference is the speech decoding network over which the decoding search is performed. The speech decoding network for the non-frame-by-frame decoding search is constructed based on the phoneme likelihood distributions of the phoneme non-empty speech frames (e.g., N3, N4, N995, N996, N998 and N999, etc.) and the phoneme transition likelihoods between adjacent speech frames.
In addition, if the "sequence-to-sequence" model outputs, for each speech frame, both the likelihood that the frame is a phoneme blank and the likelihoods that the various phoneme classes exist, then a speech frame whose phoneme blank likelihood is lower than the preset value can be regarded as a phoneme non-empty speech frame, and the likelihoods of the various phonemes output by the model for that frame can be taken as the frame's phoneme likelihood distribution. The speech decoding network shown in fig. 3 can then be constructed by combining the phoneme blank likelihoods (indicated by "-" in fig. 3) with the state transition likelihoods between adjacent speech frames. The state transitions between adjacent speech frames include: phoneme state to phoneme state, phoneme state to phoneme blank state, and phoneme blank state to phoneme blank state; the state transition likelihoods between adjacent speech frames can be given by the language model.
In some embodiments, as shown in fig. 4, a speech detection method is provided, which is described by taking the method as an example for being applied to a computer device, and includes the following steps:
step S402, extracting the characteristics of each voice frame in the voice to be detected to obtain the characteristic vectors corresponding to the voice frames, and sequencing the characteristic vectors corresponding to the voice frames according to the corresponding voice frame sequence to obtain a characteristic vector sequence.
The speech in which the existence of a target detection word is to be detected serves as the speech to be detected; it may be uttered by a user or produced by a sound-producing device. Framing the speech to be detected yields a plurality of speech frames. A speech frame may be represented by at least one of MFCC (Mel-Frequency Cepstral Coefficient) features, FBank (FilterBank) features, or energy features. The feature vector of a speech frame is a vector describing that frame, and may be extracted from at least one of the MFCC, FBank, or energy features using a feature extraction model.
The computer device may perform framing processing on the speech to be detected according to a preset frame length (e.g., 20 milliseconds) to obtain a plurality of speech frames. Then, the computer device can order the feature vectors of the speech frames according to the order of the speech frames to obtain the feature vector sequence.
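A minimal sketch of this framing and ordering step, assuming 16 kHz audio, the 20 ms frame length mentioned above, and leaving the per-frame feature computation abstract (feature_fn is a hypothetical placeholder for MFCC/FBank extraction):

```python
import numpy as np

def frame_and_featurize(samples, sample_rate=16000, frame_ms=20,
                        feature_fn=lambda frame: frame):
    frame_len = sample_rate * frame_ms // 1000            # 320 samples per frame
    n_frames = len(samples) // frame_len
    # split the speech to be detected into consecutive speech frames
    frames = np.asarray(samples)[:n_frames * frame_len].reshape(n_frames, frame_len)
    # per-frame feature extraction, kept in speech-frame order so the
    # result is the feature vector sequence
    return np.stack([feature_fn(frame) for frame in frames])
```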
Step S404, phoneme recognition is performed based on the feature vector sequence to obtain the candidate phoneme likelihoods corresponding to each speech frame.
A phoneme (phone) is the smallest phonetic unit divided according to the natural attributes of speech; analyzed in terms of articulatory actions, the pronunciation of a word can be composed of one or more phonemes. For example, for Chinese characters, each character may correspond to phonemes in the Chinese phonetic alphabet.
Phoneme recognition mainly detects the likelihood that a phoneme exists in each speech frame. The candidate phoneme likelihoods corresponding to a speech frame refer to the likelihoods that the candidate phonemes exist in the speech frame and the likelihood that no phoneme exists in it, and these likelihoods can be represented by probabilities; the candidate phoneme likelihoods may include the likelihood that the speech frame is a phoneme blank and the likelihoods that the various phoneme classes exist in the speech frame. Among the candidate phoneme likelihoods of a speech frame, if the phoneme blank likelihood is greater than or equal to the likelihood threshold, the speech frame may be considered a phoneme blank speech frame; if the phoneme blank likelihood is less than the likelihood threshold, the speech frame may be considered a phoneme non-empty speech frame (i.e., a speech frame in which a phoneme exists).
In this step, after obtaining the feature vector sequence corresponding to the speech frame sequence, the computer device performs phoneme recognition based on the feature vector sequence to detect the likelihood that a phoneme exists in each speech frame, obtaining the likelihoods that the various phoneme classes exist in the frame and the likelihood that the frame is a phoneme blank, which together form the candidate phoneme likelihoods of each speech frame.
Step S406, speech frames with phoneme blanks are filtered out based on the candidate phoneme likelihoods corresponding to the speech frames, so that a recognized phoneme sequence corresponding to the speech to be detected is obtained based on the candidate phoneme likelihoods corresponding to the speech frames in which phonemes exist.
A speech frame with a phoneme blank refers to a speech frame whose phoneme blank likelihood is greater than or equal to the likelihood threshold, and a speech frame in which a phoneme exists (i.e., a phoneme non-empty speech frame) refers to a speech frame whose phoneme blank likelihood is less than the likelihood threshold. For example, as shown in fig. 1(b), among the speech frames N0 to N1000, the phoneme blank speech frames include N0, N1, N2 and N1000, etc., and the phoneme non-empty speech frames include N3, N4 and N999, etc. The recognized phoneme sequence consists of the phonemes corresponding to the phoneme non-empty speech frames, e.g., the phonemes corresponding to N3, N4, N999 and the other non-empty speech frames.
The computer device may filter out the phoneme blank speech frames among N0 to N1000 and retain the phoneme non-empty speech frames to obtain a phoneme non-empty speech frame sequence, and then perform a decoding search based on the likelihood distributions of the various phoneme classes over the phoneme non-empty speech frames, e.g., performing the decoding search shown in fig. 3 based on the likelihood distributions over N3, N4, N995, N996, N998 and N999, to obtain the phonemes corresponding to the phoneme non-empty speech frames, which form the recognized phoneme sequence. Since the decoding search at this stage is performed on phoneme non-empty speech frames and does not involve phoneme blank speech frames, it is referred to as a non-frame-by-frame decoding search.
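A minimal sketch of this non-frame-by-frame decoding, shown as a greedy best-path decode over the retained non-empty frames with consecutive repeats collapsed; the patent's decoding network search is more general, so this greedy variant is illustrative only:

```python
import numpy as np

def decode_non_frame_by_frame(non_blank_likelihoods):
    """non_blank_likelihoods: (M, P) candidate phoneme likelihoods for the
    M phoneme non-empty speech frames, blank column already removed."""
    best = np.argmax(non_blank_likelihoods, axis=1)   # per-frame best phoneme
    phoneme_seq = []
    for p in best:
        # collapse consecutive repeats so each phoneme appears once
        if not phoneme_seq or phoneme_seq[-1] != p:
            phoneme_seq.append(int(p))
    return phoneme_seq
```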
Step S408, the phoneme sequence corresponding to the target detection word selected from the recognized phoneme sequence is used as the detection word phoneme sequence, and the extracted speech frame sequence corresponding to the target detection word in the speech to be detected is determined based on the position of the speech frame sequence corresponding to the detection word phoneme sequence within the speech frame sequence of the speech to be detected.
The target detection word is the keyword. A voice operation instruction can serve as the target detection word, e.g., the voice operation instruction "turn on the air conditioner"; a wake-up word, such as "Xiaoming", can also serve as the target detection word. The detection word phoneme sequence is the phoneme sequence corresponding to the target detection word; for example, if the target detection word is "turn on the air conditioner", the detection word phoneme sequence is "d a k ai k ong t iao".
The extracted speech frame sequence refers to a plurality of consecutive speech frames extracted from the speech to be detected, comprising the phoneme non-empty speech frame sequence corresponding to the detection word phoneme sequence and the phoneme blank speech frames adjacent to that sequence.
After the computer device obtains the recognized phoneme sequence formed from the phonemes corresponding to the phoneme non-empty speech frames, if it determines that the recognized phoneme sequence includes the detection word phoneme sequence "d a k ai k ong t iao", it can determine that the speech frame sequence corresponding to the detection word phoneme sequence is the sequence of phoneme non-empty speech frames between N100 and N150.
Then, based on the position of the sequence of phoneme non-empty speech frames between N100 and N150 within the speech frame sequence N0 to N1000 of the speech to be detected, it can be determined that phonemes of the target detection word may exist in the consecutive speech frames located before N100 and in the consecutive speech frames located after N150. Thus, the consecutive speech frames located before N100, each speech frame between N100 and N150, and the consecutive speech frames located after N150 can be taken as extracted speech frames to obtain the extracted speech frame sequence.
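A minimal sketch of forming this extraction window, using an illustrative fixed extension of 20 frames on each side (the patent also allows the extension count to depend on the blank frame count, as described in the summary above):

```python
def extract_window(start, end, total_frames, extension=20):
    # Extend forward of the keyword's first frame and backward of its
    # last frame, clamped to the bounds of the speech to be detected.
    first = max(0, start - extension)                 # e.g. 100 -> 80
    second = min(total_frames - 1, end + extension)   # e.g. 150 -> 180
    return first, second   # extracted speech frames: indices first..second
```

With the figures above, extract_window(100, 150, 1001) returns (80, 180), matching the N80-to-N180 example.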
Step S410, the phoneme likelihood distribution corresponding to each extracted speech frame in the extracted speech frame sequence is acquired.
The phoneme likelihood distribution corresponding to an extracted speech frame refers to the likelihood distribution of each phoneme class over that frame. For example, if the extracted speech frame is N100, the likelihoods that the phonemes "a", "o", "e" and "d" exist in N100 form the likelihood distribution of N100. The phoneme likelihood distribution corresponding to each extracted speech frame is obtained by performing speech frame phoneme detection (i.e., phoneme class detection) separately on the feature vector corresponding to that extracted speech frame.
In some embodiments, a phoneme class detection model may be utilized to determine the likelihood of each class of phonemes existing in each extracted speech frame, so as to obtain a phoneme likelihood distribution of each extracted speech frame. The labels used in training the phoneme type detection model may include labels of various phonemes but not a label representing a phoneme blank, so that the phoneme type detection model may be used to forcibly predict phonemes corresponding to each speech frame, and a speech frame may not be predicted as a phoneme blank.
After the computer device performs speech frame phoneme detection on the feature vector corresponding to each extracted speech frame, the phoneme likelihood distribution corresponding to each extracted speech frame is obtained.
Step S412, a speech detection result corresponding to the speech to be detected is obtained based on the phoneme likelihood distributions corresponding to the extracted speech frames.
The computer device obtains the phoneme likelihood distribution of each speech frame in the extracted speech frame sequence, then performs a decoding search based on these distributions and determines the phoneme corresponding to each extracted speech frame. It should be noted that the decoding search at this stage is performed over consecutive speech frames, and is therefore referred to as a frame-by-frame decoding search.
The above process of frame-by-frame decoding search is described with reference to fig. 2: if the extracted speech frame sequence is the sequence of speech frames between N80 and N180, the computer device may construct the speech decoding network (also called a decoding graph) shown in fig. 2 based on the phoneme likelihood distribution of each speech frame in the sequence and the phoneme transition likelihoods between adjacent speech frames (which can be given by a language model), and then perform a decoding search on the speech decoding network to obtain the maximum probability path. The phonemes located on the maximum probability path are taken as the phonemes corresponding to the respective speech frames; for example, if the first phoneme of the maximum probability path is "d", the phoneme "d" may be regarded as the phoneme corresponding to N80. Likewise, similar processing may be performed for the other phonemes of the maximum probability path to determine the phoneme corresponding to each speech frame between N80 and N180, thereby accurately determining the position of the target detection word in the speech and realizing forced alignment of the speech.
In the voice detection method, because the phoneme blank speech frames are filtered out based on the candidate phoneme likelihoods corresponding to the speech frames and the recognized phoneme sequence is obtained based on the candidate phoneme likelihoods corresponding to the speech frames in which phonemes exist, the phoneme blank speech frames need no subsequent processing, which improves the efficiency of obtaining the recognized phoneme sequence and, in turn, the efficiency of obtaining the phoneme sequence corresponding to the target detection word from the recognized phoneme sequence. Then, the extracted speech frame sequence corresponding to the target detection word is determined based on the position of the speech frame sequence corresponding to the detection word phoneme sequence within the speech frame sequence of the speech to be detected, so subsequent phoneme detection need not be performed on all speech frames, which improves efficiency. Moreover, since frame-by-frame phoneme detection is performed on the extracted speech frame sequence, the speech detection result obtained from the phoneme likelihood distribution of each extracted speech frame can accurately reflect the position of the keyword in the speech, improving accuracy.
In some embodiments, step S408 may include the following steps: determining the starting position and the ending position, in the speech to be detected, of the speech frame sequence corresponding to the detection word phoneme sequence; taking a position forward of the starting position as a first extraction position and a position backward of the ending position as a second extraction position; and taking the speech frame sequence between the first extraction position and the second extraction position in the speech to be detected as the extracted speech frame sequence corresponding to the target detection word.
The starting position can be understood as the frame time corresponding to the first speech frame, in the speech to be detected, of the speech frame sequence corresponding to the detection word phoneme sequence, and the ending position as the frame time corresponding to the last speech frame of that sequence.
The first extraction position is a position forward of the starting position, i.e., a frame time earlier than the frame time of the first speech frame in the speech frame sequence corresponding to the detection word phoneme sequence; the second extraction position is a position backward of the ending position, i.e., a frame time later than the frame time of the last speech frame of that sequence.
Illustratively, if the speech frame sequence corresponding to the detection word phoneme sequence is the sequence of phoneme non-empty speech frames between N100 and N150, where N100 is the first speech frame and N150 is the last speech frame of that sequence, then a frame time earlier than that of N100 (e.g., the 80th frame time) can be taken as the first extraction position, a frame time later than that of N150 (e.g., the 180th frame time) can be taken as the second extraction position, and each speech frame between N80 and N180 is taken as an extracted speech frame to obtain the extracted speech frame sequence.
In the above manner, the corresponding extraction positions are determined based on the starting position and the ending position, in the speech to be detected, of the speech frame sequence corresponding to the detection word phoneme sequence, which ensures the accuracy of the obtained extracted speech frame sequence while preserving phoneme detection efficiency.
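A minimal sketch combining these extraction positions with the blank-frame-based extension described in the summary above; the extension count grows with the number of phoneme blank frames, and the specific proportionality ratio used here is an assumption for illustration:

```python
def extraction_positions(start_pos, end_pos, num_blank_frames,
                         total_frames, ratio=0.02):
    # Speech frame extension count, positively correlated with the number
    # of phoneme blank speech frames in the speech to be detected.
    extension = max(1, int(num_blank_frames * ratio))
    first = max(0, start_pos - extension)                # forward of the start
    second = min(total_frames - 1, end_pos + extension)  # backward of the end
    return first, second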
In some embodiments, the first extraction position may also be the frame time of a phoneme non-empty speech frame in the speech to be detected that is earlier than and adjacent to the first speech frame in the speech frame sequence corresponding to the detection word phoneme sequence, where the number of phoneme non-empty speech frames between the frame time corresponding to the first extraction position and the frame time of that first speech frame is less than or equal to a preset value.
Illustratively, suppose that in the speech to be detected, after non-frame-by-frame decoding search is performed on the phoneme-non-empty speech frames, the speech frame sequence corresponding to the detection word phoneme sequence is determined to be the sequence of phoneme-non-empty speech frames between N1100 and N1150, where N1100 is the first speech frame in that sequence.
If the preset value is 1, and in the phoneme-non-empty speech frame sequence of the speech to be detected the previous phoneme-non-empty speech frame of N1100 is N1079, with the speech frames between N1079 and N1100 all being phoneme-blank speech frames, then the number of phoneme-non-empty speech frames between the 1079th frame time corresponding to N1079 and the 1100th frame time corresponding to N1100 is 0, which is less than the preset value. The 1079th frame time corresponding to N1079 can therefore be regarded as earlier than and adjacent to N1100, and the 1079th frame time is taken as the first extraction position.
If the preset value is 1, and in the phoneme-non-empty speech frame sequence of the speech to be detected the previous three phoneme-non-empty speech frames of N1100 are N1068, N1070 and N1079, with the speech frames between N1068 and N1070, between N1070 and N1079, and between N1079 and N1100 all being phoneme-blank speech frames, then: between the 1068th frame time corresponding to N1068 and the 1100th frame time corresponding to N1100 there are two phoneme-non-empty speech frames, N1070 and N1079, which exceeds the preset value, so the 1068th frame time is not taken as a frame time adjacent to N1100. Between the 1070th frame time corresponding to N1070 and the 1100th frame time there is one phoneme-non-empty speech frame, N1079, which exactly equals the preset value, so the 1070th frame time may be taken as a frame time adjacent to N1100. Between the 1079th frame time corresponding to N1079 and the 1100th frame time there is no phoneme-non-empty speech frame, i.e., the number is 0 and less than the preset value, so the 1079th frame time may also be taken as a frame time adjacent to N1100.
In some embodiments, if there are multiple frame times of phoneme-non-empty speech frames in the speech to be detected that are earlier than and adjacent to the first speech frame in the speech frame sequence corresponding to the detection word phoneme sequence, any one of these frame times may be selected as the first extraction position. For example, if such frame times are those of N1070 and N1079, one of them may be selected as the first extraction position.
In the above manner, the frame time of a phoneme-non-empty speech frame in the speech to be detected that is earlier than and adjacent to the first speech frame in the speech frame sequence corresponding to the detection word phoneme sequence is taken as the first extraction position, with the number of phoneme-non-empty speech frames between the frame time corresponding to the first extraction position and the frame time of that first speech frame being less than or equal to the preset value. This prevents the frame time of a phoneme-non-empty speech frame too far from the first speech frame from being taken as the first extraction position, avoids extracting an excessive number of speech frames, and reduces the complexity of the frame-by-frame decoding search.
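As an illustration only, the adjacency rule above can be sketched as follows, assuming the frame times of the phoneme-non-empty speech frames are known; the function and variable names are hypothetical.

```python
# Sketch: candidate first extraction positions are frame times of
# phoneme-non-empty frames earlier than the first matched frame, with at
# most `preset` non-empty frames strictly between them and the match.
def first_extraction_candidates(non_empty_times, first_match_time, preset=1):
    earlier = [t for t in non_empty_times if t < first_match_time]
    candidates = []
    for t in earlier:
        between = [u for u in earlier if t < u < first_match_time]
        if len(between) <= preset:
            candidates.append(t)
    return candidates

# Mirrors the example above: non-empty frames 1068, 1070, 1079 before 1100.
print(first_extraction_candidates([1068, 1070, 1079], 1100, preset=1))
# -> [1070, 1079]; 1068 is excluded (two non-empty frames lie in between)
```

The second extraction position can be chosen symmetrically, scanning later rather than earlier frame times.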
In some embodiments, the second extraction position may also be the frame time of a phoneme-non-empty speech frame in the speech to be detected that is later than, and adjacent to, the last speech frame in the speech frame sequence corresponding to the detection word phoneme sequence; here, the number of phoneme-non-empty speech frames between the frame time corresponding to the second extraction position and the frame time of that last speech frame is less than or equal to a preset value.
Illustratively, suppose that in the speech to be detected, after non-frame-by-frame decoding search is performed on the phoneme-non-empty speech frames, the speech frame sequence corresponding to the detection word phoneme sequence is determined to be the sequence of phoneme-non-empty speech frames between N1100 and N1500, where N1500 is the last speech frame in that sequence.
If the preset value is 1, and in the phoneme-non-empty speech frame sequence of the speech to be detected the next phoneme-non-empty speech frame of N1500 is N1550, with the speech frames between N1500 and N1550 all being phoneme-blank speech frames, then the number of phoneme-non-empty speech frames between the 1500th frame time corresponding to N1500 and the 1550th frame time corresponding to N1550 is 0, which is less than the preset value. The 1550th frame time corresponding to N1550 can therefore be regarded as later than and adjacent to N1500, and the 1550th frame time is taken as the second extraction position.
If the preset value is 1, and in the phoneme-non-empty speech frame sequence of the speech to be detected the next three phoneme-non-empty speech frames of N1500 are N1550, N1567 and N1579, with the speech frames between N1500 and N1550, between N1550 and N1567, and between N1567 and N1579 all being phoneme-blank speech frames, then: between the 1500th frame time corresponding to N1500 and the 1550th frame time corresponding to N1550 there is no phoneme-non-empty speech frame, i.e., the number is 0 and less than the preset value, so the 1550th frame time may be taken as a frame time adjacent to N1500. Between the 1500th frame time and the 1567th frame time corresponding to N1567 there is one phoneme-non-empty speech frame, N1550, which exactly equals the preset value, so the 1567th frame time may be taken as a frame time adjacent to N1500. Between the 1500th frame time and the 1579th frame time corresponding to N1579 there are two phoneme-non-empty speech frames, N1550 and N1567, which exceeds the preset value, so the 1579th frame time is not taken as a frame time adjacent to N1500.
In some embodiments, if there are multiple frame times of phoneme-non-empty speech frames in the speech to be detected that are later than and adjacent to the last speech frame in the speech frame sequence corresponding to the detection word phoneme sequence, any one of these frame times may be selected as the second extraction position. For example, if such frame times are those of N1550 and N1567, one of them may be selected as the second extraction position.
In the above manner, the frame time of a phoneme-non-empty speech frame in the speech to be detected that is later than and adjacent to the last speech frame in the speech frame sequence corresponding to the detection word phoneme sequence is taken as the second extraction position, with the number of phoneme-non-empty speech frames between the frame time corresponding to the second extraction position and the frame time of that last speech frame being less than or equal to the preset value. This prevents the frame time of a phoneme-non-empty speech frame too far from the last speech frame from being taken as the second extraction position, avoids extracting an excessive number of speech frames, and reduces the complexity of the frame-by-frame decoding search.
In some embodiments, the step of taking the forward position corresponding to the start position as the first extraction position and taking the backward position corresponding to the end position as the second extraction position may specifically include: taking the number of phoneme-blank speech frames in the speech to be detected as the blank frame number; determining the speech frame extension number based on the blank frame number, the speech frame extension number being positively correlated with the blank frame number; and taking the forward position at a distance of the speech frame extension number from the start position as the first extraction position, and taking the backward position at a distance of the speech frame extension number from the end position as the second extraction position.
The blank frame number refers to the number of phoneme-blank speech frames between the frame time corresponding to the first speech frame and the frame time corresponding to the last speech frame, where the first speech frame is the first speech frame in the speech frame sequence corresponding to the detection word phoneme sequence, and the last speech frame is the last speech frame in that sequence. For example, if the speech frame sequence corresponding to the detection word phoneme sequence is the sequence of phoneme-non-empty speech frames between N1100 and N1500, where N1100 is the first speech frame and N1500 the last, then the number of phoneme-blank speech frames between the 1100th frame time corresponding to N1100 and the 1500th frame time corresponding to N1500 is taken as the blank frame number.
The speech frame extension number may be the number of speech frames between the frame time corresponding to the first extraction position and the frame time corresponding to the first speech frame; since the frame time corresponding to the first extraction position is earlier than that of the first speech frame, extracting these speech frames is equivalent to extending forward from the first speech frame, so this number of speech frames may be called the speech frame extension number.
The speech frame extension number may likewise be the number of speech frames between the frame time corresponding to the second extraction position and the frame time corresponding to the last speech frame; since the frame time corresponding to the second extraction position is later than that of the last speech frame, extracting these speech frames is equivalent to extending backward from the last speech frame, so this number of speech frames may also be called the speech frame extension number.
The speech frame extension number is positively correlated with the blank frame number: the larger the blank frame number, the larger the speech frame extension number. For example, for blank frame numbers of 5 and 10 respectively, it follows from this positive correlation that the speech frame extension number corresponding to a blank frame number of 5 is smaller than that corresponding to a blank frame number of 10.
Illustratively, if the speech frame extension number is 10 and the speech frame sequence corresponding to the detection word phoneme sequence is the sequence of phoneme-non-empty speech frames between N100 and N150, where N100 is the first speech frame in that sequence and N150 the last, then the forward position at a distance of the speech frame extension number from the start position may be taken as the first extraction position, i.e., the position corresponding to N90, the 10th speech frame before N100; and the backward position at a distance of the speech frame extension number from the end position may be taken as the second extraction position, i.e., the position corresponding to N160, the 10th speech frame after N150.
In the above manner, the larger the blank frame number, the more speech frames the "sequence-to-sequence" model was unable to judge, and hence the larger the speech frame extension number; more phoneme-blank speech frames are thereby forcibly assigned phoneme class labels, which improves the accuracy of the forced speech alignment.
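One possible form of this positive correlation is sketched below; the linear rule and its coefficients are illustrative assumptions only, since the text specifies only the monotonic relationship.

```python
# Sketch: map the blank frame number to a speech frame extension number
# through any monotonically non-decreasing rule; a capped linear rule is
# assumed here for illustration.
def extension_count(num_blank_frames, base=5, ratio=0.5, cap=50):
    return min(cap, base + int(ratio * num_blank_frames))

print(extension_count(5))   # -> 7
print(extension_count(10))  # -> 10 (more blank frames, larger extension)
```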
In some embodiments, the candidate phoneme likelihoods include the likelihood corresponding to the phoneme blank, and the step S406 includes: taking speech frames whose likelihood corresponding to the phoneme blank is greater than a likelihood threshold as phoneme-blank speech frames; filtering out the phoneme-blank speech frames to obtain decoded speech frames, and obtaining a speech decoding network formed based on the candidate phoneme likelihoods corresponding to the decoded speech frames; and decoding based on the speech decoding network to obtain a target decoding path, arranging the phonemes passed by the target decoding path in path order to obtain the recognition phoneme sequence corresponding to the speech to be detected.
A decoded speech frame is a phoneme-non-empty speech frame: in the non-frame-by-frame decoding search stage, decoding search is performed only on the phoneme-non-empty speech frames, not on the phoneme-blank speech frames, so the phoneme-non-empty speech frames may be called decoded speech frames.
The candidate phoneme likelihoods corresponding to a decoded speech frame may be output by the "sequence-to-sequence" model; in that case, the candidate phoneme likelihoods may include the likelihood that each class of phoneme exists in the decoded speech frame and the likelihood that the decoded speech frame is a phoneme-blank speech frame.
Based on the candidate phoneme likelihoods corresponding to the decoded speech frames, a speech decoding network as shown in fig. 3 may be formed, and the probability of the corresponding directional edge between the states connecting different decoded speech frames in the speech decoding network may be given by the language model. The target decoding path is the path with the highest probability in the speech decoding network.
After the computer device forms the speech decoding network shown in fig. 3 based on the candidate phoneme possibility corresponding to the decoded speech frame, the computer device may perform decoding search on the speech decoding network, use the maximum probability path as the target decoding path, and arrange the phonemes passed by the target decoding path according to the path sequence to obtain the recognition phoneme sequence corresponding to the speech to be detected.
In the above manner, the speech decoding network is constructed from the candidate phoneme likelihoods of the decoded speech frames obtained by filtering out the phoneme-blank speech frames, which reduces decoding complexity and improves the recognition efficiency of the target detection word.
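A minimal sketch of the blank-filtering step, assuming the candidate phoneme likelihoods are held in a (T, V) array whose last column is the phoneme-blank likelihood; the threshold value is illustrative.

```python
import numpy as np

# Sketch: keep only "decoded speech frames" (phoneme-non-empty frames),
# whose blank likelihood does not exceed the threshold; only these enter
# the speech decoding network.
def select_decoded_frames(posteriors: np.ndarray, blank_threshold: float = 0.9):
    blank_prob = posteriors[:, -1]
    keep = blank_prob <= blank_threshold
    frame_indices = np.nonzero(keep)[0]           # original frame times
    return frame_indices, posteriors[keep, :-1]   # likelihoods for decoding

# The decoding search then runs over these frames only, and the highest
# probability path becomes the target decoding path.
```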
In some embodiments, the step of extracting features from each speech frame in the speech to be detected to obtain the feature vector corresponding to the speech frame may specifically include: inputting each speech frame in the speech to be detected into the feature extraction submodel of the trained speech detection model for feature extraction to obtain the feature vector corresponding to the speech frame. The step of performing phoneme recognition based on the feature vector sequence to obtain the candidate phoneme likelihoods corresponding to each speech frame may specifically include: inputting the feature vector sequence into the speech sequence recognition submodel for phoneme recognition to obtain the candidate phoneme likelihoods corresponding to each speech frame. The step of obtaining the phoneme likelihood distribution corresponding to each extracted speech frame in the extracted speech frame sequence may specifically include: extracting the feature vectors corresponding to the extracted speech frames from the feature vector sequence, and inputting them into the speech frame phoneme detection submodel for phoneme recognition; and taking the likelihood distribution over phoneme classes output by the speech frame phoneme detection submodel for each extracted speech frame as the phoneme likelihood distribution corresponding to that extracted speech frame.
As shown in fig. 5, the trained speech detection model includes a feature extraction submodel, and a speech sequence recognition submodel and a speech frame phoneme detection submodel respectively connected to the feature extraction submodel.
The feature extraction submodel is used for extracting features of the voice frame to obtain a feature vector of the voice frame.
The speech sequence recognition submodel is a "sequence-to-sequence" model, and the speech frame phoneme detection submodel is a phoneme class detection model. Their training differs as follows: the labels used when training the "sequence-to-sequence" model include a label representing the phoneme blank, and for speech frames whose corresponding phonemes are hard to predict, the "sequence-to-sequence" model marks them as phoneme blanks rather than forcibly assigning them a label of some phoneme class, so as to reduce the model loss value; the labels used when training the phoneme class detection model may include labels of the various phonemes but no label representing the phoneme blank, so the phoneme class detection model can be used to forcibly predict the phoneme corresponding to each speech frame, and no speech frame is predicted as a phoneme-blank speech frame. For other aspects of these two models, reference may be made to the description above, which is not repeated here.
The computer device inputs each speech frame in the speech to be detected into the feature extraction submodel of the trained speech detection model for feature extraction to obtain the feature vector corresponding to the speech frame; the computer device may also input the feature vector sequence into the speech sequence recognition submodel for phoneme recognition to obtain the candidate phoneme likelihoods corresponding to each speech frame. The computer device may extract the feature vectors corresponding to the extracted speech frames from the feature vector sequence and input them into the speech frame phoneme detection submodel for phoneme recognition, taking the likelihood distribution over phoneme classes output by the submodel for each extracted speech frame as the phoneme likelihood distribution corresponding to that extracted speech frame.
In the above manner, the same feature extraction submodel extracts features from the speech frames to obtain their feature vectors, and the "sequence-to-sequence" model and the phoneme class detection model both perform their phoneme processing on these shared feature vectors, which improves processing efficiency.
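The shared-feature structure can be sketched in PyTorch as below. This is an assumption-laden simplification: the sequence branch of the actual model combines a Predictor and a Joint layer (see fig. 6), which is collapsed here into a single linear head for brevity; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SpeechDetectionModel(nn.Module):
    """Sketch: one shared feature extraction submodel feeding two heads."""
    def __init__(self, feat_dim=80, hidden=256, num_phones=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)  # feature extraction submodel
        self.frame_head = nn.Linear(hidden, num_phones)      # speech frame phoneme detection (no blank)
        self.seq_head = nn.Linear(hidden, num_phones + 1)    # sequence recognition (+1 blank class)

    def forward(self, feats):               # feats: (B, T, feat_dim)
        enc, _ = self.encoder(feats)        # shared feature vector sequence
        return self.seq_head(enc), self.frame_head(enc)
```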
In some embodiments, the step of obtaining the trained speech detection model may specifically include: acquiring training speech; inputting the training speech into the feature extraction submodel to be trained for feature extraction to obtain the training feature vectors corresponding to the training speech frames, and ordering the training feature vectors according to the corresponding speech frame order to obtain a training vector sequence; inputting the training vector sequence into the speech sequence recognition submodel to be trained for phoneme recognition to obtain a phoneme sequence recognition result; obtaining a first model loss value based on a first difference between the phoneme sequence recognition result and the standard recognition result corresponding to the training speech, the first model loss value being positively correlated with the first difference; inputting each training feature vector into the speech frame phoneme detection submodel to be trained for phoneme recognition to obtain the speech frame phoneme detection result corresponding to that submodel; obtaining a second model loss value based on a second difference between the speech frame phoneme detection result and the standard recognition result corresponding to the training speech, the second model loss value being positively correlated with the second difference; obtaining a target model loss value based on the first model loss value and the second model loss value; and adjusting the parameters of the feature extraction submodel to be trained, the speech frame phoneme detection submodel to be trained and the speech sequence recognition submodel to be trained based on the target model loss value, the parameter-adjusted feature extraction submodel, speech frame phoneme detection submodel and speech sequence recognition submodel together forming the trained speech detection model.
When the speech sequence recognition submodel is trained, the input used may be a training vector sequence corresponding to a speech frame sequence, and the output used may be a training phoneme sequence corresponding to the speech frame sequence, where the training vector sequence and the training phoneme sequence do not need to be in one-to-one correspondence in time, that is, a corresponding phoneme label does not need to be assigned to each speech frame of the speech frame sequence.
The phoneme sequence recognition result obtained by the phoneme recognition of the speech sequence recognition submodel to be trained is the phoneme sequence present in the training speech; the standard recognition result of the training speech to be compared with the phoneme sequence recognition result is the training phoneme sequence, and the first difference represents the degree of difference between the phoneme sequence output by the speech sequence recognition submodel and the training phoneme sequence.
When training the voice frame phoneme detection submodel, the training vector sequence and the training phoneme sequence need to be in one-to-one correspondence in time, that is, a corresponding phoneme label needs to be given to each voice frame of the voice frame sequence; the loss function of the speech frame phoneme detection submodel may be a CE function.
The method comprises the steps that a voice frame phoneme detection sub-model to be trained carries out phoneme recognition, and a voice frame phoneme detection result obtained by the phoneme recognition is a predicted phoneme label given to each training voice frame; the standard recognition result of the training speech compared with the speech frame phoneme detection result is the real phoneme label of each training speech frame, and the second difference represents the difference degree between the speech frame phoneme detection result output by the speech frame phoneme detection submodel and the real phoneme label of the training speech frame.
The target model loss value may be the sum of the first model loss value and the second model loss value, or the sum of the first model loss value and the weighted second model loss value. For example, if the speech sequence recognition submodel is a Transducer and the speech frame phoneme detection submodel is a model whose loss function is the CE function, the target model loss value may be L = L_Transducer + α·L_CE, where α represents the weight given to the second model loss value.
The computer device inputs the training speech into the feature extraction submodel to be trained for feature extraction to obtain the training feature vectors corresponding to the training speech frames, and orders them according to the corresponding speech frame order to obtain the training vector sequence; it inputs the training vector sequence into the speech sequence recognition submodel to be trained for phoneme recognition to obtain the phoneme sequence recognition result, and obtains the first model loss value based on the first difference between the phoneme sequence recognition result and the standard recognition result corresponding to the training speech, the first model loss value being positively correlated with the first difference; it inputs each training feature vector into the speech frame phoneme detection submodel to be trained for phoneme recognition to obtain the corresponding speech frame phoneme detection result, and obtains the second model loss value based on the second difference between the speech frame phoneme detection result and the standard recognition result corresponding to the training speech, the second model loss value being positively correlated with the second difference; it then obtains the target model loss value from the first and second model loss values, adjusts the parameters of the three submodels to be trained based on the target model loss value, and forms the trained speech detection model from the parameter-adjusted feature extraction submodel, speech frame phoneme detection submodel and speech sequence recognition submodel.
In the above manner, the training vector sequence is used to train both the speech sequence recognition submodel and the speech frame phoneme detection submodel, each model loss value is determined from the difference between the corresponding prediction result and the standard recognition result, and the parameters of the submodels are adjusted according to the target model loss value determined from the first and second model loss values, improving the accuracy of model prediction.
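A sketch of the joint objective follows, assuming torchaudio's RNN-T loss for the sequence branch and cross entropy for the frame branch; the shapes, the blank index and α are illustrative assumptions rather than the patented configuration.

```python
import torch.nn.functional as F
import torchaudio

def target_model_loss(joint_logits, targets, logit_lens, target_lens,
                      frame_logits, frame_labels, alpha=0.5, blank=0):
    # First model loss value: sequence-level Transducer loss.
    # joint_logits: (B, T, U + 1, V); targets: (B, U) int32.
    l_transducer = torchaudio.functional.rnnt_loss(
        joint_logits, targets, logit_lens, target_lens, blank=blank)
    # Second model loss value: frame-level CE against per-frame phoneme labels.
    # frame_logits: (B, T, C) -> (B, C, T); frame_labels: (B, T) long.
    l_ce = F.cross_entropy(frame_logits.transpose(1, 2), frame_labels)
    return l_transducer + alpha * l_ce    # target model loss value
```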
In some embodiments, the step of inputting the feature vector sequence into the speech sequence recognition submodel for phoneme recognition to obtain the candidate phoneme likelihoods corresponding to each speech frame may specifically include: sequentially acquiring the current feature vector in the feature vector sequence according to the order of the sequence; acquiring the phoneme representation vector corresponding to the feature vector previous to the current feature vector; and inputting that phoneme representation vector and the current feature vector into the speech sequence recognition submodel for phoneme recognition to obtain the candidate phoneme likelihoods corresponding to the current speech frame corresponding to the current feature vector.
And the phoneme representation vector corresponding to the previous feature vector is obtained by inputting the previous feature vector into the speech sequence recognition submodel for phoneme recognition.
The feature extraction submodel may be constructed based on an Encoder, the speech frame phoneme detection submodel may be constructed based on a Linear layer, and the speech sequence recognition submodel may be constructed based on a Predictor and a Joint layer, yielding the model shown in fig. 6. The Encoder may be a recurrent neural network, such as an LSTM (Long Short-Term Memory), which accepts the audio feature input at time t and outputs an acoustic hidden-layer representation; it may also be an FSMN-based network. The Predictor may be a recurrent neural network, such as an LSTM, which accepts the non-empty output label of the model's previous step and outputs a text hidden-layer representation; the Predictor may also be a CNN (Convolutional Neural Network), improving computational efficiency. The Joint may be a fully connected neural network, such as a linear layer plus activation unit, which applies linear transformations to the acoustic representation and the text representation, sums them, outputs a hidden-unit representation, and finally converts it into a probability distribution through softmax.
The computer equipment sequentially obtains current feature vectors in the feature vector sequence according to the sequence of the feature vector sequence, and then obtains phoneme expression vectors corresponding to previous feature vectors corresponding to the current feature vectors; and inputting the phoneme expression vector corresponding to the previous feature vector and the current feature vector into a speech sequence recognition sub-model for phoneme recognition to obtain the candidate phoneme possibility corresponding to the current speech frame corresponding to the current feature vector.
In the above manner, the phoneme representation vector corresponding to the previous feature vector and the current feature vector are used for phoneme recognition, so that the prediction accuracy of the candidate phoneme possibility corresponding to the current speech frame corresponding to the current feature vector can be improved.
In some embodiments, the step of obtaining the phoneme representation vector corresponding to the previous feature vector comprises: inputting the previous feature vector into a speech sequence recognition sub-model for phoneme recognition to obtain candidate phoneme likeliness corresponding to the previous feature vector; selecting the candidate phoneme with the maximum corresponding possibility from the candidate phoneme set as a target phoneme; and taking the phoneme representation vector corresponding to the target phoneme as the phoneme representation vector corresponding to the previous feature vector.
Here, the candidate phoneme likelihoods corresponding to the previous feature vector include the likelihood corresponding to each candidate phoneme in the candidate phoneme set, and the phoneme representation vector corresponding to the target phoneme accumulates the phoneme information of multiple preceding feature vectors, as shown in fig. 6.
After inputting the previous feature vector into the speech sequence recognition submodel for phoneme recognition, the computer device takes the candidate phoneme possibility output by the speech sequence recognition submodel as the candidate phoneme possibility of the previous feature vector, selects the candidate phoneme with the maximum possibility from the candidate phoneme set as the target phoneme, and takes the phoneme expression vector corresponding to the target phoneme as the phoneme expression vector corresponding to the previous feature vector.
In the above manner, the candidate phoneme with the highest probability in the candidate phoneme set is used as the target phoneme, and the phoneme expression vector corresponding to the previous feature vector is obtained, so that the prediction accuracy of the current feature vector is improved.
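The step-wise recognition described above can be sketched as a greedy loop, in which the most likely candidate becomes the target phoneme and, when non-blank, refreshes the phoneme representation fed to the next step. `predictor`, `joint` and `embed` are hypothetical callables standing in for the submodel's components.

```python
import torch

@torch.no_grad()
def stepwise_recognition(encoder_outs, predictor, joint, embed, blank_id=0):
    prev_vec = embed(torch.tensor([blank_id]))      # initial phoneme representation vector
    outputs = []
    for enc_t in encoder_outs:                      # current feature vectors, in sequence order
        logits = joint(enc_t, predictor(prev_vec))  # candidate phoneme likelihoods
        target = int(logits.argmax())               # most likely candidate -> target phoneme
        outputs.append(target)
        if target != blank_id:                      # a blank keeps the previous representation
            prev_vec = embed(torch.tensor([target]))
    return outputs
```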
In some embodiments, after inputting the phoneme representation vector corresponding to the previous feature vector and the current feature vector into the speech sequence recognition submodel for phoneme recognition, and obtaining the candidate phoneme likelihood corresponding to the current speech frame corresponding to the current feature vector, the computer device may further perform the following steps: and when the phoneme recognition of the feature vectors in the feature vector sequence is not finished, returning to the step of sequentially acquiring the current feature vectors in the feature vector sequence according to the sequence of the feature vector sequence so as to perform phoneme recognition on the feature vectors which are not subjected to the phoneme recognition until the phoneme recognition of the feature vectors in the feature vector sequence is finished.
If some feature vectors in the feature vector sequence have not been input into the speech sequence recognition submodel for phoneme recognition, phoneme recognition of the feature vector sequence is not finished; in that case, a feature vector not yet subjected to phoneme recognition can be taken as the current feature vector and processed in the manner described above, until all feature vectors in the feature vector sequence have undergone phoneme recognition.
In the above manner, if the feature vectors in the feature vector sequence are not subjected to phoneme recognition, the feature vectors which are not subjected to phoneme recognition are used as the current feature vectors to perform phoneme recognition, so that the feature vectors in the feature vector sequence are guaranteed to be recognized completely, omission is avoided, and accuracy of forced speech alignment is improved.
In some embodiments, the step of obtaining the speech detection result corresponding to the speech to be detected based on extracting the phoneme likelihood distribution corresponding to the speech frame may specifically include: obtaining the detection word position of the target detection word in the voice to be detected based on the phoneme probability distribution corresponding to the extracted voice frame; acquiring the voice located at the backward position corresponding to the position of the detection word in the voice to be detected as instruction detection voice; and carrying out voice instruction detection on the instruction detection voice to obtain a target voice instruction corresponding to the instruction detection voice so as to carry out voice control based on the target voice instruction.
The target detection word can be a wake-up word, and the computer device can acquire voice corresponding to the wake-up word sent by the user, detect the voice instruction and determine the voice instruction sent by the user. The target voice instruction may be a voice manipulation instruction such as "turn on air conditioner" and "send message" or the like.
The detection word position is the position of the target detection word in the speech, and can be represented as the time correspondence between the phoneme sequence of the target detection word and the speech frame sequence. The instruction detection speech is a speech to be detected in which a target speech instruction may exist.
After the computer device determines the position of the target detection word in the speech by means of the frame-by-frame decoding search, it takes the speech at the backward position after the detection word position as the instruction detection speech and determines whether the instruction detection speech includes a target voice instruction; if the computer device determines that the instruction detection speech includes a target voice instruction, it performs voice control according to that instruction, such as turning on an air conditioner.
In the above manner, the speech at the backward position corresponding to the position of the target detection word in the speech to be detected is used as the instruction detection speech, which avoids detecting other, invalid speech and reduces the device's energy consumption.
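A small sketch of this step, where `detect_command` is a hypothetical downstream recognizer and frames are addressed by index:

```python
# Sketch: the speech after the detection word position becomes the
# instruction detection speech; only it is searched for a voice instruction.
def handle_wakeup(frames, keyword_end_frame, detect_command):
    command_speech = frames[keyword_end_frame + 1:]   # backward position after the keyword
    instruction = detect_command(command_speech)      # e.g., "turn on air conditioner"
    return instruction                                # used for voice control, or None
```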
The application also provides a voice detection method, which mainly comprises the following steps:
step S1, inputting each voice frame in the voice to be detected into the feature extraction submodel of the trained voice detection model for feature extraction to obtain a feature vector corresponding to the voice frame;
step S2, ordering the feature vectors corresponding to each speech frame according to the corresponding speech frame order to obtain a feature vector sequence;
Step S3, inputting the characteristic vector sequence into a speech sequence recognition submodel for phoneme recognition to obtain candidate phoneme likeliness corresponding to each speech frame;
step S4, taking speech frames whose likelihood corresponding to the phoneme blank is greater than the likelihood threshold as phoneme-blank speech frames;
step S5, filtering the speech frame of phoneme blank to obtain a decoded speech frame, and acquiring a speech decoding network formed based on the candidate phoneme possibility corresponding to the decoded speech frame;
step S6, decoding based on the voice decoding network to obtain a target decoding path, and arranging the phonemes passed by the target decoding path according to the path sequence to obtain a recognition phoneme sequence corresponding to the voice to be detected;
step S7, using the phoneme sequence corresponding to the target detection word selected from the recognized phoneme sequence as the detection word phoneme sequence;
step S8, determining the start position and the end position of the speech frame sequence corresponding to the detection word phoneme sequence in the speech to be detected;
step S9, taking the number of phoneme-blank speech frames in the speech to be detected as the blank frame number;
step S10, determining the speech frame extension number based on the blank frame number, the speech frame extension number being positively correlated with the blank frame number;
step S11, taking the forward position at a distance of the speech frame extension number from the start position as the first extraction position, and the backward position at a distance of the speech frame extension number from the end position as the second extraction position;
step S12, taking a voice frame sequence between a first extraction position and a second extraction position in the voice to be detected as an extraction voice frame sequence corresponding to the target detection word;
step S13, extracting the feature vector corresponding to the extracted voice frame from the feature vector sequence, and inputting the feature vector corresponding to the extracted voice frame into the voice frame phoneme detection submodel for phoneme recognition;
step S14, taking the likelihood distribution over phoneme classes output by the speech frame phoneme detection submodel for each extracted speech frame as the phoneme likelihood distribution corresponding to that extracted speech frame;
step S15, obtaining the detection word position of the target detection word in the voice to be detected based on the phoneme probability distribution corresponding to the extracted voice frame;
step S16, acquiring the voice located at the backward position corresponding to the position of the detection word in the voice to be detected as the instruction detection voice;
step S17, performing voice command detection on the command detection voice to obtain a target voice command corresponding to the command detection voice, so as to perform voice control based on the target voice command.
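For orientation only, steps S1 to S17 can be condensed into the following sketch; every component here (`model`, `decoder`, the blank threshold, the extension number) is a hypothetical placeholder standing in for the submodels and procedures described above, not an implementation of them.

```python
def detect_keyword(speech_frames, model, decoder, keyword_phones, ext=10):
    feats = [model.extract(f) for f in speech_frames]                 # S1-S2
    cand = [model.seq_recognize(v) for v in feats]                    # S3
    kept = [i for i, c in enumerate(cand) if c.blank_prob <= 0.9]     # S4-S5
    phones = decoder.decode([cand[i] for i in kept])                  # S6
    span = decoder.locate(phones, keyword_phones)                     # S7-S8
    if span is None:
        return None                                                   # keyword absent
    first = max(0, span.start - ext)                                  # S9-S11
    second = min(len(feats) - 1, span.end + ext)
    frame_post = [model.frame_detect(feats[i])                        # S12-S14
                  for i in range(first, second + 1)]
    return decoder.align(frame_post, keyword_phones)                  # S15; S16-S17 follow
```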
In this embodiment, since the phoneme-blank speech frames are filtered out based on the candidate phoneme likelihoods corresponding to the speech frames and the recognition phoneme sequence is obtained based on the candidate phoneme likelihoods of the speech frames in which phonemes exist, no subsequent processing of the phoneme-blank speech frames is needed, which improves the efficiency of obtaining the recognition phoneme sequence and, in turn, of obtaining the phoneme sequence corresponding to the target detection word from it. Then, the extracted speech frame sequence corresponding to the target detection word is determined based on the position of the speech frame sequence corresponding to the detection word phoneme sequence within the speech frame sequence of the speech to be detected, so phoneme detection need not subsequently be performed on all speech frames, which further improves efficiency. Moreover, since frame-by-frame phoneme detection is performed on the extracted speech frame sequence, the speech detection result obtained from the phoneme likelihood distribution of each extracted speech frame can accurately reflect the position of the keyword in the speech, which improves accuracy.
In order to better understand the above method, an application example of the speech detection method of the present application is described in detail below. The speech sequence recognition submodel of this application example is a Transducer model, and the system belongs to the class of keyword (i.e., target detection word) detection systems based on the Transducer model. In this application example, the model training of the keyword detection system adopts a multi-task learning (MTL) strategy, that is, the loss value of the Transducer model's loss function (Transducer Loss) and the loss value of the CE loss function are minimized simultaneously, balancing accuracy and efficiency while ensuring the accuracy and real-time performance of the keyword detection system.
The application example mainly comprises the following steps:
(1) an MTL strategy is adopted to optimize Transducer Loss and CE Loss simultaneously, realizing both sequence-to-sequence modeling and frame-level correspondence modeling;
(2) the linear search graph is combined with the frame-by-frame phoneme posterior probabilities to efficiently complete the alignment of speech and text and extract the position of the keyword.
The scheme provided by the application example has the following characteristics:
(1) a modeling unit of larger granularity, namely the phoneme, is adopted, so that the output of the acoustic model has actual pronunciation significance;
(2) because the output of the Transducer has the concept of a "null output", speech frames whose output is null directly skip decoding, improving decoding efficiency;
(3) the training process is simple: forced alignment is optional, and Transducer training does not require the temporal correspondence between speech and text.
The application example can be applied to a vehicle-mounted offline custom voice wake-up system. Custom wake-up plays an important role in the vehicle-mounted voice assistant and places higher demands on the accuracy of keyword detection. Through custom wake-up, quick tasks and control instructions in the voice assistant, such as sending and cancelling messages and controlling vehicle-mounted devices, can be realized rapidly. The overall flow of the application example is shown in fig. 7 and comprises the following steps:
step S702, the vehicle-mounted voice awakening system collects voice (namely audio) in the vehicle in real time;
step S704, the vehicle-mounted voice awakening system carries out framing processing on the voice to obtain a plurality of voice frames and extracts the characteristics of each voice frame;
step S706, the vehicle-mounted voice awakening system performs acoustic calculation on each voice frame by using the Transducer to obtain an acoustic score output by the Transducer, wherein the acoustic score comprises the probability that the voice frame is a phoneme blank and the probability that various phonemes exist in the voice frame.
In step S708, the in-vehicle speech wake-up system constructs a decoding graph (corresponding to a speech decoding network) based on the acoustic scores of the speech frames with non-null phonemes, and performs non-frame-by-frame decoding search.
Step S710, the vehicle-mounted voice awakening system determines whether a keyword exists in the voice based on the non-frame-by-frame decoding search result so as to perform awakening judgment; if the keyword exists in the voice, the voice is awakened and the step S712 is performed; if the keyword does not exist in the voice, the method does not wake up and returns to the step S702 for voice collection.
Step S712, after the vehicle-mounted voice wake-up system determines that there is a keyword in the voice, the acoustic scores of a plurality of continuous voice frames are predicted by using the model constructed by the CE loss function, and a decoding graph is constructed based on the acoustic scores of the plurality of continuous voice frames to perform frame-by-frame decoding search.
Regarding the Transducer model, the input sequence is x = (x1, x2, ..., xT) ∈ X* and the output sequence is y = (y1, y2, ..., yU) ∈ Y*, where X* and Y* are the sets of input and output sequences respectively, xt ∈ X and yu ∈ Y are real vectors, and X and Y represent the input and output spaces respectively. For example, in this application example the modeling unit of the Transducer model is the phoneme: the input sequence x is a feature vector sequence (e.g., FBank features, MFCC features) with xt the feature vector at time t, and the output sequence y is a phoneme sequence with yu the phoneme of the u-th step.
The final goal of Transducer model training is to maximize the objective function P(y|x), i.e., given the input and output, the sum of the probabilities of all possible alignment paths is maximized. The loss function of Transducer model training is L_Transducer = -P(y|x).
As described above, the custom wake-up task is to detect whether a wake-up word exists in the speech and, on that basis, to accurately obtain the positions where the wake-up word begins and ends in the speech. Because the Transducer model has an output-delay problem, the timestamps of the phonemes output by the network do not correspond to their actual positions in the speech, and the delay is random, so using the Transducer model's output for forced alignment and speech extraction produces large errors. To accurately extract the keyword position, this application example adopts the model structure shown in fig. 6, in which a linear layer is added after the Encoder output to predict the posterior probability of each phoneme for the current frame.
The model of this application example can be regarded as being built on multi-task learning, i.e., learning frame-by-frame phoneme prediction and sequence-level phoneme prediction simultaneously. The submodel mapped by the linear layer part is trained with the frame-level CE loss function L_CE. Finally, the model is trained by interpolating the two loss functions: L = L_Transducer + α·L_CE.
Overall, the decoding and forced alignment of this application example mainly comprise three parts: wake-up registration, wake-up decision and wake-up alignment, each described in detail below. Here, the wake-up word is the keyword.
First, wake-up registration:
Viterbi decoding search usually takes a WFST (Weighted Finite State Transducer) and acoustic probabilities as input, and performs Viterbi search and beam pruning on the WFST frame by frame according to the acoustic probabilities, so as to quickly find the path with the best combined acoustic and linguistic probability, called BEST-LATTICE.
In this application example, the wake-up engine adopts a registration mode to achieve rapid switching of wake-up words, and divides the ROOT-WFST required for wake-up into two parts: GARBAGE (the GA shown in fig. 8) and KEYWORDS. The GARBAGE part absorbs the speech frames of non-wake-up words, and the KEYWORDS part absorbs the speech frames of wake-up words. During WFST training it is difficult to determine what the wake-up words will be, so a $KEYWORDS slot is left in the ROOT-WFST for registering wake-up words, and the KEYWORDS-WFST represented by the wake-up words is inserted into the overall ROOT-WFST, as shown in FIG. 8.
In addition, when a wake-up word is registered, in order to distinguish wake-up words from non-wake-up words, the portion of the output word table that is greater than or equal to KEYWORD_START is taken as wake-up word output, and the portion smaller than KEYWORD_START is taken as GARBAGE output.
Second, wake-up decision:
The output of the Transducer is the probability that each class of phoneme exists in a speech frame and the probability that the speech frame is a phoneme blank. This application example inputs the probability distributions of the phoneme-non-empty speech frames into Viterbi search decoding, obtaining the result sequence optimal in acoustic and language scores, namely BEST-LATTICE. In the LATTICE sequence, the element at each position is composed of ILABEL (input symbol), OLABEL (output symbol) and WEIGHT, where ILABEL represents the corresponding phoneme, OLABEL represents the decoding result, and WEIGHT represents the overall cost of composing the sequence.
Using the optimal LATTICE, it is determined whether a portion with OLABEL >= KEYWORD_START appears; if so, that portion is extracted, giving the wake-up word result.
In practice, the confidence of the wake-up result also needs to be judged, that is, whether the ratio of the acoustic probability of the wake-up word portion to the optimal acoustic probability meets a set threshold.
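Sketched, with an assumed threshold value:

```python
# Sketch of the wake-up confidence check: the keyword portion's acoustic
# probability relative to the optimal acoustic probability must reach a
# set threshold (0.8 here is an illustrative value, not from the text).
def confidence_ok(keyword_acoustic_prob, best_acoustic_prob, threshold=0.8):
    return keyword_acoustic_prob / best_acoustic_prob >= threshold
```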
Third, wake-up alignment:
In voice wake-up, accurate alignment information of the wake-up word is needed in order to cut the speech, but the large and uncertain delay in the Transducer's output cannot meet the alignment requirement. Therefore, this application example uses the frame-level CE output, which can output the optimal acoustic sequence with high probability; and because the CE output is frame-aligned, in the ideal case the alignment information it outputs is controlled within the range of a frame's duration.
To balance accuracy and alignment precision, upon wake-up an ALIGNMENT-WFST with SIL (SIL denotes silence) and self-loops added is constructed from the wake-up BEST-LATTICE; the CE model's phoneme probability distribution sequence over the consecutive speech frames and the ALIGNMENT-WFST are input into Viterbi decoding, which also yields a wake-up LATTICE, and the accurate wake-up word alignment information is obtained by the wake-up word extraction approach mentioned in the second part.
The ALIGNMENT-WFST is a WFST created from the BEST-LATTICE. As shown in fig. 9, the BEST-LATTICE is divided into two parts, GARBAGE (GA) and KEYWORDS. GARBAGE forms several GA GROUPs according to pronunciation relationships, and KEYWORDS also forms a GROUP; the GROUPs are connected by self-looping SIL, and within GA, self-loop operations also absorb identical pronunciations. The KEYWORDS part is likewise divided into several GROUPs connected by self-looping SIL, with self-loops inside each GROUP to absorb identical pronunciations. The dotted self-loops in fig. 9 indicate that the elements in the GROUP self-loop.
After model training is completed, libtorch is used for quantization and deployment of the model. In the Android version of libtorch, QNNPACK (Quantized Neural Networks PACKage, a neural network acceleration library) is used to perform INT8 matrix computation, which greatly accelerates matrix calculation. The model is trained with PyTorch and then quantized after training, i.e., the model parameters are quantized to INT8, and INT8 matrix multiplication is used to accelerate computation. The quantized model is exported for forward inference in the C++ environment, as shown in fig. 10.
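One plausible reading of this workflow, sketched with PyTorch's dynamic quantization and TorchScript export; the stand-in model and file name are assumptions, and the QNNPACK engine applies to ARM/Android builds (x86 builds typically use "fbgemm").

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 101))  # stand-in for the trained model
model.eval()

torch.backends.quantized.engine = "qnnpack"   # INT8 backend on Android/ARM builds
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)    # quantize parameters to INT8

scripted = torch.jit.script(qmodel)           # export for forward inference
scripted.save("kws_int8.pt")                  # loaded by libtorch in the C++ environment
```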
Compared with a traditional voice wake-up system, the voice wake-up system provided by this application example has advantages in accuracy and CPU (Central Processing Unit) occupancy:
(1) in terms of accuracy, in a noisy environment, the scheme provided by the application example greatly exceeds the existing model;
(2) in the aspect of CPU occupation, the quantification and deployment mode adopted by the application example also reduces the utilization rate of the CPU.
Table 1 shows the comparison results of the wake-up rates (%) of the DNN-HMM-based voice wake-up system and the improved Transducer model-based voice wake-up system provided in this application example:
Model        Test set 1    Test set 2    Test set 3
DNN-HMM      91.08         84.44         82.52
Transducer   96.49         96.33         95

TABLE 1
Table 2 shows the CPU occupancy rate comparison results of the DNN-HMM-based voice wake-up system and the improved Transducer model-based voice wake-up system provided by the present application example:
Model        CPU occupancy (peak)
DNN-HMM      21.31%
Transducer   19.94%

TABLE 2
It should be understood that, although the steps in the flowcharts of fig. 1(b) to 10 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in fig. 1(b) to 10 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
In some embodiments, as shown in fig. 11, there is provided a voice detection apparatus including:
the feature vector processing module 1102 is configured to perform feature extraction on each speech frame in the speech to be detected to obtain a feature vector corresponding to the speech frame, and sort the feature vectors corresponding to the speech frames according to a corresponding speech frame sequence to obtain a feature vector sequence;
a candidate phoneme likelihood obtaining module 1104, configured to perform phoneme recognition based on the feature vector sequence to obtain candidate phoneme likelihoods corresponding to the speech frames;
a phoneme sequence obtaining module 1106, configured to filter out phoneme-blank speech frames based on the candidate phoneme likelihoods corresponding to the speech frames, and obtain the recognition phoneme sequence corresponding to the speech to be detected based on the candidate phoneme likelihoods corresponding to the speech frames in which phonemes exist;
a speech frame extraction module 1108, configured to use a phoneme sequence corresponding to a target detection word selected from the recognition phoneme sequence as a detection word phoneme sequence, and determine, based on a position of a speech frame sequence corresponding to the detection word phoneme sequence in a speech frame sequence of the speech to be detected, an extracted speech frame sequence corresponding to the target detection word in the speech to be detected;
a phoneme likelihood distribution obtaining module 1110, configured to obtain a phoneme likelihood distribution corresponding to each extracted speech frame in the extracted speech frame sequence, the phoneme likelihood distribution corresponding to an extracted speech frame being obtained by performing speech frame phoneme detection separately according to the feature vector corresponding to that extracted speech frame;
and a speech detection result obtaining module 1112, configured to obtain a speech detection result corresponding to the speech to be detected based on the phoneme likelihood distributions corresponding to the extracted speech frames.
In some embodiments, the speech frame extraction module 1108 is further configured to determine the start position and the end position of the speech frame sequence corresponding to the detection word phoneme sequence in the speech to be detected; take a forward position corresponding to the start position as a first extraction position, and a backward position corresponding to the end position as a second extraction position; and take the speech frame sequence between the first extraction position and the second extraction position in the speech to be detected as the extracted speech frame sequence corresponding to the target detection word.
In some embodiments, the speech frame extraction module 1108 is further configured to take the number of speech frames whose phonemes are blank in the speech to be detected as a blank frame count; determine a speech frame extension count based on the blank frame count, the speech frame extension count being positively correlated with the blank frame count; and take the forward position whose distance from the start position equals the speech frame extension count as the first extraction position, and the backward position whose distance from the end position equals the speech frame extension count as the second extraction position, as sketched below.
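A minimal sketch of this window computation follows; the mapping from blank frame count to extension count is not specified above, so the proportional `scale` factor is a hypothetical choice that merely preserves the stated positive correlation.

```python
def extraction_window(start, end, blank_frame_count, total_frames, scale=0.5):
    """Extend the detected keyword span on both sides; the extension grows
    with the blank frame count (positive correlation)."""
    extension = int(blank_frame_count * scale)
    first = max(0, start - extension)                # forward of the start position
    second = min(total_frames - 1, end + extension)  # backward of the end position
    return first, second

# Example: a keyword spanning frames 120..180 in a 400-frame utterance with
# 30 blank frames is widened by 15 frames on each side.
print(extraction_window(120, 180, 30, 400))  # (105, 195)
```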
In some embodiments, the candidate phoneme likelihoods include a likelihood corresponding to the phoneme blank; the phoneme sequence obtaining module 1106 is further configured to take a speech frame whose likelihood corresponding to the phoneme blank is greater than a likelihood threshold as a phoneme-blank speech frame; filter out the phoneme-blank speech frames to obtain decoded speech frames, and acquire a speech decoding network formed based on the candidate phoneme likelihoods corresponding to the decoded speech frames; and decode based on the speech decoding network to obtain a target decoding path, arranging the phonemes traversed by the target decoding path in path order to obtain the recognition phoneme sequence corresponding to the speech to be detected.
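A minimal sketch of the blank-filtering step, assuming the per-frame candidate phoneme likelihoods are arranged as a (frames, phonemes) tensor with the blank at index 0; the blank index and the 0.5 threshold are assumed values:

```python
import torch

def filter_blank_frames(candidate_likelihoods, blank_index=0, threshold=0.5):
    """Drop frames whose blank likelihood exceeds the threshold; return the
    surviving (decoded) frames and their original indices, which later
    decoding needs in order to locate the keyword in the original speech."""
    keep = candidate_likelihoods[:, blank_index] <= threshold
    decoded = candidate_likelihoods[keep]
    kept_indices = keep.nonzero(as_tuple=True)[0]
    return decoded, kept_indices
```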
In some embodiments, the feature vector processing module 1102 is further configured to input each speech frame in the speech to be detected into a feature extraction submodel of a trained speech detection model for feature extraction to obtain the feature vector corresponding to the speech frame; the trained speech detection model comprises the feature extraction submodel, a speech sequence recognition submodel, and a speech frame phoneme detection submodel, the speech sequence recognition submodel and the speech frame phoneme detection submodel each being connected to the feature extraction submodel. The candidate phoneme likelihood obtaining module is further configured to input the feature vector sequence into the speech sequence recognition submodel for phoneme recognition to obtain the candidate phoneme likelihood corresponding to each speech frame. The phoneme likelihood distribution obtaining module is further configured to extract the feature vectors corresponding to the extracted speech frames from the feature vector sequence, input them into the speech frame phoneme detection submodel for phoneme recognition, and take the likelihood distribution over phoneme classes output by the speech frame phoneme detection submodel for each extracted speech frame as the phoneme likelihood distribution corresponding to that extracted speech frame.
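A minimal sketch of this three-submodel layout; all layer types and sizes are assumptions, and the sequence branch is shown non-autoregressively for brevity, whereas the text describes feeding back the previous phoneme's representation vector:

```python
import torch
import torch.nn as nn

class SpeechDetectionModel(nn.Module):
    """One feature extraction submodel with two heads attached to it:
    sequence recognition and per-frame phoneme detection (assumed layers)."""
    def __init__(self, feat_dim=40, hidden=128, num_phonemes=100):
        super().__init__()
        self.feature_extraction = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.sequence_recognition = nn.Linear(hidden, num_phonemes + 1)  # +1 for blank
        self.frame_phoneme_detection = nn.Linear(hidden, num_phonemes)

    def forward(self, frames):
        feats, _ = self.feature_extraction(frames)       # feature vector sequence
        seq_out = self.sequence_recognition(feats)       # candidate phoneme likelihoods
        frame_out = self.frame_phoneme_detection(feats)  # per-frame likelihood distributions
        return seq_out, frame_out
```

In deployment only the extracted speech frames would be routed through `frame_phoneme_detection`, which is what keeps the second pass cheap.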
In some embodiments, the apparatus further comprises a speech detection model training module configured to acquire training speech; input the training speech into a feature extraction submodel to be trained for feature extraction to obtain the training feature vector corresponding to each training speech frame, and sort the training feature vectors corresponding to the training speech frames according to the order of the corresponding speech frames to obtain a training vector sequence; input the training vector sequence into a speech sequence recognition submodel to be trained for phoneme recognition to obtain a phoneme sequence recognition result; obtain a first model loss value based on a first difference between the phoneme sequence recognition result and the standard recognition result corresponding to the training speech, the first model loss value being positively correlated with the first difference; input each training feature vector into a speech frame phoneme detection submodel to be trained for phoneme recognition to obtain the speech frame phoneme detection result corresponding to the speech frame phoneme detection submodel; obtain a second model loss value based on a second difference between the speech frame phoneme detection result and the standard recognition result corresponding to the training speech, the second model loss value being positively correlated with the second difference; derive a target model loss value based on the first model loss value and the second model loss value (a sketch of one such combination follows); and adjust the parameters of the feature extraction submodel to be trained, the speech frame phoneme detection submodel to be trained, and the speech sequence recognition submodel to be trained based on the target model loss value, the feature extraction submodel, speech frame phoneme detection submodel, and speech sequence recognition submodel after parameter adjustment forming the trained speech detection model.
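The text above does not name the two loss functions or how they are combined, so the following sketch assumes CTC loss for the sequence branch, cross-entropy for the frame branch, and a weighted sum with a hypothetical factor `alpha`:

```python
import torch.nn.functional as F

def target_model_loss(seq_log_probs, seq_targets, input_lens, target_lens,
                      frame_logits, frame_targets, alpha=0.5):
    """Combine the two branch losses into the target model loss value.

    seq_log_probs: (time, batch, classes) log-softmax output of the
    sequence branch; frame_logits: (frames, classes) frame-branch output.
    """
    # First loss: grows with the difference between the recognized phoneme
    # sequence and the standard recognition result.
    first_loss = F.ctc_loss(seq_log_probs, seq_targets, input_lens, target_lens)
    # Second loss: grows with the per-frame difference between the frame
    # phoneme detection result and the frame-level labels.
    second_loss = F.cross_entropy(frame_logits, frame_targets)
    return alpha * first_loss + (1.0 - alpha) * second_loss
```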
In some embodiments, the candidate phoneme likelihood obtaining module 1104 is further configured to sequentially acquire the current feature vector in the feature vector sequence according to the order of the feature vector sequence; acquire the phoneme representation vector corresponding to the previous feature vector relative to the current feature vector, the phoneme representation vector corresponding to the previous feature vector being the representation vector of the phoneme obtained by inputting the previous feature vector into the speech sequence recognition submodel for phoneme recognition; and input the phoneme representation vector corresponding to the previous feature vector together with the current feature vector into the speech sequence recognition submodel for phoneme recognition to obtain the candidate phoneme likelihood corresponding to the current speech frame corresponding to the current feature vector.
In some embodiments, the apparatus further includes a phoneme representation vector obtaining module, configured to input the previous feature vector into the speech sequence recognition submodel for phoneme recognition to obtain the candidate phoneme likelihoods corresponding to the previous feature vector, the candidate phoneme likelihoods comprising a likelihood for each candidate phoneme in the candidate phoneme set; select the candidate phoneme with the greatest likelihood from the candidate phoneme set as the target phoneme; and take the phoneme representation vector corresponding to the target phoneme as the phoneme representation vector corresponding to the previous feature vector, as sketched below.
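A minimal sketch of this recurrent step; `recognizer` and `phoneme_embedding` stand in for the speech sequence recognition submodel and its phoneme embedding table, both assumed interfaces:

```python
import torch

def recognition_step(recognizer, phoneme_embedding, prev_likelihoods, current_feat):
    """One autoregressive phoneme-recognition step."""
    # Target phoneme: the candidate with the greatest likelihood from the
    # previous step; its representation vector is fed back into the model.
    target_phoneme = prev_likelihoods.argmax(dim=-1)
    prev_repr = phoneme_embedding(target_phoneme)
    # Recognize the current frame from the previous phoneme representation
    # plus the current feature vector.
    return recognizer(torch.cat([prev_repr, current_feat], dim=-1))
```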
In some embodiments, the apparatus further includes a phoneme recognition continuation module, configured to, when phoneme recognition has not been completed for all feature vectors in the feature vector sequence, return to the step of sequentially acquiring the current feature vector in the feature vector sequence according to the order of the feature vector sequence, so as to perform phoneme recognition on the feature vectors that have not yet undergone phoneme recognition, until phoneme recognition is completed for all feature vectors in the feature vector sequence.
In some embodiments, the speech detection result obtaining module 1112 is further configured to obtain the detection word position of the target detection word in the speech to be detected based on the phoneme likelihood distributions corresponding to the extracted speech frames; acquire the speech at the backward position relative to the detection word position in the speech to be detected as instruction detection speech; and perform voice instruction detection on the instruction detection speech to obtain the target voice instruction corresponding to the instruction detection speech, so as to perform voice control based on the target voice instruction.
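A minimal sketch of this step, assuming frame-level access to the speech; the fixed command window length is a hypothetical parameter, not given above:

```python
def instruction_detection_speech(frames, detection_word_end, window=300):
    """Take the speech immediately following the detected word position as
    the instruction detection speech (`window` is an assumed frame count)."""
    return frames[detection_word_end + 1 : detection_word_end + 1 + window]
```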
For specific limitations of the voice detection apparatus, reference may be made to the limitations of the voice detection method above, which are not repeated here. Each module in the voice detection apparatus may be implemented wholly or partially by software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory in the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In some embodiments, a computer device is provided, the internal structure of which may be as shown in fig. 12. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store voice detection data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech detection method.
Those skilled in the art will appreciate that the structure shown in fig. 12 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In some embodiments, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the above-described method embodiments when executing the computer program.
In some embodiments, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the respective method embodiments described above.
In some embodiments, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of the various method embodiments described above.
It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as Static Random Access Memory (SRAM) and Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination that contains no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the application. It should be noted that those skilled in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A method for speech detection, the method comprising:
extracting features of each speech frame in the speech to be detected to obtain a feature vector corresponding to each speech frame, and sorting the feature vectors corresponding to the speech frames according to the order of the corresponding speech frames to obtain a feature vector sequence;
performing phoneme recognition based on the feature vector sequence to obtain a candidate phoneme likelihood corresponding to each speech frame;
filtering out speech frames whose phonemes are blank based on the candidate phoneme likelihoods corresponding to the speech frames, and obtaining a recognition phoneme sequence corresponding to the speech to be detected based on the candidate phoneme likelihoods corresponding to the speech frames whose phonemes are not blank;
using a phoneme sequence corresponding to a target detection word selected from the recognition phoneme sequence as a detection word phoneme sequence, and determining, based on the position of the speech frame sequence corresponding to the detection word phoneme sequence in the speech frame sequence of the speech to be detected, an extracted speech frame sequence corresponding to the target detection word in the speech to be detected;
acquiring a phoneme likelihood distribution corresponding to each extracted speech frame in the extracted speech frame sequence, the phoneme likelihood distribution corresponding to an extracted speech frame being obtained by performing speech frame phoneme detection separately according to the feature vector corresponding to that extracted speech frame;
and obtaining a speech detection result corresponding to the speech to be detected based on the phoneme likelihood distributions corresponding to the extracted speech frames.
2. The method according to claim 1, wherein the determining the extracted speech frame sequence corresponding to the target detection word in the speech to be detected based on a position of the speech frame sequence corresponding to the detection word phoneme sequence in the speech frame sequence of the speech to be detected comprises:
determining the start position and the end position of the speech frame sequence corresponding to the detection word phoneme sequence in the speech to be detected;
taking a forward position corresponding to the start position as a first extraction position, and a backward position corresponding to the end position as a second extraction position;
and taking the speech frame sequence between the first extraction position and the second extraction position in the speech to be detected as the extracted speech frame sequence corresponding to the target detection word.
3. The method according to claim 2, wherein the taking a forward position corresponding to the start position as a first extraction position and a backward position corresponding to the end position as a second extraction position comprises:
taking the number of speech frames whose phonemes are blank in the speech to be detected as a blank frame count;
determining a speech frame extension count based on the blank frame count, the speech frame extension count being positively correlated with the blank frame count;
and taking the forward position whose distance from the start position equals the speech frame extension count as the first extraction position, and the backward position whose distance from the end position equals the speech frame extension count as the second extraction position.
4. The method according to claim 1, wherein the candidate phoneme likelihoods comprise a likelihood corresponding to the phoneme blank; and the filtering out speech frames whose phonemes are blank based on the candidate phoneme likelihoods corresponding to the speech frames and obtaining a recognition phoneme sequence corresponding to the speech to be detected based on the candidate phoneme likelihoods corresponding to the speech frames whose phonemes are not blank comprises:
taking a speech frame whose likelihood corresponding to the phoneme blank is greater than a likelihood threshold as a phoneme-blank speech frame;
filtering out the phoneme-blank speech frames to obtain decoded speech frames, and acquiring a speech decoding network formed based on the candidate phoneme likelihoods corresponding to the decoded speech frames;
and decoding based on the speech decoding network to obtain a target decoding path, and arranging the phonemes traversed by the target decoding path in path order to obtain the recognition phoneme sequence corresponding to the speech to be detected.
5. The method according to claim 1, wherein the extracting features of each speech frame in the speech to be detected to obtain the feature vector corresponding to the speech frame comprises:
inputting each speech frame in the speech to be detected into a feature extraction submodel of a trained speech detection model for feature extraction to obtain the feature vector corresponding to the speech frame; the trained speech detection model comprises the feature extraction submodel, a speech sequence recognition submodel, and a speech frame phoneme detection submodel, the speech sequence recognition submodel and the speech frame phoneme detection submodel each being connected to the feature extraction submodel;
the performing phoneme recognition based on the feature vector sequence to obtain candidate phoneme likelihoods corresponding to the speech frames includes:
inputting the feature vector sequence into the speech sequence recognition submodel for phoneme recognition to obtain the candidate phoneme likelihood corresponding to each speech frame;
the obtaining of the phoneme likelihood distribution corresponding to each extracted speech frame in the extracted speech frame sequence includes:
extracting the feature vectors corresponding to the extracted speech frames from the feature vector sequence, and inputting the feature vectors corresponding to the extracted speech frames into the speech frame phoneme detection submodel for phoneme recognition;
and taking the likelihood distribution over phoneme classes output by the speech frame phoneme detection submodel for each extracted speech frame as the phoneme likelihood distribution corresponding to that extracted speech frame.
6. The method of claim 5, wherein the step of obtaining the trained speech detection model comprises:
acquiring training speech;
inputting the training speech into a feature extraction submodel to be trained for feature extraction to obtain the training feature vector corresponding to each training speech frame, and sorting the training feature vectors corresponding to the training speech frames according to the order of the corresponding speech frames to obtain a training vector sequence;
inputting the training vector sequence into a speech sequence recognition submodel to be trained for phoneme recognition to obtain a phoneme sequence recognition result;
obtaining a first model loss value based on a first difference between the phoneme sequence recognition result and a standard recognition result corresponding to the training speech, the first model loss value being positively correlated with the first difference;
inputting each training feature vector into a speech frame phoneme detection submodel to be trained for phoneme recognition to obtain the speech frame phoneme detection result corresponding to the speech frame phoneme detection submodel;
obtaining a second model loss value based on a second difference between the speech frame phoneme detection result and the standard recognition result corresponding to the training speech, the second model loss value being positively correlated with the second difference;
deriving a target model loss value based on the first model loss value and the second model loss value;
and adjusting parameters of the feature extraction submodel to be trained, the speech frame phoneme detection submodel to be trained, and the speech sequence recognition submodel to be trained based on the target model loss value, the feature extraction submodel, the speech frame phoneme detection submodel, and the speech sequence recognition submodel after parameter adjustment forming the trained speech detection model.
7. The method of claim 5, wherein the inputting the feature vector sequence into the speech sequence recognition submodel for phoneme recognition to obtain a candidate phoneme likelihood corresponding to each of the speech frames comprises:
sequentially acquiring the current feature vector in the feature vector sequence according to the order of the feature vector sequence;
acquiring a phoneme representation vector corresponding to the previous feature vector relative to the current feature vector, the phoneme representation vector corresponding to the previous feature vector being the representation vector of the phoneme obtained by inputting the previous feature vector into the speech sequence recognition submodel for phoneme recognition;
and inputting the phoneme representation vector corresponding to the previous feature vector together with the current feature vector into the speech sequence recognition submodel for phoneme recognition to obtain the candidate phoneme likelihood corresponding to the current speech frame corresponding to the current feature vector.
8. The method of claim 7, wherein the step of obtaining the phoneme representation vector corresponding to the previous feature vector comprises:
inputting the previous feature vector into the speech sequence recognition submodel for phoneme recognition to obtain the candidate phoneme likelihoods corresponding to the previous feature vector, the candidate phoneme likelihoods corresponding to the previous feature vector comprising a likelihood for each candidate phoneme in the candidate phoneme set;
selecting the candidate phoneme with the greatest likelihood from the candidate phoneme set as a target phoneme;
and taking the phoneme representation vector corresponding to the target phoneme as the phoneme representation vector corresponding to the previous feature vector.
9. The method according to claim 7, wherein after inputting the phoneme representation vector corresponding to the previous feature vector and the current feature vector into the speech sequence recognition submodel for phoneme recognition to obtain the candidate phoneme likelihood corresponding to the current speech frame corresponding to the current feature vector, the method further comprises:
when phoneme recognition has not been completed for all feature vectors in the feature vector sequence, returning to the step of sequentially acquiring the current feature vector in the feature vector sequence according to the order of the feature vector sequence, so as to perform phoneme recognition on the feature vectors that have not yet undergone phoneme recognition, until phoneme recognition is completed for all feature vectors in the feature vector sequence.
10. The method according to claim 1, wherein obtaining the speech detection result corresponding to the speech to be detected based on the phoneme likelihood distribution corresponding to the extracted speech frame comprises:
obtaining the detection word position of the target detection word in the speech to be detected based on the phoneme likelihood distributions corresponding to the extracted speech frames;
acquiring the speech at the backward position relative to the detection word position in the speech to be detected as instruction detection speech;
and performing voice instruction detection on the instruction detection speech to obtain a target voice instruction corresponding to the instruction detection speech, so as to perform voice control based on the target voice instruction.
11. A speech detection apparatus, characterized in that the apparatus comprises:
a feature vector processing module, configured to perform feature extraction on each speech frame in the speech to be detected to obtain a feature vector corresponding to each speech frame, and sort the feature vectors corresponding to the speech frames according to the order of the corresponding speech frames to obtain a feature vector sequence;
a candidate phoneme likelihood obtaining module, configured to perform phoneme recognition based on the feature vector sequence to obtain a candidate phoneme likelihood corresponding to each speech frame;
a phoneme sequence obtaining module, configured to filter out speech frames whose phonemes are blank based on the candidate phoneme likelihoods corresponding to the speech frames, and obtain a recognition phoneme sequence corresponding to the speech to be detected based on the candidate phoneme likelihoods corresponding to the speech frames whose phonemes are not blank;
a speech frame extraction module, configured to use a phoneme sequence corresponding to a target detection word selected from the recognition phoneme sequence as a detection word phoneme sequence, and determine, based on the position of the speech frame sequence corresponding to the detection word phoneme sequence in the speech frame sequence of the speech to be detected, an extracted speech frame sequence corresponding to the target detection word in the speech to be detected;
a phoneme likelihood distribution obtaining module, configured to obtain a phoneme likelihood distribution corresponding to each extracted speech frame in the extracted speech frame sequence, the phoneme likelihood distribution corresponding to an extracted speech frame being obtained by performing speech frame phoneme detection separately according to the feature vector corresponding to that extracted speech frame;
and a speech detection result obtaining module, configured to obtain a speech detection result corresponding to the speech to be detected based on the phoneme likelihood distributions corresponding to the extracted speech frames.
12. The apparatus of claim 11, wherein the speech frame extraction module is further configured to:
determine the start position and the end position of the speech frame sequence corresponding to the detection word phoneme sequence in the speech to be detected;
take a forward position corresponding to the start position as a first extraction position, and a backward position corresponding to the end position as a second extraction position;
and take the speech frame sequence between the first extraction position and the second extraction position in the speech to be detected as the extracted speech frame sequence corresponding to the target detection word.
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 10 when executing the computer program.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 10.
15. A computer program product comprising a computer program, characterized in that the computer program realizes the method of any of claims 1 to 10 when executed by a processor.
CN202111128507.1A 2021-09-26 2021-09-26 Voice detection method, device, equipment and storage medium Pending CN114333768A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111128507.1A CN114333768A (en) 2021-09-26 2021-09-26 Voice detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111128507.1A CN114333768A (en) 2021-09-26 2021-09-26 Voice detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114333768A true CN114333768A (en) 2022-04-12

Family

ID=81045474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111128507.1A Pending CN114333768A (en) 2021-09-26 2021-09-26 Voice detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114333768A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114783427A (en) * 2022-06-20 2022-07-22 成都启英泰伦科技有限公司 Voice recognition model compression method based on command words
CN115132196A (en) * 2022-05-18 2022-09-30 腾讯科技(深圳)有限公司 Voice instruction recognition method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
Prabhavalkar et al. A Comparison of sequence-to-sequence models for speech recognition.
CN110444195B (en) Method and device for recognizing voice keywords
Audhkhasi et al. End-to-end ASR-free keyword search from speech
CN110364171B (en) Voice recognition method, voice recognition system and storage medium
KR102323046B1 (en) Speech emotion detection method and apparatus, computer device and storage medium
CN110033758B (en) Voice wake-up implementation method based on small training set optimization decoding network
Wöllmer et al. Bidirectional LSTM networks for context-sensitive keyword detection in a cognitive virtual agent framework
JP7070894B2 (en) Time series information learning system, method and neural network model
EP0732685B1 (en) A system for recognizing continuous speech
CN111145729B (en) Speech recognition model training method, system, mobile terminal and storage medium
WO2021147041A1 (en) Semantic analysis method and apparatus, device, and storage medium
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
WO2004057574A1 (en) Sensor based speech recognizer selection, adaptation and combination
CN114333768A (en) Voice detection method, device, equipment and storage medium
US11132994B1 (en) Multi-domain dialog state tracking
CN111161726B (en) Intelligent voice interaction method, device, medium and system
CN112509560B (en) Voice recognition self-adaption method and system based on cache language model
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
US20040215457A1 (en) Selection of alternative word sequences for discriminative adaptation
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
Soltau et al. Reducing the computational complexity for whole word models
CN116775870A (en) Conversation intention recognition method combined with large model
JP2905674B2 (en) Unspecified speaker continuous speech recognition method
Johansen A comparison of hybrid HMM architecture using global discriminating training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination