CN115240716A - Voice detection method, device and storage medium


Info

Publication number: CN115240716A
Application number: CN202110440811.3A
Authority: CN (China)
Prior art keywords: audio signal, text symbol, module, endpoint, text
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 房雷 (Fang Lei), 耿杰 (Geng Jie)
Original and current assignee: Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.; priority to CN202110440811.3A; publication of CN115240716A.

Classifications

    • G10L 25/78: Detection of presence or absence of voice signals
    • G10L 25/87: Detection of discrete points within a voice signal
    • G10L 15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/197: Probabilistic grammars, e.g. word n-grams
    • G10L 15/26: Speech to text systems
    • G10L 25/30: Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the field of voice detection within the technical field of artificial intelligence, and in particular to a voice detection method, apparatus and storage medium. The method includes: determining a second text symbol from a first text symbol and a first audio signal in an audio signal sequence; determining, from the first text symbol, whether the semantics of the second text symbol reach a rear endpoint; and, when the semantics of the second text symbol do not reach a rear endpoint, taking the second text symbol as the new first text symbol, taking the audio signal following the first audio signal in the audio signal sequence as the new first audio signal, and repeating the step of determining the second text symbol from the first text symbol and the first audio signal, together with the subsequent steps. According to the embodiments of the application, the rear endpoint of the input audio can be determined accurately based on semantics, misjudgment is prevented, the number of models and the deployment process are simplified, and the accuracy of the model is improved.

Description

Voice detection method, device and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for voice detection, and a storage medium.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence studies the design principles and implementation methods of intelligent machines, so that machines can perceive, reason and make decisions.
Voice detection is an important field of AI. As technologies such as voice wakeup and voice recognition are applied ever more widely in daily life, voice detection is a necessary front-end process for them. Voice detection determines whether speech is present in the current environment and locates the start and end positions of that speech, so that speech segments can be separated from noise and sent to the back end for processing such as speech recognition and wakeup. At present there is a lack of voice detection models that can accurately locate the start and end positions of speech; the end position in particular is easily affected by pauses in speech, which creates a risk of misjudgment. A more efficient and more accurate voice detection method is therefore urgently needed.
Disclosure of Invention
In view of the above, a method, an apparatus and a storage medium for voice detection are provided.
In a first aspect, an embodiment of the present application provides a speech detection method, including: determining a second text symbol according to a first text symbol and a first audio signal in an audio signal sequence, wherein the initial value of the first text symbol is a null character, and the second text symbol corresponds to the content of the first audio signal; determining whether the semantics of the second text symbol reach a rear endpoint according to the first text symbol, wherein the rear endpoint represents the end of speech in the audio signal sequence; and under the condition that the semantics of the second text symbol does not reach a rear endpoint, taking the second text symbol as a new first text symbol, taking an audio signal in the audio signal sequence after the first audio signal as a new first audio signal, and repeatedly executing the steps of determining the second text symbol and the following steps according to the first text symbol and the first audio signal in the audio signal sequence.
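For illustration only, the loop of the first aspect can be sketched in Python as follows. This is a minimal sketch, not the claimed implementation; the callables recognize_symbol and reaches_rear_endpoint are hypothetical stand-ins for the model components described in the implementation manners below.

    # Minimal sketch of the detection loop of the first aspect.
    # recognize_symbol and reaches_rear_endpoint are hypothetical callables.
    def detect(audio_signal_sequence, recognize_symbol, reaches_rear_endpoint):
        first_text_symbol = ""  # initial value of the first text symbol: a null character
        for first_audio_signal in audio_signal_sequence:
            # Determine the second text symbol from the first text symbol
            # and the first audio signal.
            second_text_symbol = recognize_symbol(first_text_symbol, first_audio_signal)
            # Determine from the first text symbol whether the semantics of the
            # second text symbol reach the rear endpoint (end of speech).
            if reaches_rear_endpoint(first_text_symbol, second_text_symbol):
                return second_text_symbol
            # Otherwise, the second text symbol becomes the new first text symbol,
            # and the next audio signal becomes the new first audio signal.
            first_text_symbol = second_text_symbol
        return first_text_symbol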
According to this embodiment of the application, the second text symbol is determined from the first text symbol and the first audio signal of the audio signal sequence, and whether the semantics of the second text symbol reach the rear endpoint is determined from the first text symbol. Whether the rear endpoint is reached can thus be judged from semantics, which avoids ending sound pickup early when the user pauses or stutters while speaking, prevents misjudgment, makes rear-endpoint detection more accurate, and improves the user experience. When the semantics of the second text symbol do not reach the rear endpoint, the second text symbol is taken as the new first text symbol, the audio signal following the first audio signal in the audio signal sequence is taken as the new first audio signal, and the step of determining the second text symbol from the first text symbol and the first audio signal, together with the subsequent steps, is repeated. Rear-endpoint detection of the audio can therefore be realized without relying on a complex, manually configured multi-model decision process, which makes the voice detection method more flexible and also improves its accuracy.
According to the first aspect, in a first possible implementation manner of the voice detection method, the method further includes: sequentially detecting whether audio frames contained in each audio signal in the audio signal sequence reach a front end point, wherein the front end point represents the beginning of voice in the audio signal sequence; and when detecting the first audio frame reaching the front end point, determining the audio signal where the audio frame is positioned as the first audio signal, and stopping detection.
According to this embodiment of the application, whether the audio frames contained in the audio signals of the audio signal sequence reach the front endpoint is detected in sequence; when the first audio frame reaching the front endpoint is detected, the audio signal containing that frame is determined to be the first audio signal and detection is stopped, so that the start of speech is located accurately and unnecessary further front-endpoint detection is avoided.
According to a first possible implementation manner of the first aspect, in a second possible implementation manner of the speech detection method, the method is used for a target detection model, where the target detection model includes a speech behavior detection module, an association network module, a semantic endpoint detection module, an encoder module, and a prediction network module, and the encoder module is used to obtain a first feature vector of an audio frame included in an audio signal sequence; the voice behavior detection module is used for determining whether audio frames contained in each audio signal in the audio signal sequence reach a front end point or not according to the first feature vector; the prediction network module is used for obtaining a second feature vector of the first text symbol; the associated network module is used for obtaining the second text symbol according to the first feature vector and the second feature vector; the semantic endpoint detection module is used for determining whether the semantics of the second text symbol reaches a rear endpoint according to the second feature vector.
According to this embodiment of the application, the target detection model comprises the voice behavior detection module, the association network module, the semantic endpoint detection module, the encoder module and the prediction network module. A VAD model and an ASR model can thus be integrated in a single model, and the front endpoint can be detected while the rear endpoint is detected based on semantics. This greatly simplifies the number of models and the deployment process, saves a large amount of resources, and, because the rear endpoint is detected based on semantics, improves the accuracy of voice detection.
According to the second possible implementation manner of the first aspect, in a third possible implementation manner of the voice detection method, the method further includes: pre-training the prediction network module and the semantic endpoint detection module; and training a target detection model comprising the voice behavior detection module, the associated network module, the encoder module, the pre-trained prediction network module and the pre-trained semantic endpoint detection module to obtain a trained target detection model.
According to this embodiment of the application, the prediction network module and the semantic endpoint detection module are pre-trained, and then a target detection model comprising the voice behavior detection module, the association network module, the encoder module, the pre-trained prediction network module and the pre-trained semantic endpoint detection module is trained to obtain the trained target detection model. The training process is thereby more targeted, so judging the rear endpoint based on semantics is more accurate; moreover, thanks to the pre-training, the subsequent training of the model achieves a better training effect, and the prediction accuracy of the trained target detection model is higher.
According to the second possible implementation manner of the first aspect, in a fourth possible implementation manner of the voice detection method, the pre-training of the prediction network module and the semantic endpoint detection module includes: inputting a previous text symbol in a text sample into a prediction network module, and predicting by the prediction network module according to a feature vector of the previous text symbol to obtain a current text symbol; inputting the current text symbol into a semantic endpoint detection module, and determining whether the semantics of the current text symbol reaches a rear endpoint; calculating a first loss value of the prediction network module according to the label of the current text symbol, calculating a second loss value of the semantic endpoint detection module according to the label of whether the current text symbol reaches the rear endpoint, and performing parameter adjustment on the prediction network module and the semantic endpoint detection module according to the first loss value and the second loss value.
According to this embodiment of the application, a previous text symbol in a text sample is input into the prediction network module, and the prediction network module predicts the current text symbol from the feature vector of the previous text symbol. The current text symbol is input into the semantic endpoint detection module, which determines whether the semantics of the current text symbol reach the rear endpoint. A first loss value of the prediction network module is calculated from the label of the current text symbol, a second loss value of the semantic endpoint detection module is calculated from the label indicating whether the current text symbol reaches the rear endpoint, and the parameters of the two modules are adjusted according to the first and second loss values. The model's ability to detect the rear endpoint based on semantics can thus be trained in a targeted way during the pre-training stage, which improves the accuracy of the model's voice detection. Moreover, the pre-training stage is trained on easily obtained text samples and requires no audio samples, which saves the related resources while further improving the accuracy of the model's semantics-based voice detection.
According to the second possible implementation manner of the first aspect, in a fifth possible implementation manner of the speech detection method, the speech behavior detection module stops running when it is determined that an audio frame included in each audio signal in the audio signal sequence reaches a front end point according to the first feature vector.
According to the embodiment of the application, the operation of the voice behavior detection module can be suspended after the front end point is determined, so that resources are saved.
In a second aspect, an embodiment of the present application provides a speech detection apparatus, including: a first determining module, configured to determine a second text symbol according to a first text symbol and a first audio signal in an audio signal sequence, where an initial value of the first text symbol is a null character, and the second text symbol corresponds to the content of the first audio signal; a second determining module, configured to determine, according to the first text symbol, whether the semantics of the second text symbol reach a rear endpoint, where the rear endpoint represents the end of speech in the audio signal sequence; and a third determining module, configured to, when the semantics of the second text symbol do not reach a rear endpoint, use the second text symbol as a new first text symbol, use the audio signal in the audio signal sequence after the first audio signal as a new first audio signal, and repeatedly perform the steps of determining the second text symbol and the subsequent steps according to the first text symbol and the first audio signal in the audio signal sequence.
According to a second aspect, in a first possible implementation manner of the voice detection apparatus, the apparatus further includes: the detection module is used for sequentially detecting whether audio frames contained in each audio signal in the audio signal sequence reach a front end point or not, wherein the front end point represents the beginning of voice in the audio signal sequence; and the fourth determining module is used for determining the audio signal where the audio frame is positioned as the first audio signal when the first audio frame reaching the front end point is detected, and stopping detection.
According to the first possible implementation manner of the second aspect, in a second possible implementation manner of the speech detection apparatus, the apparatus is used for a target detection model, where the target detection model includes a speech behavior detection module, an association network module, a semantic endpoint detection module, an encoder module, and a prediction network module, and the encoder module is used to obtain a first feature vector of an audio frame included in an audio signal sequence; the voice behavior detection module is used for determining whether audio frames contained in each audio signal in the audio signal sequence reach a front end point or not according to the first feature vector; the prediction network module is used for obtaining a second feature vector of the first text symbol; the associated network module is used for obtaining the second text symbol according to the first feature vector and the second feature vector; the semantic endpoint detection module is used for determining whether the semantics of the second text symbol reaches a rear endpoint according to the second feature vector.
In a third possible implementation manner of the voice detection apparatus according to the second possible implementation manner of the second aspect, the apparatus further includes: the pre-training module is used for pre-training the prediction network module and the semantic endpoint detection module; and the training module is used for training a target detection model comprising the voice behavior detection module, the association network module, the coder module, the pre-trained prediction network module and the pre-trained semantic endpoint detection module to obtain the trained target detection model.
In a fourth possible implementation manner of the voice detection apparatus according to the second possible implementation manner of the second aspect, the pre-training module is configured to: inputting a previous text symbol in a text sample into a prediction network module, and predicting by the prediction network module according to a feature vector of the previous text symbol to obtain a current text symbol; inputting the current text symbol into a semantic endpoint detection module, and determining whether the semantics of the current text symbol reaches a rear endpoint; calculating a first loss value of the prediction network module according to the label of the current text symbol, calculating a second loss value of the semantic endpoint detection module according to the label of whether the current text symbol reaches the rear endpoint, and performing parameter adjustment on the prediction network module and the semantic endpoint detection module according to the first loss value and the second loss value.
In a fifth possible implementation manner of the speech detection apparatus according to the second possible implementation manner of the second aspect, the speech behavior detection module stops operation when it is determined that an audio frame included in each audio signal in the audio signal sequence reaches a front end point according to the first feature vector.
In a third aspect, an embodiment of the present application provides a speech detection apparatus, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the speech detection method of the first aspect or one or more of the many possible implementations of the first aspect when executing the instructions.
In a fourth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium, on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the voice detection method of the first aspect or one or more of the multiple possible implementation manners of the first aspect.
In a fifth aspect, an embodiment of the present application provides a terminal device, where the terminal device may perform the voice detection method of the first aspect or one or more of multiple possible implementation manners of the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product, which includes computer readable code or a non-transitory computer readable storage medium carrying computer readable code, and when the computer readable code runs in an electronic device, a processor in the electronic device executes a voice detection method of the first aspect or one or more of the many possible implementations of the first aspect.
These and other aspects of the present application will be more readily apparent from the following description of the embodiment(s).
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic diagram illustrating an implementation environment of a speech detection method according to an embodiment of the present application.
FIG. 2 shows a flow diagram of a training method of a model in a pre-training phase according to an embodiment of the present application.
Fig. 3 shows a schematic diagram of an input-output sequence of a pre-training phase according to an embodiment of the present application.
FIG. 4 shows a flow diagram of a training method of a model of an overall training phase according to an embodiment of the application.
FIG. 5 is a diagram illustrating an input-output sequence of an overall training phase according to an embodiment of the present application.
FIG. 6 shows a flow diagram of a model prediction phase according to an embodiment of the present application.
FIG. 7 shows a flow diagram of a method of speech detection according to an embodiment of the application.
FIG. 8 shows a flow diagram of a method of speech detection according to an embodiment of the application.
FIG. 9 shows a flow diagram of a method of speech detection according to an embodiment of the present application.
FIG. 10 shows a flow diagram of a method of speech detection according to an embodiment of the present application.
Fig. 11 shows a block diagram of a voice detection apparatus according to an embodiment of the present application.
Fig. 12 shows a schematic structural diagram of a terminal device according to an embodiment of the present application.
FIG. 13 shows a block diagram of an electronic device according to an embodiment of the application.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
In the prior art, the Voice Activity Detection (VAD) model and the Automatic Speech Recognition (ASR) model are at least two independent models. Recognizing an audio interval therefore depends on passing the text recognition result between the models, and the rules for recognizing audio intervals must be configured manually; the approach is inflexible, recognition is slow, judgment accuracy is low, and a large amount of resources is consumed. There are also prior-art ASR models that fuse the rear-endpoint recognition of a VAD model, but they depend on labeled audio training data, cannot recognize the front endpoint of the audio, and perform poorly when recognizing the rear endpoint of the audio based on text semantics.
To solve this technical problem, the voice detection method provided by the application can determine, based on the semantics of the recognized text, whether the user intends to continue speaking. No manually configured judgment rules are needed, so the method is more flexible; it determines the rear endpoint of the input audio more accurately and prevents sound pickup from being ended early due to misjudgment.
Fig. 1 is a schematic diagram illustrating an implementation environment of a speech detection method according to an embodiment of the present application. As shown in fig. 1, the implementation environment may include a terminal device and a voice detection platform.
Referring to fig. 1, the terminal device may be a vehicle-mounted terminal 101, a smartphone 102, a smart speaker 103, or a robot 104. Of course, the terminal devices shown in fig. 1 are only examples; the terminal device may also be any other electronic device that supports a voice detection function, such as a netbook, a tablet computer, a notebook computer, a wearable electronic device (e.g., a smart band or a smart watch), a TV, a virtual reality device, a speaker, or an e-ink device. The terminal device may run an application that supports voice detection, such as a navigation application, a voice assistant, or a smart question-and-answer application. Illustratively, the terminal device is used by a user, a user account is logged in to the application running on it, and that user account can be registered with the voice detection platform in advance. The terminal device can be connected to the voice detection platform in a wired manner or in a wireless manner such as Wi-Fi or Bluetooth.
The voice detection platform is used to provide background services for applications that support voice detection. For example, the voice detection platform may perform the method embodiments for training a voice detection model (e.g., the target detection model of the embodiments of the present application, described hereinafter) and send the model to the terminal device (e.g., the model may be exported from the platform by a model format conversion tool for deployment on the terminal device), so that the terminal device uses the model for voice detection. Using the model, the terminal device can accurately identify the front and rear endpoints of the speech in the received audio while performing speech recognition, enabling subsequent applications such as voice control.
The voice detection platform may include a server 201 and a database 202. The server 201 may be a single server or a cluster of multiple servers. The database 202 may be used to store training data, such as training data that may contain large amounts of text, training data for audio, and the like. The server 201 may access the database 202 to obtain training data stored in the database 202, and obtain a model through training of the training data.
Those skilled in the art will appreciate that the number of terminal devices, servers, or databases described above may be greater or fewer. For example, the number of the terminal devices, servers, or databases may be only one, or several tens or hundreds, or more, and other terminal devices, other servers, or other databases may also be included, although not shown in the figure.
The above exemplary description is a system architecture, and the following exemplary description is a method flow for performing voice detection based on the system architecture provided above.
The method flow of speech detection may include a model training phase and a model prediction phase. The method flow of the model training phase is described below with the embodiments of fig. 2 and 4, and the method flow of the model prediction phase is described with the embodiment of fig. 6.
The process of training the model can comprise two stages of pre-training and overall training.
FIG. 2 shows a flow diagram of a training method of a model in a pre-training phase according to an embodiment of the present application. As shown in fig. 2, the method may be applied to an electronic device, which may be a terminal device in the system architecture shown in fig. 1, or the voice detection platform in that architecture, such as the server 201. The training data used in the pre-training stage may be text data, comprising text and a label indicating whether the semantics corresponding to each word (which may be a text symbol) in the text are complete. The model in the pre-training stage includes a prediction network module and a semantic endpoint detection module. The prediction network module may be a twelve-layer transformer structure; the semantic endpoint detection module may be a two-layer long short-term memory (LSTM) structure or another structure capable of detecting the rear endpoint; both modules may also take other structures, which this application does not limit. Here y(u-1) may denote the previous text symbol in an input text sequence, y(u) the current text symbol, e(u) whether the current text symbol reaches the rear endpoint, and u the sequence number of the text symbol. The prediction network module is configured to obtain a high-dimensional feature vector corresponding to the previous text symbol y(u-1) in the text sequence and to obtain the current text symbol y(u) from that high-dimensional feature vector; the semantic endpoint detection module is configured to determine whether the current text symbol y(u) reaches the rear endpoint.
In the pre-training stage, for a given text sequence in the training data, the previous text symbol y(u-1) in the sequence may be used as the input of the prediction network module, the current text symbol y(u) of the sequence as the target output of the prediction network module, and whether the current text symbol reaches the rear endpoint, e(u), as the target output of the semantic endpoint detection module (for example, e(u) takes the value 1 when the rear endpoint is reached and 0 otherwise). The trained model can then determine, from the previous text symbol input, whether the semantics of the current text symbol reach the end of a sentence, i.e., whether the current text symbol y(u) is the last symbol at the end of a sentence. The output of the prediction network module is optimized with a cross-entropy loss, and the output of the semantic endpoint detection module with a binary cross-entropy loss, until the training loss of the pre-training-stage model converges to a predetermined value, yielding the trained pre-training-stage model.
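As an illustration, one such pre-training step might look as follows in PyTorch. This is a minimal sketch under assumptions not stated in the patent: the interface of the two modules (the encode and classify methods), the batching and the tensor shapes are all hypothetical.

    import torch
    import torch.nn.functional as F

    def pretrain_step(prediction_net, endpoint_net, optimizer,
                      prev_symbols, curr_symbols, endpoint_labels):
        # prev_symbols:    (B, U) token ids of y(u-1)
        # curr_symbols:    (B, U) token ids of y(u), used as labels
        # endpoint_labels: (B, U) floats, 1.0 where e(u) marks the rear endpoint
        feats = prediction_net.encode(prev_symbols)        # high-dimensional feature vectors
        symbol_logits = prediction_net.classify(feats)     # (B, U, vocab), predicts y(u)
        endpoint_logits = endpoint_net(feats).squeeze(-1)  # (B, U), predicts e(u)

        # Cross-entropy loss on the predicted text symbols ...
        loss_sym = F.cross_entropy(symbol_logits.transpose(1, 2), curr_symbols)
        # ... and binary cross-entropy loss on the rear-endpoint decision.
        loss_ep = F.binary_cross_entropy_with_logits(endpoint_logits, endpoint_labels)

        loss = loss_sym + loss_ep
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()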
The prediction network module and the semantic endpoint detection module can serve as one part of the overall model. Training only these two modules in the pre-training stage saves training resources, and such targeted training allows the final target detection model to achieve a better training effect. The pre-training stage can use text-only training data, which can be obtained by purchase or by crawling (this application does not limit the manner). Because text data is easy to obtain, pre-training on it improves the accuracy of the target model's semantics-based voice detection while saving the resources of the pre-training stage.
Fig. 3 shows a schematic diagram of an input-output sequence of a pre-training phase according to an embodiment of the present application. As shown in fig. 3, each text symbol in the input sequence of the prediction network module (e.g., the text symbol in each box of that sequence in fig. 3) may correspond to a y(u-1), each text symbol in the output sequence of the prediction network module to a y(u), and each value in the output sequence of the semantic endpoint detection module (e.g., the value in each dashed box of that sequence in fig. 3) to an e(u).
Taking fig. 3 as an example, the text data in the training data may be the Chinese sentence "今天天气怎么样" ("How is the weather today"). The corresponding input sequence of the prediction network module may then contain the text symbols "今", "天", "天", "气", "怎", "么", the corresponding output sequence of the prediction network module is "天", "天", "气", "怎", "么", "样", and the corresponding output sequence of the semantic endpoint detection module is 000001. For example, when the text symbol input to the prediction network module is "今", the text symbol output by the prediction network module may be "天"; at this point the value output by the semantic endpoint detection module is 0, indicating that the end of the sentence has not been reached (e(u) equal to 1 indicates that it has). When the input text symbol is "么", the output text symbol may be "样"; in this case the semantic endpoint detection module outputs 1, indicating that the end of the sentence is reached, that is, the sentence terminates.
Once the trained pre-training-stage model is obtained, it can be spliced with other modules to obtain the model for the overall training stage.
FIG. 4 shows a flow diagram of a training method of a model of an overall training phase according to an embodiment of the application. As shown in fig. 4, the training data used in the overall training stage may include a speech corpus comprising audio data and the corresponding text data. The audio data may include audio together with a label indicating whether each frame of the audio contains human voice; the text data may include text together with a label indicating whether the semantics corresponding to each word (i.e., each text symbol) in the text are complete (e.g., whether it ends a complete sentence); the content of the text data corresponds to the content of the human voice. On top of the pre-trained prediction network module and semantic endpoint detection module, the model of the overall training stage further comprises an encoder module, a voice behavior detection module and an association network module. The encoder module may be a six-layer transformer structure or another structure capable of extracting high-dimensional feature vectors from audio; the voice behavior detection module may be a one-layer LSTM structure or another structure capable of detecting the front endpoint; the association network module may be a one-layer fully connected network structure or another structure capable of predicting the next text symbol; all three modules may also take other structures. Here x(t) may denote the audio of the current frame, v(t) whether the audio of the current frame contains human voice, and t the sequence number of the audio frame. The encoder module is used to obtain a high-dimensional feature vector of the current frame from the input audio x(t); the voice behavior detection module is used to detect, from that high-dimensional feature vector, whether the input x(t) contains human voice; and the association network module is used to predict the current text symbol y(u) from the high-dimensional feature vector of the audio output by the encoder module and the high-dimensional feature vector of the previous text symbol output by the prediction network module. Note that during pre-training the prediction network module outputs the predicted current text symbol, whereas in the overall training stage it outputs an intermediate result, namely the high-dimensional feature vector of the previous text symbol.
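The module structures named above might be declared as follows in PyTorch. This is a minimal sketch: the patent does not fix feature dimensions, vocabulary size, attention heads or the exact wiring, so all of those are illustrative assumptions here.

    import torch.nn as nn

    FEAT_DIM, VOCAB = 512, 4096  # illustrative dimensions, not from the patent

    class TargetDetectionModel(nn.Module):
        def __init__(self):
            super().__init__()
            # Encoder module: six-layer transformer over audio frames x(t).
            enc_layer = nn.TransformerEncoderLayer(
                d_model=FEAT_DIM, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
            # Voice behavior detection module: one-layer LSTM emitting v(t).
            self.vad = nn.LSTM(FEAT_DIM, FEAT_DIM, num_layers=1, batch_first=True)
            self.vad_head = nn.Linear(FEAT_DIM, 1)
            # Prediction network module: twelve-layer transformer over y(u-1).
            self.embed = nn.Embedding(VOCAB, FEAT_DIM)
            pred_layer = nn.TransformerEncoderLayer(
                d_model=FEAT_DIM, nhead=8, batch_first=True)
            self.prediction_net = nn.TransformerEncoder(pred_layer, num_layers=12)
            # Association network module: one fully connected layer emitting y(u)
            # from the concatenated audio and text feature vectors.
            self.association = nn.Linear(2 * FEAT_DIM, VOCAB)
            # Semantic endpoint detection module: two-layer LSTM emitting e(u).
            self.sem_ep = nn.LSTM(FEAT_DIM, FEAT_DIM, num_layers=2, batch_first=True)
            self.sem_ep_head = nn.Linear(FEAT_DIM, 1)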
In the overall training stage, for a given segment of audio in the training data, each frame of audio x(t) may be used as the input of the encoder module; for the text data corresponding to that audio, the previous text symbol y(u-1) in the text sequence may be used as the input of the prediction network module. Corresponding to these two inputs, whether the audio of the current frame contains human voice, v(t), may be used as the target output of the voice behavior detection module (v(t) takes the value 1 when the frame contains human voice and 0 otherwise), the current text symbol y(u) of the text sequence as the target output of the association network module, and whether the current text symbol reaches the rear endpoint, e(u), as the target output of the semantic endpoint detection module. Binary cross-entropy losses are used to optimize the output v(t) of the voice behavior detection module and the output e(u) of the semantic endpoint detection module, and a transducer loss is used to optimize the output y(u) of the association network module; the transducer loss resolves the problem that the sequence number t of the audio frames and the sequence number u of the text symbols do not correspond one-to-one. Training continues until the training loss of the overall-stage model converges, yielding the trained target detection model.
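As an illustration, one overall training step might combine the three losses as sketched below, assuming torchaudio's rnnt_loss as one available transducer-loss implementation; the model's forward interface and the tensor shapes are hypothetical.

    import torch
    import torch.nn.functional as F
    from torchaudio.functional import rnnt_loss  # one available transducer loss

    def overall_training_step(model, optimizer, frames, frame_lens,
                              symbols, symbol_lens, voice_labels, endpoint_labels):
        # frames:          (B, T, feat) audio frames x(t)
        # symbols:         (B, U) token ids y(u)
        # voice_labels:    (B, T) floats, 1.0 where the frame contains human voice, v(t)
        # endpoint_labels: (B, U) floats, 1.0 at the rear endpoint, e(u)
        joint_logits, vad_logits, ep_logits = model(frames, symbols)

        # Transducer loss on y(u): it marginalizes over alignments, so the
        # frame index t and the symbol index u need not correspond one-to-one.
        loss_asr = rnnt_loss(joint_logits, symbols.int(),
                             frame_lens.int(), symbol_lens.int(), blank=0)
        # Binary cross-entropy losses on v(t) and e(u).
        loss_vad = F.binary_cross_entropy_with_logits(vad_logits, voice_labels)
        loss_ep = F.binary_cross_entropy_with_logits(ep_logits, endpoint_labels)

        loss = loss_asr + loss_vad + loss_ep
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()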
FIG. 5 is a diagram illustrating an input-output sequence of an overall training phase according to an embodiment of the present application. As shown in fig. 5, each frame of audio in the input audio sequence of the encoder module may correspond to an x(t), and the input sequence of the prediction network module is as in fig. 3. Each text symbol in the output sequence of the association network module (e.g., the text symbol in each box of that sequence in fig. 5) may correspond to a y(u); each value in the output sequence of the semantic endpoint detection module (e.g., the value in each dashed box of that sequence in fig. 5) may correspond to an e(u); and each value in the output sequence of the voice behavior detection module (e.g., the value in each dashed box of that sequence in fig. 5) may correspond to a v(t). Each v(t) corresponds to an audio frame, each y(u) and e(u) corresponds to a text symbol, and a text symbol may correspond to one or more audio frames, which this application does not limit. When one text symbol corresponds to a plurality of audio frames, one y(u) is obtained from the plurality of x(t) and one y(u-1); the plurality of x(t) corresponding to one text symbol is what is referred to as an "audio signal" hereinafter.
Taking fig. 5 as an example, for a given segment of input audio and text sequence (e.g., "今天天气怎么样") in the training data, the output sequence of the association network module may be "今", "天", "天", "气", "怎", "么", "样", the output sequence of the semantic endpoint detection module 0000001, and the output sequence of the voice behavior detection module 00111110011110. For a given segment of audio (e.g., the input audio of the encoder module in fig. 5), a single input audio frame x(t) may correspond to one output y(u) (i.e., a predicted current text symbol, e.g., "今"), or several input frames may together correspond to one output y(u); the same holds for the outputs e(u). Each input frame x(t) also corresponds to an output v(t): v(t) takes the value 1 when frame x(t) contains human voice, and 0 otherwise. The first input y(u-1) of the prediction network module in a given text sequence may be a default value (e.g., a null character), and the first output y(u) of the association network module (e.g., the first text symbol "今" in the output sequence of the association network module in fig. 5) may be determined from that default value and the corresponding audio frame or frames x(t).
The trained target detection model structurally fuses a VAD model and an ASR model, so several models are merged into one. The association network module, the encoder module and the prediction network module can act as the transducer structure of an ASR model to recognize the text in the audio, while the voice behavior detection module and the semantic endpoint detection module can respectively act as VAD models that detect the front and rear endpoints of the audio. Because the endpoints of the audio are judged through semantics, no complex, manually configured multi-model decision process is needed; at the same time, the number of models and the deployment process are simplified, and a large amount of resources is saved.
FIG. 6 shows a flow diagram of a model prediction phase according to an embodiment of the present application. Through the pre-training phase and the overall training phase, a trained target detection model is obtained, and this model can be deployed in a terminal device. As shown in fig. 6, the flow of the model prediction phase may include:
step S601, front end point detection stage.
In the front-endpoint detection stage, the target detection model deployed on the terminal device is turned on by default, as shown in fig. 6 (a). At this time the model may run only the encoder module and the voice behavior detection module, to save computation while the model runs. The two running modules can monitor for voice in the background of the terminal device; when the value of v(t) output by the voice behavior detection module is 1, the front endpoint is reached (i.e., the current audio frame is recognized as containing human voice), and the flow may enter the next stage, the rear-endpoint detection stage of step S602.
The user may open the model by opening the application program supporting the voice detection in the terminal device to enter the front-end detection stage, and may also open the model by saying a wake-up word (e.g., "xiaozhi") to enter the front-end detection stage, which is not limited in this application.
For example, after the front-endpoint detection stage is entered and the user speaks, the voice behavior detection module of the model may determine that the user is speaking (v(t) takes the value 1) and begin sound pickup; at this point the voice behavior detection module may stop running to save resources, and the flow enters the rear-endpoint detection stage. In other words, once the v(t) output by the voice behavior detection module is 1, the module can stop working and stop detecting. The voice behavior detection module may be restarted after the user's wake-up instruction is detected, in order to detect the front endpoint again.
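A minimal sketch of this front-endpoint stage, assuming a per-frame streaming interface and hypothetical callables for the encoder module and the voice behavior detection module:

    # Front-endpoint detection: only the encoder and the voice behavior
    # detection module run, until some frame yields v(t) == 1.
    def front_endpoint_stage(frame_stream, encoder, vad):
        for t, frame in enumerate(frame_stream):
            feature = encoder(frame)   # first feature vector of the audio frame
            v_t = vad(feature)         # 1 if the frame contains human voice
            if v_t == 1:
                return t               # front endpoint reached; VAD can stop running
        return None                    # no speech detected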
Step S602, a back-end point detection stage.
In the rear-endpoint detection stage, as shown in fig. 6 (b), the model may run only the encoder module, the prediction network module, the association network module and the semantic endpoint detection module, and stop running the voice behavior detection module. The encoder module obtains a high-dimensional feature vector for each frame of audio x(t) input while the user speaks. The prediction network module obtains a high-dimensional feature vector of the previous text symbol y(u-1) recognized by the association network module (for the first text symbol y(u-1) input when the prediction network module starts running, a default value such as a null character may be used). The association network module then recognizes the current text symbol y(u) spoken by the user from the high-dimensional feature vector of the audio x(t) and the high-dimensional feature vector of the text symbol y(u-1); the recognized y(u) serves as the next input of the prediction network module, i.e., the next y(u-1). The semantic endpoint detection module judges, based on an understanding of the content, whether the current text symbol y(u) completes the sentence, that is, whether y(u) is the last word of a sentence expressing complete semantics. When the value of e(u) output by the semantic endpoint detection module is 1, the rear endpoint is reached (i.e., the sentence semantics are judged complete at this time), and sound pickup may be stopped to enter the next stage (for example, a language understanding stage, which this application does not limit).
For example, after the rear-endpoint detection stage is entered, when the user pauses while speaking, the semantic endpoint detection module can judge whether the semantics are complete. If the semantics are incomplete, sound pickup continues; once the semantics are complete, or the user's pause exceeds a preset time, sound pickup can be turned off and the voice detection ends. This prevents sound pickup from being stopped early due to misjudgment, improves the prediction accuracy of the target detection model, and improves the user experience.
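A minimal sketch of this rear-endpoint stage, including the pause timeout described above; the module callables, the convention that the association network returns None when no new symbol is recognized, and the timeout constant are illustrative assumptions:

    # Rear-endpoint detection: recognize symbols until the semantics are
    # complete (e(u) == 1) or the user's pause exceeds a preset time.
    def rear_endpoint_stage(frame_stream, encoder, prediction_net,
                            association_net, endpoint_net, max_pause_frames=100):
        prev_symbol = ""      # first y(u-1): a null character
        silent_frames = 0
        transcript = []
        for frame in frame_stream:
            audio_feat = encoder(frame)               # feature vector of x(t)
            text_feat = prediction_net(prev_symbol)   # feature vector of y(u-1)
            curr_symbol = association_net(audio_feat, text_feat)  # y(u), or None
            if curr_symbol is None:                   # pause: no new symbol
                silent_frames += 1
                if silent_frames > max_pause_frames:  # pause exceeds preset time
                    break                             # stop sound pickup anyway
                continue
            silent_frames = 0
            transcript.append(curr_symbol)
            if endpoint_net(curr_symbol) == 1:        # e(u) == 1: semantics complete
                break                                 # rear endpoint reached
            prev_symbol = curr_symbol                 # y(u) becomes the new y(u-1)
        return "".join(transcript)                    # hand off to language understanding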
FIG. 7 shows a flow diagram of a method of speech detection according to an embodiment of the application. The method can be used for a terminal device, as shown in fig. 7, and includes:
step S701, determining a second text symbol according to a first text symbol and a first audio signal of an audio signal sequence, wherein an initial value of the first text symbol is a null character, and the second text symbol corresponds to the content of the first audio signal;
step S702, according to the first text symbol, determining whether the semantic meaning of the second text symbol reaches a rear end point, wherein the rear end point represents the end of voice in an audio signal sequence;
step S703, in a case that the semantic meaning of the second text symbol does not reach the back end point, taking the second text symbol as a new first text symbol, taking the audio signal in the audio signal sequence after the first audio signal as a new first audio signal, and repeatedly executing the steps of determining the second text symbol and the following steps according to the first text symbol and the first audio signal in the audio signal sequence.
According to this embodiment of the application, the second text symbol is determined from the first text symbol and the first audio signal of the audio signal sequence, and whether the semantics of the second text symbol reach the rear endpoint is determined from the first text symbol. Whether the rear endpoint is reached can thus be judged from semantics, which avoids ending sound pickup early when the user pauses while speaking, prevents misjudgment, makes rear-endpoint detection more accurate, and improves the user experience. When the semantics of the second text symbol do not reach the rear endpoint, the second text symbol is taken as the new first text symbol, the audio signal following the first audio signal in the audio signal sequence is taken as the new first audio signal, and the step of determining the second text symbol, together with the subsequent steps, is repeated. Rear-endpoint detection of the audio can therefore be realized without relying on a complex, manually configured multi-model decision process, which makes the voice detection method more flexible and also improves its accuracy.
The first audio signal may include one audio frame or a plurality of audio frames, and the content of the first audio signal may be the content of the user's speech; the end of speech may correspond to the user having no intention of continuing to speak. An audio frame in the first audio signal may correspond to x(t) in fig. 6 (b), and the first text symbol to y(u-1) in fig. 6 (b); when the first of the second text symbols is predicted, there is no previous text symbol, so the first text symbol at that moment may be a null character (i.e., its initial value is a null character). The second text symbol may correspond to y(u) in fig. 6 (b), and whether its semantics reach the rear endpoint may be determined from, for example, the value of e(u) in fig. 6 (b). The case where the semantics of the second text symbol do not reach the rear endpoint may include the user pausing after a word; when the semantics of the second text symbol do reach the rear endpoint, sound pickup may be stopped and subsequent operations (such as language understanding and dialog control) performed, which this application does not limit.
An example of steps S701-S703 may refer to step S602 in fig. 6.
FIG. 8 shows a flow diagram of a method of speech detection according to an embodiment of the present application. As shown in fig. 8, the method further includes:
step S801, sequentially detecting whether audio frames contained in each audio signal in an audio signal sequence reach a front end point, wherein the front end point represents the start of voice in the audio signal sequence;
step S802, when detecting a first audio frame reaching the front end point, determining an audio signal where the audio frame is located as the first audio signal, and stopping the detection.
According to this embodiment of the application, whether the audio frames contained in the audio signals of the audio signal sequence reach the front endpoint is detected in sequence; when the first audio frame reaching the front endpoint is detected, the audio signal containing that frame is determined to be the first audio signal and the detection is stopped, so that the start of speech is located and further front-endpoint detection computation is avoided.
Each audio frame may or may not contain voice; whether an audio frame reaches the front endpoint may be determined from, for example, the value of v(t) in fig. 6 (a), and the start of voice may correspond to the user beginning to speak.
An example of steps S801-S802 may refer to step S601 in fig. 6.
In a possible implementation manner, the method may be applied to an object detection model, where the object detection model includes a speech behavior detection module, an association network module, a semantic endpoint detection module, an encoder module, and a prediction network module, and the encoder module is configured to obtain a first feature vector of an audio frame included in an audio signal sequence; the voice behavior detection module is used for determining whether audio frames contained in each audio signal in the audio signal sequence reach a front end point or not according to the first feature vector; the prediction network module is used for obtaining a second feature vector of the first text symbol; the associated network module is used for obtaining the second text symbol according to the first feature vector and the second feature vector; the semantic endpoint detection module is used for determining whether the semantics of the second text symbol reaches a rear endpoint according to the second feature vector.
According to this embodiment of the application, the target detection model comprises the voice behavior detection module, the association network module, the semantic endpoint detection module, the encoder module and the prediction network module. A VAD model and an ASR model can thus be integrated in a single model, and the front endpoint can be detected while the rear endpoint is detected based on semantics. This greatly simplifies the number of models and the deployment process, saves a large amount of resources, and, because the rear endpoint is detected based on semantics, improves the accuracy of voice detection.
The encoder module may be connected to the voice behavior detection module and the association network module, and the prediction network module may be connected to the association network module and the semantic endpoint detection module. The voice behavior detection module may be a one-layer LSTM structure, the association network module a one-layer fully connected network structure, the semantic endpoint detection module a two-layer LSTM structure, the encoder module a six-layer transformer structure, and the prediction network module a twelve-layer transformer structure; all of these modules may also take other structures. The first feature vector comprises, for example, the high-dimensional feature vector of the audio described above, and the second feature vector comprises, for example, the high-dimensional feature vector of the text symbol described above.
Examples of the voice behavior detection module, the association network module, the semantic endpoint detection module, the encoder module, and the prediction network module may be found in fig. 4.
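To make the wiring of the five modules concrete, the following is a minimal PyTorch sketch, not the model of this application. The layer counts follow the text above (one-layer LSTM voice behavior detection module, one-layer fully-connected association network, two-layer LSTM semantic endpoint detection module, six-layer Transformer encoder, twelve-layer Transformer prediction network); the feature dimension, hidden size, head count, and vocabulary size are illustrative assumptions.

```python
# Hedged sketch of the target detection model described above; dimensions and
# vocabulary size are assumptions for illustration.
import torch
import torch.nn as nn

class TargetDetectionModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=4000):
        super().__init__()
        self.audio_proj = nn.Linear(feat_dim, hidden)
        self.encoder = nn.TransformerEncoder(                    # six-layer Transformer
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=6)
        self.embed = nn.Embedding(vocab_size, hidden)            # text symbols -> vectors
        self.prediction_net = nn.TransformerEncoder(             # twelve-layer Transformer
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=12)
        self.vad = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)  # voice behavior detection
        self.vad_head = nn.Linear(hidden, 1)                     # front-endpoint score v(t)
        self.assoc = nn.Linear(2 * hidden, vocab_size)           # one-layer association network
        self.sem_endpoint = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.sem_head = nn.Linear(hidden, 1)                     # rear-endpoint score e(u)

    def forward(self, audio_feats, prev_tokens):
        # audio_feats: (B, T, feat_dim); prev_tokens: (B, U) previous text symbols
        f1 = self.encoder(self.audio_proj(audio_feats))          # first feature vectors
        v, _ = self.vad(f1)
        front_scores = torch.sigmoid(self.vad_head(v))           # per-frame front-endpoint probability
        f2 = self.prediction_net(self.embed(prev_tokens))        # second feature vectors
        # association network: combine the latest audio and text states to predict the next symbol
        next_symbol_logits = self.assoc(torch.cat([f1[:, -1], f2[:, -1]], dim=-1))
        e, _ = self.sem_endpoint(f2)
        rear_scores = torch.sigmoid(self.sem_head(e))            # semantic rear-endpoint probability
        return front_scores, next_symbol_logits, rear_scores
```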
In a possible implementation manner, the voice behavior detection module stops running once it determines, according to the first feature vector, that an audio frame contained in an audio signal in the audio signal sequence has reached the front endpoint.
According to this embodiment of the application, the operation of the voice behavior detection module can be suspended after the front endpoint is determined, thereby saving resources.
FIG. 9 shows a flow diagram of a method of speech detection according to an embodiment of the application. The method can be used in the voice detection platform. As shown in fig. 9, the method further includes:
step S901, pre-training the prediction network module and the semantic endpoint detection module;
step S902, training a target detection model comprising a voice behavior detection module, an association network module, an encoder module, a pre-trained prediction network module and a pre-trained semantic endpoint detection module to obtain a trained target detection model.
According to this embodiment of the application, the prediction network module and the semantic endpoint detection module are pre-trained, and the target detection model comprising the voice behavior detection module, the association network module, the encoder module, the pre-trained prediction network module, and the pre-trained semantic endpoint detection module is then trained to obtain the trained target detection model. This makes the training process more targeted, so that the rear endpoint judged from semantics is more accurate; through pre-training, the subsequent training can achieve a better training effect, and the trained target detection model has higher prediction accuracy.
The methods of pre-training and training are not limited in this application, nor is the method of obtaining the target detection model from the pre-trained prediction network module and the pre-trained semantic endpoint detection module; for example, the pre-trained prediction network module and the pre-trained semantic endpoint detection module may be spliced together with the voice behavior detection module, the association network module, and the encoder module.
An example of step S901 may be shown with reference to the flow of the model training method in the pre-training phase in fig. 2, and an example of step S902 may be shown with reference to the flow of the model training method in the overall training phase in fig. 4.
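The two stages of steps S901-S902 can be outlined as follows. This is a hedged sketch: pretrain_on_text and train_full_model are hypothetical callables standing in for the flows of fig. 2 and fig. 4, and TargetDetectionModel refers to the sketch given earlier.

```python
# Outline of the two-stage training in steps S901-S902. `pretrain_on_text` and
# `train_full_model` are hypothetical stand-ins supplied by the caller, not
# routines defined by this application.
def build_and_train(text_samples, audio_samples, pretrain_on_text, train_full_model):
    model = TargetDetectionModel()
    # step S901: pre-train the prediction network module and the semantic
    # endpoint detection module on text samples alone
    pretrain_on_text(model.prediction_net, model.sem_endpoint, text_samples)
    # step S902: the pre-trained modules are spliced with the encoder, voice
    # behavior detection, and association network modules; train the whole model
    train_full_model(model, audio_samples)
    return model
```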
FIG. 10 shows a flow diagram of a method of speech detection according to an embodiment of the present application. As shown in fig. 10, the pre-training of the prediction network module and the semantic endpoint detection module includes:
step S1001, inputting a previous text symbol in a text sample into a prediction network module, and predicting by the prediction network module according to a feature vector of the previous text symbol to obtain a current text symbol;
step S1002, inputting the current text symbol into a semantic endpoint detection module, and determining whether the semantic of the current text symbol reaches a rear endpoint;
step S1003, calculating a first loss value of the prediction network module according to the label of the current text symbol, calculating a second loss value of the semantic endpoint detection module according to the label of whether the current text symbol reaches the rear endpoint, and performing parameter adjustment on the prediction network module and the semantic endpoint detection module according to the first loss value and the second loss value.
According to this embodiment of the application, a previous text symbol in a text sample is input into the prediction network module, and the prediction network module predicts the current text symbol according to the feature vector of the previous text symbol. The current text symbol is input into the semantic endpoint detection module, which determines whether the semantics of the current text symbol reach the rear endpoint. A first loss value of the prediction network module is calculated according to the label of the current text symbol, a second loss value of the semantic endpoint detection module is calculated according to the label of whether the current text symbol reaches the rear endpoint, and the parameters of the prediction network module and the semantic endpoint detection module are adjusted according to the first loss value and the second loss value. In this way, the model's ability to detect the rear endpoint based on semantics can be trained in a targeted manner during the pre-training stage, improving the accuracy of the model for voice detection. Moreover, because the pre-training stage uses easily obtained text samples and requires no audio samples, relevant resources can be saved while the accuracy of semantics-based voice detection is further improved.
The text sample can be obtained by crawling or purchasing, and the method for obtaining the text sample is not limited in the application. The method of calculating the first loss value and the second loss value is not limited in this application, and for example, the first loss value may be calculated using cross-entropy loss and the second loss value may be calculated using binary cross-entropy loss. The previous text symbol in the text sample may be, for example, y (u-1) in fig. 2, the current text symbol may be, for example, y (u) in fig. 2, and whether the semantics of the current text symbol reaches the rear endpoint may be determined according to, for example, the value of e (u) in fig. 2.
Examples of steps S1001 to S1003 can be illustrated with reference to the flow of the model training method in the pre-training stage in fig. 2.
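A single pre-training step of S1001-S1003 might look like the sketch below. It assumes the loss choices suggested above (cross-entropy for the predicted symbol, binary cross-entropy for the rear-endpoint decision); the module sizes and the extra symbol head are illustrative assumptions rather than details fixed by this application.

```python
# Hedged sketch of pre-training steps S1001-S1003: predict the current text
# symbol y(u) from the previous symbol y(u-1), score the rear endpoint e(u),
# and adjust parameters with the two losses named in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PretrainModules(nn.Module):
    def __init__(self, vocab_size=4000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.prediction_net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=12)
        self.symbol_head = nn.Linear(hidden, vocab_size)   # hypothetical head for y(u)
        self.sem_endpoint = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.sem_head = nn.Linear(hidden, 1)               # rear-endpoint score e(u)

def pretrain_step(m, optimizer, prev_tokens, target_tokens, endpoint_labels):
    # prev_tokens:     (B, U) long tensor of previous text symbols y(u-1)
    # target_tokens:   (B, U) long tensor labeling the current symbols y(u)
    # endpoint_labels: (B, U) float tensor, 1.0 where the semantics reach the rear endpoint
    f2 = m.prediction_net(m.embed(prev_tokens))            # feature vectors of previous symbols
    symbol_logits = m.symbol_head(f2)                      # step S1001: predict current symbols
    e, _ = m.sem_endpoint(f2)                              # step S1002: semantic endpoint check
    endpoint_logits = m.sem_head(e).squeeze(-1)
    loss1 = F.cross_entropy(symbol_logits.transpose(1, 2), target_tokens)         # first loss value
    loss2 = F.binary_cross_entropy_with_logits(endpoint_logits, endpoint_labels)  # second loss value
    (loss1 + loss2).backward()                             # step S1003: parameter adjustment
    optimizer.step()
    optimizer.zero_grad()
    return loss1.item(), loss2.item()
```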
Fig. 11 shows a block diagram of a voice detection apparatus according to an embodiment of the present application. As shown in fig. 11, the apparatus includes:
a first determining module 1101, configured to determine a second text symbol according to a first text symbol and a first audio signal in an audio signal sequence, where the initial value of the first text symbol is a null character, and the second text symbol corresponds to the content of the first audio signal;
a second determining module 1102, configured to determine, according to the first text symbol, whether the semantics of the second text symbol reach a rear endpoint, where the rear endpoint represents the end of speech in the audio signal sequence;
a third determining module 1103, configured to, in a case that the semantics of the second text symbol do not reach the rear endpoint, take the second text symbol as a new first text symbol, take the audio signal after the first audio signal in the audio signal sequence as a new first audio signal, and repeatedly perform the step of determining the second text symbol according to the first text symbol and the first audio signal in the audio signal sequence, and the subsequent steps.
According to this embodiment of the application, the second text symbol is determined according to the first text symbol and the first audio signal of the audio signal sequence, and whether the semantics of the second text symbol reach the rear endpoint is determined according to the first text symbol. Because the rear endpoint is judged from semantics, the recording is not ended prematurely when a user pauses or hesitates while speaking; misjudgment is prevented, the detection of the rear endpoint is more accurate, and the user experience is improved. When the semantics of the second text symbol do not reach the rear endpoint, the second text symbol is taken as a new first text symbol, the audio signal after the first audio signal in the audio signal sequence is taken as a new first audio signal, and the step of determining the second text symbol according to the first text symbol and the first audio signal, together with the subsequent steps, is performed repeatedly. In this way, rear-endpoint detection of the audio can be achieved without relying on a complex, manually configured multi-model decision process, which makes the voice detection method more flexible and also improves its accuracy.
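The loop carried out by the three modules can be summarized in a short sketch; decode_symbol and reaches_rear_endpoint are hypothetical callables standing in for the association network and the semantic endpoint detection module, and are not defined by this application.

```python
# Hedged sketch of the loop implemented by modules 1101-1103: starting from a
# null first text symbol, decode a second text symbol from the current audio
# signal and stop once its semantics reach the rear endpoint.
def detect_speech_end(signal_sequence, start_index, decode_symbol, reaches_rear_endpoint):
    first_text_symbol = ""                         # initial value is a null character
    for first_audio_signal in signal_sequence[start_index:]:
        second_text_symbol = decode_symbol(first_text_symbol, first_audio_signal)
        if reaches_rear_endpoint(second_text_symbol, first_text_symbol):
            return second_text_symbol              # speech in the sequence has ended
        first_text_symbol = second_text_symbol     # becomes the new first text symbol
    return first_text_symbol                       # sequence exhausted without a rear endpoint
```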
In one possible implementation, the apparatus further includes: a detection module, used for sequentially detecting whether the audio frames contained in each audio signal in the audio signal sequence reach the front endpoint, wherein the front endpoint represents the beginning of voice in the audio signal sequence; and a fourth determining module, used for determining the audio signal where the audio frame is located as the first audio signal and stopping the detection when the first audio frame reaching the front endpoint is detected.
According to this embodiment of the application, whether the audio frames contained in each audio signal in the audio signal sequence reach the front endpoint is detected in sequence; when the first audio frame reaching the front endpoint is detected, the audio signal where that frame is located is determined as the first audio signal, and the detection stops.
In a possible implementation manner, the apparatus is used for a target detection model, where the target detection model includes a voice behavior detection module, an association network module, a semantic endpoint detection module, an encoder module, and a prediction network module. The encoder module is used to obtain a first feature vector of the audio frames included in the audio signal sequence; the voice behavior detection module is used to determine, according to the first feature vector, whether the audio frames contained in each audio signal in the audio signal sequence reach the front endpoint; the prediction network module is used to obtain a second feature vector of the first text symbol; the association network module is used to obtain the second text symbol according to the first feature vector and the second feature vector; and the semantic endpoint detection module is used to determine, according to the second feature vector, whether the semantics of the second text symbol reach the rear endpoint.
According to this embodiment of the application, the target detection model includes the voice behavior detection module, the association network module, the semantic endpoint detection module, the encoder module, and the prediction network module, so that the VAD model and the ASR model can be integrated into one model. The front endpoint can be detected while the rear endpoint is detected based on semantics, which greatly simplifies the number of models and the deployment process, saves a large amount of resources, and improves the accuracy of voice detection through the semantically detected rear endpoint.
In one possible implementation, the apparatus further includes: a pre-training module, used for pre-training the prediction network module and the semantic endpoint detection module; and a training module, used for training a target detection model comprising the voice behavior detection module, the association network module, the encoder module, the pre-trained prediction network module, and the pre-trained semantic endpoint detection module to obtain the trained target detection model.
According to this embodiment of the application, the prediction network module and the semantic endpoint detection module are pre-trained, and the target detection model comprising the voice behavior detection module, the association network module, the encoder module, the pre-trained prediction network module, and the pre-trained semantic endpoint detection module is then trained to obtain the trained target detection model. This makes the training process more targeted, so that the rear endpoint judged from semantics is more accurate; through pre-training, the subsequent training can achieve a better training effect, and the trained target detection model has higher prediction accuracy.
In one possible implementation, the pre-training module is configured to: input a previous text symbol in a text sample into the prediction network module, the prediction network module predicting the current text symbol according to the feature vector of the previous text symbol; input the current text symbol into the semantic endpoint detection module and determine whether the semantics of the current text symbol reach the rear endpoint; and calculate a first loss value of the prediction network module according to the label of the current text symbol, calculate a second loss value of the semantic endpoint detection module according to the label of whether the current text symbol reaches the rear endpoint, and adjust the parameters of the prediction network module and the semantic endpoint detection module according to the first loss value and the second loss value.
According to this embodiment of the application, a previous text symbol in a text sample is input into the prediction network module, and the prediction network module predicts the current text symbol according to the feature vector of the previous text symbol. The current text symbol is input into the semantic endpoint detection module, which determines whether the semantics of the current text symbol reach the rear endpoint. A first loss value of the prediction network module is calculated according to the label of the current text symbol, a second loss value of the semantic endpoint detection module is calculated according to the label of whether the current text symbol reaches the rear endpoint, and the parameters of the two modules are adjusted according to the first loss value and the second loss value. In this way, the model's ability to detect the rear endpoint based on semantics can be trained in a targeted manner during the pre-training stage, improving the accuracy of the model for voice detection. Moreover, because the pre-training stage uses easily obtained text samples and requires no audio samples, relevant resources can be saved while the accuracy of semantics-based voice detection is further improved.
In a possible implementation manner, the voice behavior detection module stops operation once it determines, according to the first feature vector, that an audio frame contained in an audio signal in the audio signal sequence has reached the front endpoint.
According to this embodiment of the application, the operation of the voice behavior detection module can be suspended after the front endpoint is determined, thereby saving resources.
Fig. 12 shows a schematic structural diagram of a terminal device according to an embodiment of the present application. Taking the terminal device as a mobile phone as an example, fig. 12 shows a schematic structural diagram of a mobile phone 200.
The mobile phone 200 may include a processor 210, an external memory interface 220, an internal memory 221, a USB interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 251, a wireless communication module 252, an audio module 270, a speaker 270A, a receiver 270B, a microphone 270C, an earphone interface 270D, a sensor module 280, a key 290, a motor 291, an indicator 292, a camera 293, a display 294, a SIM card interface 295, and the like. The sensor module 280 may include a gyroscope sensor 280A, an acceleration sensor 280B, a proximity light sensor 280G, a fingerprint sensor 280H, and a touch sensor 280K (of course, the mobile phone 200 may also include other sensors, such as a temperature sensor, a pressure sensor, a distance sensor, a magnetic sensor, an ambient light sensor, an air pressure sensor, a bone conduction sensor, and the like, which are not shown in the figure).
It is to be understood that the structure illustrated in the embodiment of the present application does not specifically limit the mobile phone 200. In other embodiments of the present application, the handset 200 may include more or fewer components than shown, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 210 may include one or more processing units, such as: the processor 210 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a Neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors. Wherein the controller can be the neural center and the command center of the cell phone 200. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 210 for storing instructions and data. In some embodiments, the memory in processor 210 is a cache memory. The memory may hold instructions or data that have just been used or recycled by processor 210. If the processor 210 needs to use the instruction or data again, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 210, thereby increasing the efficiency of the system.
The processor 210 may execute the voice detection method provided in the embodiments of the present application, detecting the rear endpoint based on semantics and detecting the front endpoint of the audio so as to accurately determine the speech segment of the audio. When the processor 210 includes different devices, for example an integrated CPU and GPU, the CPU and the GPU may cooperate to execute the voice detection method provided in the embodiments of the present application; for example, part of the algorithms in the voice detection method is executed by the CPU and another part by the GPU, to obtain faster processing efficiency.
Internal memory 221 may be used to store computer-executable program code, including instructions. The processor 210 executes various functional applications and data processing of the cellular phone 200 by executing instructions stored in the internal memory 221. The internal memory 221 may include a program storage area and a data storage area. The storage program area may store an operating system, codes of application programs (such as a camera application, a WeChat application, and the like), and the like. The data storage area can store data (such as images, videos and the like acquired by a camera application) and the like created in the use process of the mobile phone 200.
The internal memory 221 may further store one or more computer programs 1310 corresponding to the voice detection method provided by the embodiments of the present application. The one or more computer programs 1310 are stored in the memory 221 and configured to be executed by the one or more processors 210, and include instructions that may be used to perform the steps in the embodiments of fig. 2, fig. 4, and figs. 6 to 10. The computer programs 1310 may include a first determining module 1101, a second determining module 1102, and a third determining module 1103. The first determining module 1101 is configured to determine a second text symbol according to a first text symbol and a first audio signal in an audio signal sequence, where the initial value of the first text symbol is a null character and the second text symbol corresponds to the content of the first audio signal. The second determining module 1102 is configured to determine, according to the first text symbol, whether the semantics of the second text symbol reach a rear endpoint, where the rear endpoint represents the end of speech in the audio signal sequence. The third determining module 1103 is configured to, when the semantics of the second text symbol do not reach the rear endpoint, take the second text symbol as a new first text symbol, take the audio signal after the first audio signal in the audio signal sequence as a new first audio signal, and repeatedly perform the step of determining the second text symbol and the subsequent steps. When the code of the voice detection method stored in the internal memory 221 is executed by the processor 210, the processor 210 may control the display screen to display the prediction result of the target model.
In addition, the internal memory 221 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
Of course, the code of the voice detection method provided by the embodiment of the present application may also be stored in the external memory. In this case, the processor 210 may execute the code of the voice detection method stored in the external memory through the external memory interface 220.
The wireless communication function of the mobile phone 200 can be implemented by the antenna 1, the antenna 2, the mobile communication module 251, the wireless communication module 252, the modem processor, the baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the handset 200 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 251 can provide solutions for wireless communication, including 2G/3G/4G/5G, applied to the mobile phone 200. The mobile communication module 251 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 251 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem for demodulation. The mobile communication module 251 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves to radiate through the antenna 1. In some embodiments, at least some of the functional modules of the mobile communication module 251 may be disposed in the processor 210. In some embodiments, at least some of the functional modules of the mobile communication module 251 may be provided in the same device as at least some of the modules of the processor 210. In this embodiment, the mobile communication module 251 may be further configured to perform information interaction with other electronic devices to obtain a trained target detection model.
The wireless communication module 252 may provide solutions for wireless communication applied to the mobile phone 200, including Wireless Local Area Networks (WLANs) (e.g., Wireless Fidelity (Wi-Fi) networks), Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 252 may be one or more devices that integrate at least one communication processing module. The wireless communication module 252 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on the electromagnetic wave signals, and transmits the processed signals to the processor 210. The wireless communication module 252 may also receive a signal to be transmitted from the processor 210, frequency modulate it, amplify it, and convert it into electromagnetic waves via the antenna 2 to radiate it. In this embodiment, the wireless communication module 252 is configured to transmit data with other electronic devices under the control of the processor 210; for example, when the processor 210 executes the voice detection method provided in this embodiment, the processor may control the wireless communication module 252 to receive a trained target detection model sent by the electronic device.
In addition, the mobile phone 200 can implement audio functions, such as receiving audio input by a user and performing voice broadcasts, through the audio module 270, the speaker 270A, the receiver 270B, the microphone 270C, the earphone interface 270D, and the application processor. It should be understood that in practical applications the mobile phone 200 may include more or fewer components than those shown in fig. 12; the illustrated handset 200 is merely an example, and it may combine two or more components or adopt a different configuration of components. The various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application-specific integrated circuits.
FIG. 13 shows a block diagram of an electronic device according to an embodiment of the application. As shown in fig. 13, the electronic device 40 includes at least one processor 1801, at least one memory 1802, and at least one communication interface 1803. In addition, the electronic device may further include general components such as an antenna, which will not be described in detail herein.
Through the electronic device shown in the embodiment of the application, the model can be pre-trained as shown in fig. 2 and integrally trained as shown in fig. 4 to obtain a trained target detection model, and after training, the target detection model can be exported and deployed to the terminal device as shown in fig. 12 through a model format conversion tool.
The processor 1801 may be a general purpose Central Processing Unit (CPU), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to control the execution of programs according to the above schemes. The processor 1801 may include one or more processing units, such as: an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a Neural-Network Processing Unit (NPU), among others. The different processing units may be independent devices or may be integrated in one or more processors.
Communication interface 1803 may be adapted to communicate with other devices or a communication network, such as an Ethernet network, a Radio Access Network (RAN), a core network, a Wireless Local Area Network (WLAN), etc.
The memory 1802 may be, but is not limited to, a Read-Only Memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM) or another type of dynamic storage device that can store information and instructions, an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integrated with the processor.
The memory 1802 is used for storing application program codes for executing the above schemes, and the execution of the application program codes is controlled by the processor 1801. The processor 1801 is configured to execute application code stored in the memory 1802.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
An embodiment of the present application provides a voice detection apparatus, including: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to implement the above method when executing the instructions.
Embodiments of the present application provide a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
Embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code; when the computer readable code runs in a processor of an electronic device, the processor performs the above method.
The embodiment of the application provides a terminal device, and the terminal device can execute the method.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, or semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or an in-groove protrusion structure having instructions stored thereon, and any suitable combination of the foregoing.
The computer readable program instructions or code described herein may be downloaded from a computer readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present application may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the internet using an internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuits, Field-Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), can execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing aspects of the present application.
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
It is also noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by hardware (e.g., a Circuit or an ASIC) for performing the corresponding function or action, or by combinations of hardware and software, such as firmware.
While the invention has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The foregoing description of the embodiments of the present application has been presented for purposes of illustration and description and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method for speech detection, the method comprising:
determining a second text symbol according to a first text symbol and a first audio signal in an audio signal sequence, wherein the initial value of the first text symbol is a null character, and the second text symbol corresponds to the content of the first audio signal;
determining whether the semantics of the second text symbol reach a rear endpoint according to the first text symbol, wherein the rear endpoint represents the end of speech in the audio signal sequence;
and under the condition that the semantics of the second text symbol do not reach a rear endpoint, taking the second text symbol as a new first text symbol, taking the audio signal after the first audio signal in the audio signal sequence as a new first audio signal, and repeatedly executing the step of determining a second text symbol according to the first text symbol and the first audio signal in the audio signal sequence, and the following steps.
2. The method of claim 1, further comprising:
sequentially detecting whether audio frames contained in each audio signal in the audio signal sequence reach a front endpoint, wherein the front endpoint represents the beginning of voice in the audio signal sequence;
and when the first audio frame reaching the front endpoint is detected, determining the audio signal where the audio frame is located as the first audio signal, and stopping the detection.
3. The method of claim 2, wherein the method is used in a target detection model comprising a voice behavior detection module, an association network module, a semantic endpoint detection module, an encoder module, and a prediction network module,
the encoder module is used for obtaining a first feature vector of an audio frame contained in an audio signal sequence;
the voice behavior detection module is used for determining, according to the first feature vector, whether the audio frames contained in each audio signal in the audio signal sequence reach the front endpoint;
the prediction network module is used for obtaining a second feature vector of the first text symbol;
the association network module is used for obtaining the second text symbol according to the first feature vector and the second feature vector;
the semantic endpoint detection module is used for determining whether the semantics of the second text symbol reaches a rear endpoint according to the second feature vector.
4. The method of claim 3, further comprising:
pre-training the prediction network module and the semantic endpoint detection module;
and training a target detection model comprising the voice behavior detection module, the association network module, the encoder module, the pre-trained prediction network module, and the pre-trained semantic endpoint detection module to obtain a trained target detection model.
5. The method of claim 3, wherein pre-training the predictive network module and the semantic endpoint detection module comprises:
inputting a previous text symbol in a text sample into a prediction network module, and predicting by the prediction network module according to a feature vector of the previous text symbol to obtain a current text symbol;
inputting the current text symbol into a semantic endpoint detection module, and determining whether the semantics of the current text symbol reaches a rear endpoint;
calculating a first loss value of the prediction network module according to the label of the current text symbol, calculating a second loss value of the semantic endpoint detection module according to the label of whether the current text symbol reaches the rear endpoint, and performing parameter adjustment on the prediction network module and the semantic endpoint detection module according to the first loss value and the second loss value.
6. The method according to claim 3, wherein the voice behavior detection module stops operation once it determines, according to the first feature vector, that an audio frame contained in an audio signal in the audio signal sequence has reached the front endpoint.
7. A speech detection apparatus, characterized in that the apparatus comprises:
the device comprises a first determining module, a second determining module and a control module, wherein the first determining module is used for determining a second text symbol according to a first text symbol and a first audio signal in an audio signal sequence, the initial value of the first text symbol is a null character, and the second text symbol corresponds to the content of the first audio signal;
a second determining module, configured to determine, according to the first text symbol, whether the semantics of the second text symbol reach a rear endpoint, wherein the rear endpoint represents the end of speech in the audio signal sequence;
and a third determining module, configured to, in a case that the semantics of the second text symbol do not reach the rear endpoint, take the second text symbol as a new first text symbol, take the audio signal after the first audio signal in the audio signal sequence as a new first audio signal, and repeatedly perform the step of determining the second text symbol according to the first text symbol and the first audio signal in the audio signal sequence, and the subsequent steps.
8. A speech detection apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any of claims 1-6 when executing the instructions.
9. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1-6.
10. A computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in an electronic device, a processor in the electronic device performs the method of any of claims 1-6.
CN202110440811.3A 2021-04-23 2021-04-23 Voice detection method, device and storage medium Pending CN115240716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110440811.3A CN115240716A (en) 2021-04-23 2021-04-23 Voice detection method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110440811.3A CN115240716A (en) 2021-04-23 2021-04-23 Voice detection method, device and storage medium

Publications (1)

Publication Number Publication Date
CN115240716A true CN115240716A (en) 2022-10-25

Family

ID=83665992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110440811.3A Pending CN115240716A (en) 2021-04-23 2021-04-23 Voice detection method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115240716A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination