CN112634907B - Audio data processing method and device for voice recognition

Audio data processing method and device for voice recognition

Info

Publication number
CN112634907B
Authority
CN
China
Prior art keywords
audio
end point
endpoint
bit stream
network model
Prior art date
Legal status
Active
Application number
CN202011543521.3A
Other languages
Chinese (zh)
Other versions
CN112634907A (en)
Inventor
罗海斯·马尔斯
胡正倫
Current Assignee
Bigo Technology Pte Ltd
Original Assignee
Bigo Technology Pte Ltd
Priority date
Filing date
Publication date
Application filed by Bigo Technology Pte Ltd filed Critical Bigo Technology Pte Ltd
Priority to CN202011543521.3A priority Critical patent/CN112634907B/en
Publication of CN112634907A publication Critical patent/CN112634907A/en
Application granted granted Critical
Publication of CN112634907B publication Critical patent/CN112634907B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention discloses an audio data processing method, apparatus, device and storage medium for voice recognition. The method includes: when the audio bitstream stored in the buffer is greater than a target detection length, determining whether the audio bitstream is greater than a maximum detection length; determining, according to the result of determining whether the audio bitstream is greater than the maximum detection length, a corresponding endpoint detector to perform endpoint detection on the audio bitstream, where the endpoint detectors include a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model; and processing the audio bitstream according to the endpoint detection result to obtain audio samples for voice recognition. The scheme avoids the degradation of recognition performance caused by splitting continuous speech segments into different parts for voice recognition, so that speech regions are not damaged, and the efficiency and accuracy of voice recognition are significantly improved.

Description

Audio data processing method and device for voice recognition
Technical Field
The embodiment of the application relates to the field of computers, in particular to an audio data processing method and device for voice recognition.
Background
When audio is transmitted as a stream, the stream needs to be divided for speech recognition. Between a client and a server, audio is streamed mainly in one of two ways: frame by frame, or in blocks of audio data.
In frame-based transmission, a sequence of audio frames is sent from the audio sender to the audio receiver; the receiver decodes the frames and feeds them one by one into a speech recognizer for recognition. This mode does not allow batch processing, so speech recognition efficiency is low.
In block-based transmission, each block of audio data is transmitted between the client and the server as an audio segment. In the prior art, the audio stream is divided into fixed-length segments; as shown in FIG. 1a, a schematic diagram of audio data segmentation in the prior art, the stream is cut at fixed intervals of 20 s. This division is inflexible and may split continuous speech segments during speech recognition, reducing speech recognition efficiency.
Disclosure of Invention
The embodiment of the invention provides an audio data processing method, apparatus, device and storage medium for voice recognition, which avoid the degradation of recognition performance caused by splitting continuous speech segments into different parts for recognition, prevent speech regions from being damaged, and significantly improve the efficiency and accuracy of voice recognition.
In a first aspect, an embodiment of the present invention provides an audio data processing method for speech recognition, the method including:
determining, when the audio bitstream stored in the buffer is greater than a target detection length, whether the audio bitstream is greater than a maximum detection length;
determining, according to the result of determining whether the audio bitstream is greater than the maximum detection length, a corresponding endpoint detector to perform endpoint detection on the audio bitstream, wherein the endpoint detectors include a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model; and
processing the audio bitstream according to the endpoint detection result to obtain audio samples for voice recognition.
In a second aspect, an embodiment of the present invention further provides an audio data processing apparatus for speech recognition, the apparatus including:
an audio length detection module, configured to determine, when an audio bitstream stored in a buffer is greater than a target detection length, whether the audio bitstream is greater than a maximum detection length;
an audio endpoint detection module, configured to determine, according to a result of determining whether the audio bitstream is greater than a maximum detection length, that a corresponding endpoint detector performs endpoint detection on the audio bitstream, where the endpoint detector includes a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model;
and an audio processing module, configured to process the audio bitstream according to the endpoint detection result to obtain audio samples for voice recognition.
In a third aspect, an embodiment of the present invention further provides an audio data processing apparatus for speech recognition, the apparatus comprising:
one or more processors;
a storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the audio data processing method for speech recognition according to the embodiments of the present invention.
In a fourth aspect, embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the audio data processing method for speech recognition according to the embodiments of the present invention.
In the embodiment of the invention, when the audio bitstream stored in the buffer is greater than the target detection length, it is determined whether the audio bitstream is greater than the maximum detection length; according to the result, a corresponding endpoint detector is determined to perform endpoint detection on the audio bitstream, the endpoint detectors including a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model; and the audio bitstream is processed according to the endpoint detection result to obtain audio samples for voice recognition. The scheme avoids the degradation of recognition performance caused by splitting continuous speech segments into different parts for recognition, so that speech regions are not damaged, and the efficiency and accuracy of voice recognition are significantly improved.
Drawings
FIG. 1a is a schematic diagram of audio data segmentation in the prior art;
FIG. 1 is a flowchart of an audio data processing method according to an embodiment of the present invention;
FIG. 1b is a schematic diagram of speech information mixed with stationary noise provided by the present invention;
FIG. 1c is a schematic diagram of speech information mixed with non-stationary noise provided by the present invention;
FIG. 1d is a schematic diagram of endpoint detection on an audio bitstream according to an embodiment of the present invention;
FIG. 2 is a flowchart of another audio data processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another audio data processing method according to an embodiment of the present invention;
FIG. 3a is a schematic diagram of the marks produced when a first endpoint detector based on a long-term network model performs endpoint detection on audio data according to an embodiment of the present invention;
FIG. 4 is a flowchart of another audio data processing method according to an embodiment of the present invention;
FIG. 4a is a schematic diagram of audio information buffering according to an embodiment of the present invention;
FIG. 5 is a block diagram of an audio data processing device according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the drawings and examples. It should be understood that the particular embodiments described herein are illustrative only and are not limiting of embodiments of the invention. It should be further noted that, for convenience of description, only some, but not all of the structures related to the embodiments of the present invention are shown in the drawings.
Fig. 1 is a flowchart of an audio data processing method according to an embodiment of the present invention, where the method may be applied to a speech recognition process, and the method may be performed by a computing device, such as a mobile phone, a notebook, an iPad, a server, a desktop, and the like, and specifically includes the following steps:
Step S101, when the audio bitstream stored in the buffer is greater than the target detection length, determining whether the audio bitstream is greater than the maximum detection length.
In one embodiment, the audio data is encoded by the transmitting end into an audio bitstream and then transmitted to the receiving end; the receiving end stores the received audio bitstream in a buffer so that batch processing can be performed for speech recognition to obtain the corresponding speech content. In the speech recognition process, a target detection length and a maximum detection length are set in order to keep recognition close to real time while avoiding the low recognition accuracy caused by recognizing an audio bitstream that is too short. By way of example, the target detection length may be 20 seconds of the audio bitstream and the maximum detection length may be 30 seconds.
After the received audio bitstream is stored in the buffer, its length is detected. When the length exceeds the target detection length, it is further judged whether it exceeds the maximum detection length, so that the received audio bitstream can be segmented accordingly.
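By way of illustration only, the two length checks of step S101 might look like the following minimal Python sketch; the 16 kHz sample rate, the example 20 s / 30 s thresholds, and all identifiers are assumptions, not values specified by the embodiment.

```python
# Sketch of the buffer-length checks of step S101; the sample rate and the
# 20 s / 30 s thresholds are illustrative assumptions only.
SAMPLE_RATE = 16_000                       # assumed sampling rate (Hz)
TARGET_DETECTION_LEN = 20 * SAMPLE_RATE    # "target detection length"
MAX_DETECTION_LEN = 30 * SAMPLE_RATE       # "maximum detection length"

def should_run_detection(buffered_samples: int) -> bool:
    """Endpoint detection is triggered only once the buffered audio
    bitstream exceeds the target detection length."""
    return buffered_samples > TARGET_DETECTION_LEN

def exceeds_max_length(buffered_samples: int) -> bool:
    """Selects the detector branch used in step S102."""
    return buffered_samples > MAX_DETECTION_LEN
```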
Step S102, determining, according to the result of whether the audio bitstream is greater than the maximum detection length, a corresponding endpoint detector to perform endpoint detection on the audio bitstream.
In one embodiment, three endpoint detectors are provided to detect endpoints of the audio bitstream so that it can be segmented for speech recognition. To improve recognition efficiency, the audio information needs to be segmented reasonably. The conventional approach segments the audio bitstream at a fixed time interval such as 20 s, which divides continuous speech frames into different segments and leads to poor recognition accuracy and low performance.
This scheme does not segment at fixed time intervals; instead, it cascades three endpoint detectors: a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model. The silence descriptor detector detects silence descriptors (SID values) in the audio bitstream. Specifically, in some cases the sender of the audio bitstream marks non-speech information, i.e., silence: for example, it adds a silence descriptor indicating that subsequent audio frames are non-speech frames. SID = 0 indicates that the subsequent frames are non-speech, i.e., the endpoint of the current speech run; when the sender starts transmitting speech frames again, it adds a silence descriptor with SID = 1 to indicate that the subsequent frames are speech. In some cases, however, such as non-stationary noise scenarios, the accuracy of this statistical approach drops significantly. FIG. 1b is a schematic diagram of speech information mixed with stationary noise provided by the present invention; as can be seen from FIG. 1b, speech frames and non-speech frames are easy to distinguish. FIG. 1c is a schematic diagram of speech information mixed with non-stationary noise provided by the present invention; as can be seen from FIG. 1c, it is difficult to distinguish speech frames from non-speech frames with a conventional statistical method.
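The SID-based branch can be pictured with the following minimal sketch; the frame layout and field names are assumptions for illustration, since real codecs encode silence descriptors in codec-specific ways.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Frame:
    """One encoded audio frame; sid is None when no descriptor is attached."""
    payload: bytes
    sid: Optional[int] = None

def find_sid_endpoint(frames: Sequence[Frame]) -> Optional[int]:
    """Return the index of the first frame whose silence descriptor marks
    the start of non-speech (SID = 0), i.e. the endpoint of the current
    speech run; None lets the caller fall back to the network-model
    detectors."""
    for i, frame in enumerate(frames):
        if frame.sid == 0:
            return i
    return None
```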
Correspondingly, the scheme also provides a first endpoint detector based on a long-term network model and a second endpoint detector based on a short-term network model. The short-term network model consists of a convolutional layer and a fully connected layer; the long-term network model consists of the convolutional layer output of the second endpoint detector, a recurrent layer, and a fully connected layer.
The first endpoint detector based on the long-term network model effectively distinguishes speech from high-energy non-speech; it has strong association and memory characteristics and high detection accuracy. The second endpoint detector based on the short-term network model has weaker association and memory characteristics than the first endpoint detector and moderate detection accuracy, which makes it better suited to distinguishing speech frames from non-speech frames in longer audio.
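The embodiment does not give layer sizes, so the following PyTorch sketch only mirrors the stated topology (short-term model: a convolutional layer plus a fully connected layer; long-term model: the second detector's convolutional output feeding a recurrent layer and a fully connected layer); every dimension, and the choice of a GRU for the recurrent layer, are assumptions.

```python
import torch
import torch.nn as nn

class ShortTermVAD(nn.Module):
    """Second endpoint detector: a convolutional layer and a fully
    connected layer (all dimensions assumed)."""
    def __init__(self, n_feats: int = 40, hidden: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(n_feats, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_feats, frames) -> per-frame speech logits
        h = torch.relu(self.conv(x))
        return self.fc(h.transpose(1, 2)).squeeze(-1)

class LongTermVAD(nn.Module):
    """First endpoint detector: reuses the short-term model's convolutional
    output, adds a recurrent layer for long-range memory, then a fully
    connected layer."""
    def __init__(self, short_term: ShortTermVAD, hidden: int = 64):
        super().__init__()
        self.conv = short_term.conv        # shared convolutional front end
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.conv(x)).transpose(1, 2)  # (batch, frames, hidden)
        h, _ = self.rnn(h)
        return self.fc(h).squeeze(-1)
```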
In one embodiment, the endpoint detector corresponding to the audio bitstream is dynamically selected from these detectors based on the result of determining whether the audio bitstream is greater than the maximum detection length.
Step S103, processing the audio bitstream according to the endpoint detection result to obtain audio samples for voice recognition.
In one embodiment, endpoint detection is performed by the silence descriptor detector, the first endpoint detector based on the long-term network model, or the second endpoint detector based on the short-term network model, selected according to the audio bitstream length, to distinguish the speech information from the non-speech information in the audio bitstream. As shown in FIG. 1d, a schematic diagram of endpoint detection on an audio bitstream according to an embodiment of the present invention, the detected endpoints are endpoint 1 and endpoint 2; the speech frame information between endpoint 1 and endpoint 2 is then cut out to obtain the audio sample to be recognized.
In another embodiment, if no endpoint is detected in the current audio bitstream, i.e., the information corresponding to the current audio bitstream is entirely speech, the whole bitstream is buffered as an audio sample for subsequent speech recognition.
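Both outcomes of step S103 (cutting between two detected endpoints, or buffering the whole stream when none is found) can be summarized in a short sketch; treating endpoints as frame indices, and the helper name itself, are illustrative assumptions.

```python
from typing import List, Optional, Tuple
import numpy as np

def extract_sample(audio: np.ndarray, endpoints: List[int],
                   frame_len: int) -> Tuple[Optional[np.ndarray], np.ndarray]:
    """Cut the speech region between the detected endpoints (FIG. 1d);
    with no endpoints, keep everything buffered as one speech sample."""
    if not endpoints:
        return None, audio                                  # all speech: buffer whole stream
    start, end = endpoints[0], endpoints[-1]
    sample = audio[start * frame_len:end * frame_len]       # speech between endpoints
    remainder = audio[end * frame_len:]                     # carried into the buffer
    return sample, remainder
```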
In this scheme, according to the result of determining whether the audio bitstream is greater than the maximum detection length, a corresponding endpoint detector is selected to perform endpoint detection on the audio bitstream, the detectors including a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model; the audio bitstream is then processed according to the endpoint detection result to obtain audio samples for voice recognition. This improves the flexibility of audio segmentation and significantly improves the efficiency and accuracy of voice recognition.
FIG. 2 is a flowchart of another audio data processing method according to an embodiment of the present invention, which details how the endpoint detector is determined, according to the result of whether the audio bitstream is greater than the maximum detection length, to perform endpoint detection on the audio bitstream. As shown in FIG. 2, the technical scheme is as follows:
Step S201, when the audio bitstream stored in the buffer is greater than the target detection length, determining whether the audio bitstream is greater than the maximum detection length.
Step S202, if the audio bitstream is not greater than the maximum detection length, performing endpoint detection on the audio bitstream by a silence descriptor detector and a first endpoint detector based on a long-term network model.
In one embodiment, when it is determined that the audio bitstream is not greater than the maximum detection length, endpoint detection is performed using the silence descriptor detector and the first endpoint detector based on the long-term network model, both of which offer good detection performance.
Specifically, it is first detected whether a silence identifier exists in the audio bitstream. If it exists, the endpoint is determined according to the indication of the silence identifier: if SID = 0 is detected, the audio bitstream following the silence identifier is determined to be non-speech frames, and segmentation is performed there.
If no silence identifier exists, or the detected SID value is 1, the audio bitstream is decoded, the decoded speech data is input to the first endpoint detector based on the long-term network model, and endpoint detection is performed by that detector. If a non-speech endpoint is detected, the audio information is segmented accordingly to obtain an audio sample for speech recognition; if no non-speech endpoint is detected, the audio information is buffered in its entirety as an audio sample.
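A sketch of this cascade for the "not greater than the maximum detection length" branch is given below, reusing Frame and find_sid_endpoint from the earlier SID sketch; the decode and labeling callables are hypothetical stand-ins for the codec decoder and the long-term-model detector.

```python
from typing import Callable, Optional, Sequence
import numpy as np

def detect_endpoint_short_stream(
    frames: Sequence["Frame"],
    decode: Callable[[Sequence["Frame"]], np.ndarray],
    speech_labels: Callable[[np.ndarray], np.ndarray],  # 1 = speech, 0 = non-speech
) -> Optional[int]:
    """Cascade for bitstreams not exceeding the maximum detection length:
    cheap SID check first, then the long-term-model detector on decoded audio."""
    idx = find_sid_endpoint(frames)           # 1) silence descriptor branch
    if idx is not None:
        return idx
    labels = speech_labels(decode(frames))    # 2) decode, then model-based detection
    nonspeech = np.flatnonzero(labels == 0)
    return int(nonspeech[0]) if nonspeech.size else None   # None: buffer as speech
```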
Step S203, processing the audio bitstream according to the endpoint detection result to obtain audio samples for voice recognition.
In this scheme, if the audio bitstream is not greater than the maximum detection length, the silence descriptor detector and the first endpoint detector based on the long-term network model perform endpoint detection on the audio bitstream to reliably identify non-speech frames, and the audio information is segmented at those frames. This avoids the loss of recognition accuracy and efficiency caused by splitting continuous speech information into different parts for recognition.
Fig. 3 is a flowchart of another audio data processing method according to an embodiment of the present invention, and shows a specific method for endpoint detection of the audio bitstream by a silence descriptor detector and a first endpoint detector based on a long-term network model. As shown in fig. 3, the technical scheme is as follows:
Step S301, detecting the length of the audio bitstream stored in the buffer.
Step S302, judging whether the length of the audio bitstream is greater than the target detection length; if so, executing step S303; otherwise, returning to step S301.
Step S303, judging whether the audio bitstream is greater than the maximum detection length; if not, executing step S304.
Step S304, detecting silence descriptors in the audio bitstream.
Step S305, determining whether a silence descriptor is detected, if yes, executing step S306, otherwise executing step S307.
Step S306, determining the endpoint of the audio bit stream according to the position of the silence descriptor.
Step S307, performing endpoint detection on the audio bitstream by using a first endpoint detector based on the long-term network model, and determining the detected endpoint as an endpoint of the audio bitstream.
In one embodiment, the audio bitstream is decoded and input to the first endpoint detector based on the long-term network model, which marks groups of audio frames of a preset length; endpoints are then determined from the frame-group marks, where a speech frame group is marked 1 and a non-speech frame group is marked 0. As shown in FIG. 3a, a schematic diagram of the marks produced when the first endpoint detector based on the long-term network model performs endpoint detection on audio data according to an embodiment of the present invention, the first region is a non-speech frame region and the second region is a speech frame region; the audio information is therefore segmented at endpoint A to obtain the audio sample (the second region) for subsequent speech recognition.
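The mark-to-endpoint step can be written compactly; interpreting an endpoint as the boundary where the per-group mark flips is an assumption of this sketch. For the marks in FIG. 3a (a non-speech region followed by a speech region) this yields a single boundary, endpoint A.

```python
from typing import List
import numpy as np

def endpoints_from_marks(marks: np.ndarray) -> List[int]:
    """Given per-frame-group marks (1 = speech, 0 = non-speech) from the
    long-term-model detector, return every boundary index where the mark
    changes; an empty list means no endpoint was detected."""
    return [int(i) + 1 for i in np.flatnonzero(np.diff(marks) != 0)]
```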
Step S308, processing the audio bitstream according to the endpoint detection result to obtain audio samples for voice recognition.
In this scheme, the first endpoint detector based on the long-term network model performs endpoint detection on the audio bitstream and effectively distinguishes speech information from non-speech information, so that the corresponding segmentation can be performed. This avoids recognition errors caused by splitting speech information and improves the accuracy of speech recognition.
Fig. 4 is a flowchart of another audio data processing method for speech recognition according to an embodiment of the present invention, and a specific processing method for an audio bitstream with a length greater than a maximum detection length is provided. As shown in fig. 4, the technical scheme is as follows:
Step S401, detecting the length of the audio bitstream stored in the buffer.
Step S402, judging whether the length of the audio bitstream is greater than the target detection length; if so, executing step S403; otherwise, returning to step S401.
Step S403, judging whether the audio bitstream is greater than the maximum detection length; if not, executing step S404; if so, executing step S411.
Step S404, detecting silence descriptors in the audio bitstream.
Step S405, determining whether a silence descriptor is detected, if yes, executing step S406, otherwise executing step S407.
Step S406, determining the endpoint of the audio bit stream according to the position of the silence descriptor, and performing audio segmentation to obtain an audio sample.
Step S407, decoding the audio bitstream, and performing endpoint detection on the decoded audio information using the first endpoint detector based on the long-term network model.
Step S408, determining whether an endpoint is detected, if yes, executing step S409, otherwise executing step S410.
Step S409, determining the detected endpoint as the endpoint of the audio bitstream, and performing audio segmentation to obtain audio samples.
Step S410, buffering and storing the audio information.
Step S411, performing endpoint detection on the audio bitstream by a second endpoint detector based on a short-term network model.
In one embodiment, when it is determined that the audio bitstream is greater than the maximum detection length, endpoint detection is performed on the audio bitstream using the second endpoint detector based on the short-term network model. The characteristics of this network model improve the accuracy and efficiency of endpoint detection.
Step S412, determining whether an endpoint is detected, if so, executing step S409, otherwise executing step S410.
In this scheme, the received audio bitstream is segmented using a cascade of endpoint detectors, with the appropriate detector selected according to the length of the input audio bitstream: if the audio bitstream is not greater than the maximum detection length, endpoint detection is performed by the silence descriptor detector and the first endpoint detector based on the long-term network model; if it is greater than the maximum detection length, endpoint detection is performed by the second endpoint detector based on the short-term network model, and the detected endpoint is determined as the endpoint of the audio bitstream. This achieves reasonable audio segmentation for generating audio samples and improves the accuracy of subsequent speech recognition.
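Putting the branches together, a hypothetical top-level dispatch mirroring steps S401 to S412 might look as follows, reusing find_sid_endpoint and endpoints_from_marks from the earlier sketches; returning None stands for step S410 (decode and keep buffering), and measuring lengths in frames is an assumption.

```python
from typing import Callable, Optional, Sequence
import numpy as np

def process_buffered_stream(
    frames: Sequence["Frame"],
    decode: Callable[[Sequence["Frame"]], np.ndarray],
    long_term_labels: Callable[[np.ndarray], np.ndarray],
    short_term_labels: Callable[[np.ndarray], np.ndarray],
    target_len: int,
    max_len: int,
) -> Optional[int]:
    """Select the detector cascade from the buffered length and return an
    endpoint index, or None when the whole stream should stay buffered."""
    if len(frames) <= target_len:
        return None                                   # keep accumulating (S401/S402)
    if len(frames) <= max_len:                        # S404-S407: SID, then long-term
        idx = find_sid_endpoint(frames)
        if idx is not None:
            return idx
        labels = long_term_labels(decode(frames))
    else:                                             # S411: short-term model
        labels = short_term_labels(decode(frames))
    eps = endpoints_from_marks(labels)                # S408/S412: endpoint found?
    return eps[0] if eps else None                    # None -> buffer (S410)
```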
On the basis of the above technical scheme, if no endpoint is detected in the audio information decoded from the current audio bitstream, the audio is not segmented but buffered. FIG. 4a is a schematic diagram of audio information buffering provided by an embodiment of the present invention. As shown in FIG. 4a, audio information x is the decoded information corresponding to one input audio bitstream and audio information y is the decoded information corresponding to another. If no endpoint is detected in either, both are speech information; after audio information x and audio information y are buffered in sequence, they are input into the speech recognizer for speech recognition. This processing ensures that continuous speech information is not split and improves speech recognition performance.
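A small buffer abstraction illustrates this behavior; the class name and interface are assumptions for illustration.

```python
from typing import List
import numpy as np

class SpeechBuffer:
    """Accumulates decoded segments in which no endpoint was found (pure
    speech, like audio x and audio y in FIG. 4a) and hands them to the
    recognizer as one uninterrupted sample."""
    def __init__(self) -> None:
        self._chunks: List[np.ndarray] = []

    def append(self, pcm: np.ndarray) -> None:
        self._chunks.append(pcm)

    def flush(self) -> np.ndarray:
        """Concatenate everything buffered so far and reset the buffer."""
        sample = (np.concatenate(self._chunks)
                  if self._chunks else np.empty(0, dtype=np.float32))
        self._chunks.clear()
        return sample    # input to the speech recognizer
```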
FIG. 5 is a block diagram of an audio data processing device according to an embodiment of the present invention. The device is configured to execute the audio data processing method for speech recognition of the foregoing embodiments and has the corresponding functional modules and beneficial effects. As shown in FIG. 5, the device specifically includes an audio length detection module 101, an audio endpoint detection module 102, and an audio processing module 103, wherein:
An audio length detection module 101, configured to determine, when an audio bit stream stored in a buffer is greater than a target detection length, whether the audio bit stream is greater than a maximum detection length;
An audio endpoint detection module 102, configured to determine, according to a result of determining whether the audio bitstream is greater than a maximum detection length, that a corresponding endpoint detector performs endpoint detection on the audio bitstream, where the endpoint detector includes a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model;
and an audio processing module 103, configured to process the audio bitstream according to the endpoint detection result to obtain audio samples for speech recognition.
As can be seen from the above scheme, when the audio bitstream stored in the buffer is greater than the target detection length, it is determined whether the audio bitstream is greater than the maximum detection length; according to the result, a corresponding endpoint detector is determined to perform endpoint detection on the audio bitstream, the endpoint detectors including a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model; and the audio bitstream is processed according to the endpoint detection result to obtain audio samples for voice recognition. The scheme avoids the degradation of recognition performance caused by splitting continuous speech segments into different parts for recognition, so that speech regions are not damaged, and the efficiency and accuracy of voice recognition are significantly improved.
In one possible embodiment, the audio endpoint detection module 102 is specifically configured to:
if the audio bitstream is not greater than the maximum detection length, performing endpoint detection on the audio bitstream by the silence descriptor detector and the first endpoint detector based on the long-term network model.
In one possible embodiment, the audio endpoint detection module 102 is specifically configured to:
Silence descriptors in the audio bitstream are detected, and if the silence descriptors are detected, endpoints of the audio bitstream are determined according to the positions of the silence descriptors.
In one possible embodiment, the audio endpoint detection module 102 is specifically configured to:
If the silence descriptor is not detected, a first endpoint detector based on a long-term network model performs endpoint detection on the audio bitstream, determining the detected endpoint as an endpoint of the audio bitstream.
In one possible embodiment, the long-term network model consists of the convolutional layer output of the second endpoint detector, a recurrent layer, and a fully connected layer.
In one possible embodiment, the audio endpoint detection module 102 is specifically configured to:
if the audio bitstream is greater than the maximum detection length, performing endpoint detection on the audio bitstream by the second endpoint detector based on the short-term network model, and determining the detected endpoint as the endpoint of the audio bitstream.
In one possible embodiment, the short-term network model consists of a convolutional layer and a fully-connected layer.
In one possible embodiment, the audio processing module 103 is specifically configured to:
If an endpoint is detected in the audio bitstream, segmentation and decoding processes are performed according to the location of the endpoint, generating a plurality of audio samples for speech recognition.
In one possible embodiment, the audio processing module 103 is specifically configured to:
If no endpoint is detected in the audio bitstream, decoding and buffering the audio bitstream.
FIG. 6 is a schematic structural diagram of an audio data processing device according to an embodiment of the present invention. As shown in FIG. 6, the device includes a processor 201, a memory 202, an input device 203, and an output device 204. The number of processors 201 in the device may be one or more; one processor 201 is taken as an example in FIG. 6. The processor 201, the memory 202, the input device 203, and the output device 204 in the device may be connected by a bus or other means; connection by a bus is taken as an example in FIG. 6. The memory 202 is a computer-readable storage medium that may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the audio data processing method for speech recognition in the embodiments of the present invention. The processor 201 executes the various functional applications and data processing of the device, i.e., implements the above audio data processing method for speech recognition, by running the software programs, instructions, and modules stored in the memory 202. The input device 203 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output device 204 may include a display device such as a display screen.
Embodiments of the present invention also provide a storage medium containing computer executable instructions, which when executed by a computer processor, are for performing an audio data processing method for speech recognition, the method comprising:
determining, when the audio bitstream stored in the buffer is greater than a target detection length, whether the audio bitstream is greater than a maximum detection length;
determining, according to the result of determining whether the audio bitstream is greater than the maximum detection length, a corresponding endpoint detector to perform endpoint detection on the audio bitstream, wherein the endpoint detectors include a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model; and
processing the audio bitstream according to the endpoint detection result to obtain audio samples for voice recognition.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments of the present invention may be implemented by software and the necessary general-purpose hardware, and of course by hardware alone, although in many cases the former is preferred. Based on such understanding, the technical solution of the embodiments of the present invention may be embodied, essentially or in the part contributing to the prior art, in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH memory, a hard disk, or an optical disk, and which includes a number of instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods according to the embodiments of the present invention.
It should be noted that, in the above-described embodiment of the audio data processing apparatus for voice recognition, each unit and module included are merely divided according to the functional logic, but are not limited to the above-described division, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the embodiments of the present invention.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the embodiments of the present invention are not limited to the particular embodiments described herein, but are capable of numerous obvious changes, rearrangements and substitutions without departing from the scope of the embodiments of the present invention. Therefore, while the embodiments of the present invention have been described in connection with the above embodiments, the embodiments of the present invention are not limited to the above embodiments, but may include many other equivalent embodiments without departing from the spirit of the embodiments of the present invention, and the scope of the embodiments of the present invention is determined by the scope of the appended claims.

Claims (10)

1. An audio data processing method for speech recognition, comprising:
when the audio bitstream stored in the buffer is greater than a target detection length, determining whether the audio bitstream is greater than a maximum detection length;
determining, according to the result of determining whether the audio bitstream is greater than the maximum detection length, a corresponding endpoint detector to perform endpoint detection on the audio bitstream, comprising: if the audio bitstream is not greater than the maximum detection length, detecting a silence descriptor in the audio bitstream, and if the silence descriptor is detected, determining the endpoint of the audio bitstream according to the position of the silence descriptor, wherein the endpoint detectors comprise a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model; and
processing the audio bitstream according to the endpoint detection result to obtain audio samples for voice recognition.
2. The audio data processing method according to claim 1, wherein if the silence descriptor is not detected, a first endpoint detector based on a long-term network model performs endpoint detection on the audio bitstream, and the detected endpoint is determined as an endpoint of the audio bitstream.
3. The audio data processing method of claim 2, wherein the long-term network model consists of the convolutional layer output of the second endpoint detector, a recurrent layer, and a fully connected layer.
4. The audio data processing method according to claim 1, wherein if the audio bitstream is greater than the maximum detection length, endpoint detection is performed on the audio bitstream by the second endpoint detector based on the short-term network model, and the detected endpoint is determined as the endpoint of the audio bitstream.
5. The audio data processing method of claim 4, wherein the short-term network model consists of a convolution layer and a full connection layer.
6. The method according to any one of claims 1-5, wherein processing the audio bitstream according to the endpoint detection result to obtain audio samples for speech recognition comprises:
If an endpoint is detected in the audio bitstream, segmentation and decoding processes are performed according to the location of the endpoint, generating a plurality of audio samples for speech recognition.
7. The audio data processing method according to claim 6, wherein if no endpoint is detected in the audio bitstream, the audio bitstream is decoded and buffered.
8. An audio data processing device for speech recognition, comprising:
an audio length detection module, configured to determine, when an audio bitstream stored in a buffer is greater than a target detection length, whether the audio bitstream is greater than a maximum detection length;
an audio endpoint detection module, configured to determine, according to the result of determining whether the audio bitstream is greater than the maximum detection length, a corresponding endpoint detector to perform endpoint detection on the audio bitstream, comprising: if the audio bitstream is not greater than the maximum detection length, detecting a silence descriptor in the audio bitstream, and if the silence descriptor is detected, determining the endpoint of the audio bitstream according to the position of the silence descriptor, wherein the endpoint detectors comprise a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model; and
an audio processing module, configured to process the audio bitstream according to the endpoint detection result to obtain audio samples for voice recognition.
9. An audio data processing device for speech recognition, the device comprising: one or more processors; storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the audio data processing method for speech recognition as claimed in any one of claims 1 to 7.
10. A storage medium containing computer executable instructions for performing the audio data processing method for speech recognition according to any one of claims 1-7 when executed by a computer processor.
CN202011543521.3A 2020-12-24 2020-12-24 Audio data processing method and device for voice recognition Active CN112634907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011543521.3A CN112634907B (en) 2020-12-24 2020-12-24 Audio data processing method and device for voice recognition

Publications (2)

Publication Number Publication Date
CN112634907A CN112634907A (en) 2021-04-09
CN112634907B (en) 2024-05-17

Family

ID=75322123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011543521.3A Active CN112634907B (en) 2020-12-24 2020-12-24 Audio data processing method and device for voice recognition

Country Status (1)

Country Link
CN (1) CN112634907B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10339918B2 (en) * 2016-09-27 2019-07-02 Intel IP Corporation Adaptive speech endpoint detector

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1522989A1 (en) * 2003-10-08 2005-04-13 Agfa Inc. System and method for synchronized text display and audio playback
US9437186B1 (en) * 2013-06-19 2016-09-06 Amazon Technologies, Inc. Enhanced endpoint detection for speech recognition
CN110459228A (en) * 2013-06-19 2019-11-15 杜比实验室特许公司 Audio treatment unit and method for being decoded to coded audio bitstream
CN106782508A (en) * 2016-12-20 2017-05-31 美的集团股份有限公司 The cutting method of speech audio and the cutting device of speech audio
CN107331386A (en) * 2017-06-26 2017-11-07 上海智臻智能网络科技股份有限公司 End-point detecting method, device, processing system and the computer equipment of audio signal
CN109767792A (en) * 2019-03-18 2019-05-17 百度国际科技(深圳)有限公司 Sound end detecting method, device, terminal and storage medium
CN110232933A (en) * 2019-06-03 2019-09-13 Oppo广东移动通信有限公司 Audio-frequency detection, device, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Endpoint detection based on MDL using subband speech satisfied auditory model; Zhang Wenjun; International Conference on Neural Networks and Signal Processing, 2003. Proceedings of the 2003; full text *
Research on speech endpoint detection methods based on multi-feature fusion; Zhu Chunli; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN112634907A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
US20200035241A1 (en) Method, device and computer storage medium for speech interaction
US9311932B2 (en) Adaptive pause detection in speech recognition
CN108874904B (en) Voice message searching method and device, computer equipment and storage medium
CN112115706A (en) Text processing method and device, electronic equipment and medium
CN107943834B (en) Method, device, equipment and storage medium for implementing man-machine conversation
US11783808B2 (en) Audio content recognition method and apparatus, and device and computer-readable medium
WO2022116487A1 (en) Voice processing method and apparatus based on generative adversarial network, device, and medium
US8868419B2 (en) Generalizing text content summary from speech content
CN113724709A (en) Text content matching method and device, electronic equipment and storage medium
CN103514882A (en) Voice identification method and system
CN106776039A (en) A kind of data processing method and device
WO2021212985A1 (en) Method and apparatus for training acoustic network model, and electronic device
WO2024099359A1 (en) Voice detection method and apparatus, electronic device and storage medium
CN112634907B (en) Audio data processing method and device for voice recognition
CN113658581B (en) Acoustic model training method, acoustic model processing method, acoustic model training device, acoustic model processing equipment and storage medium
CN111354365B (en) Pure voice data sampling rate identification method, device and system
CN110189763B (en) Sound wave configuration method and device and terminal equipment
CN114429629A (en) Image processing method and device, readable storage medium and electronic equipment
CN113889086A (en) Training method of voice recognition model, voice recognition method and related device
CN111400563A (en) Pattern matching method and device for pattern matching
CN114781408B (en) Training method and device for simultaneous translation model and electronic equipment
CN113643696B (en) Voice processing method, device, equipment, storage medium and program
CN113674755B (en) Voice processing method, device, electronic equipment and medium
CN115035914A (en) Video distribution method and device, readable medium and electronic equipment
CN113990304A (en) Voice activity detection method and device, computer readable storage medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant