CN112634907A - Audio data processing method and device for voice recognition - Google Patents

Audio data processing method and device for voice recognition

Info

Publication number
CN112634907A
CN112634907A
Authority
CN
China
Prior art keywords
audio
endpoint
bit stream
data processing
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011543521.3A
Other languages
Chinese (zh)
Inventor
罗海斯·马尔斯
胡正倫
Current Assignee
Bigo Technology Singapore Pte Ltd
Original Assignee
Bigo Technology Singapore Pte Ltd
Priority date
Filing date
Publication date
Application filed by Bigo Technology Singapore Pte Ltd
Priority to CN202011543521.3A
Publication of CN112634907A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Abstract

The embodiments of the invention disclose an audio data processing method, apparatus, device, and storage medium for speech recognition. The method comprises the following steps: when the audio bitstream stored in a buffer is greater than a target detection length, determining whether the audio bitstream is greater than a maximum detection length; selecting a corresponding endpoint detector to perform endpoint detection on the audio bitstream according to the result of that determination, the endpoint detectors comprising a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model; and processing the audio bitstream according to the endpoint detection result to obtain audio samples for speech recognition. The scheme avoids the degraded recognition performance caused by splitting a continuous speech segment into different parts for recognition, so that speech regions are not damaged and the efficiency and accuracy of speech recognition are markedly improved.

Description

Audio data processing method and device for voice recognition
Technical Field
The embodiments of the present application relate to the field of computer technology, and in particular to an audio data processing method and apparatus for speech recognition.
Background
When audio is transmitted as an audio stream, the stream needs to be segmented for speech recognition. Between a client and a server, an audio stream is transmitted mainly in one of two modes: as individual audio frames or as blocks of audio data.
In the audio-frame transmission mode, a sequence of audio frames is sent from the audio sender to the audio receiver, where it is decoded and fed to the speech recognizer frame by frame. Batch processing is impossible in this mode, so speech recognition is inefficient.
In the audio-data-block transmission mode, each block of audio data is transmitted between the client and the server as an audio segment. In the prior art, the audio stream is divided into audio segments of fixed length: as shown in fig. 1a, a schematic diagram of prior-art audio data segmentation, the stream is cut every 20 s. This division is inflexible and may split continuous speech segments during recognition, reducing speech recognition efficiency.
Disclosure of Invention
The embodiments of the invention provide an audio data processing method, apparatus, device, and storage medium for speech recognition that solve the problem of degraded recognition performance caused by continuous speech segments being split into different parts for recognition, keep speech regions intact, and markedly improve the efficiency and accuracy of speech recognition.
In a first aspect, an embodiment of the present invention provides an audio data processing method for speech recognition, where the method includes:
when the audio bit stream stored in the buffer is larger than the target detection length, determining whether the audio bit stream is larger than the maximum detection length;
determining a corresponding endpoint detector to perform endpoint detection on the audio bit stream according to a result of determining whether the audio bit stream is greater than a maximum detection length, wherein the endpoint detector comprises a silence descriptor detector, a first endpoint detector based on a long-term network model and a second endpoint detector based on a short-term network model;
and processing the audio bit stream according to the result of the endpoint detection to obtain an audio sample for voice recognition.
In a second aspect, an embodiment of the present invention further provides an audio data processing apparatus for speech recognition, where the apparatus includes:
an audio length detection module, configured to determine, when the audio bitstream stored in the buffer is greater than the target detection length, whether the audio bitstream is greater than the maximum detection length;
an audio endpoint detection module, configured to determine, according to a result of determining whether the audio bitstream is greater than a maximum detection length, that a corresponding endpoint detector performs endpoint detection on the audio bitstream, where the endpoint detector includes a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model;
and the audio processing module is used for processing the audio bit stream according to the result of the endpoint detection to obtain an audio sample for voice recognition.
In a third aspect, an embodiment of the present invention further provides an audio data processing apparatus for speech recognition, where the apparatus includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the audio data processing method for speech recognition according to the embodiment of the present invention.
In a fourth aspect, the present invention also provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are used for executing the audio data processing method for speech recognition according to the present invention.
In the embodiment of the invention, when the audio bitstream stored in the buffer is greater than the target detection length, it is determined whether the bitstream is greater than the maximum detection length, and according to the result the corresponding endpoint detector is selected to perform endpoint detection on the audio bitstream, the endpoint detectors comprising a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model; the audio bitstream is then processed according to the endpoint detection result to obtain audio samples for speech recognition. The scheme avoids the degraded recognition performance caused by splitting a continuous speech segment into different parts for recognition, so that speech regions are not damaged and the efficiency and accuracy of speech recognition are markedly improved.
Drawings
FIG. 1a is a schematic diagram of audio data segmentation in the prior art;
FIG. 1 is a flowchart of an audio data processing method according to an embodiment of the present invention;
FIG. 1b is a schematic diagram of the mixing of speech information and fixed noise information provided by the present invention;
FIG. 1c is a schematic diagram of the mixing of speech information and non-stationary noise provided by the present invention;
FIG. 1d is a schematic diagram of endpoint detection performed on an audio bitstream according to an embodiment of the present invention;
FIG. 2 is a flowchart of another audio data processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another audio data processing method according to an embodiment of the present invention;
FIG. 3a is a schematic diagram of the marking produced when a first endpoint detector based on a long-term network model performs endpoint detection on audio data according to an embodiment of the present invention;
FIG. 4 is a flowchart of another audio data processing method according to an embodiment of the present invention;
FIG. 4a is a schematic diagram of audio information caching according to an embodiment of the present invention;
FIG. 5 is a block diagram of an audio data processing apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad invention. It should be further noted that, for convenience of description, only some structures, not all structures, relating to the embodiments of the present invention are shown in the drawings.
Fig. 1 is a flowchart of an audio data processing method according to an embodiment of the present invention, where the embodiment is applicable to a speech recognition process, and the method may be executed by a computing device such as a mobile phone, a notebook, an iPad, a server, a desktop, and the like, and specifically includes the following steps:
step S101, when the audio bit stream stored in the buffer area is larger than the target detection length, determining whether the audio bit stream is larger than the maximum detection length.
In one embodiment, the sending end encodes the audio data as an audio bitstream and transmits it to the receiving end, which stores the received bitstream in a buffer so that batch processing can be performed for speech recognition and the corresponding speech content obtained. To keep recognition real-time while avoiding the low accuracy caused by recognizing overly short audio bitstreams, a target detection length and a maximum detection length are set. Illustratively, the target detection length may be 20 seconds of the audio bitstream and the maximum detection length may be 30 seconds.
After a segment of the received audio bitstream is stored in the buffer, its length is detected. When the length exceeds the target detection length, it is further checked against the maximum detection length, and the audio bitstream is segmented according to the outcome.
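By way of illustration only, the two-threshold check just described can be sketched as follows; the 20 s / 30 s values come from the example above, while every function and constant name is an assumption introduced here, not language from the claims.

```python
# Hypothetical sketch of the buffer-length check; thresholds follow the
# illustrative 20 s / 30 s values given above, and all names are invented.
TARGET_DETECTION_LEN = 20.0  # seconds buffered before detection may start
MAX_DETECTION_LEN = 30.0     # seconds beyond which the short-term model is used

def choose_detection_path(buffered_seconds: float) -> str:
    """Decide what to do with the audio bitstream currently in the buffer."""
    if buffered_seconds <= TARGET_DETECTION_LEN:
        return "wait"                 # keep buffering: too short to detect
    if buffered_seconds <= MAX_DETECTION_LEN:
        return "sid_then_long_term"   # SID detector, then the first detector
    return "short_term"               # over the maximum: the second detector
```

A 25-second buffer would thus take the SID-plus-long-term path, while a 40-second buffer would go straight to the short-term detector.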
And step S102, determining that the corresponding endpoint detector carries out endpoint detection on the audio bit stream according to the result of determining whether the audio bit stream is larger than the maximum detection length.
In one embodiment, three endpoint detectors are provided to detect endpoints so that the audio bitstream can be segmented for speech recognition. To make recognition efficient, the audio information must be segmented reasonably; the conventional approach segments the audio bitstream at a fixed time interval, such as 20 s (the segmentation shown in fig. 1a), which splits continuous speech-frame information into different segments and thereby degrades recognition accuracy and performance.
Instead of segmenting at fixed time intervals, the scheme cascades three endpoint detectors: a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model. Specifically, in some circumstances the sender of the audio bitstream marks non-speech information, i.e. silence, by adding a silence descriptor (SID): SID = 0 indicates that the following audio frames are non-speech frames, i.e. the end point of the current speech segment, and when the sender starts transmitting speech-frame information again it adds a silence descriptor with SID = 1 to indicate that the following frames are speech frames. In some cases, however, such as scenes with non-stationary noise, this statistical approach becomes far less feasible. Fig. 1b is a schematic diagram of the mixing of speech information and fixed noise information provided by the present invention; as fig. 1b shows, speech frames and non-speech frames are then easily distinguished. Fig. 1c is a schematic diagram of the mixing of speech information and non-stationary noise provided by the present invention; as fig. 1c shows, speech frames and non-speech frames are then difficult to distinguish by conventional statistical methods.
Correspondingly, the scheme also provides a first endpoint detector based on a long-term network model and a second endpoint detector based on a short-term network model. The short-term network model consists of a convolutional layer and a fully-connected layer; the long-term network model consists of the convolutional-layer output of the second endpoint detector, a recurrent layer, and a fully-connected layer.
The first endpoint detector, based on the long-term network model, effectively distinguishes speech from high-energy non-speech; it has strong association and memory characteristics and very high detection precision. The second endpoint detector, based on the short-term network model, has weaker association and memory characteristics than the first and moderate detection precision, which makes it better suited to distinguishing the speech frames and non-speech frames of longer audio information.
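The two layer layouts just described can be summarized in a small declarative sketch; only the layer types come from the text above, while the tuple encoding and the `shared_with` attribute are assumptions made for illustration.

```python
# Layer layouts as described: the short-term model is a convolutional layer
# plus a fully-connected layer; the long-term model reuses the convolutional
# output of the second (short-term) detector and adds a recurrent layer,
# which gives it the longer memory noted in the text.
SHORT_TERM_MODEL = [
    ("conv", {}),                              # convolutional layer
    ("fc", {}),                                # fully-connected layer
]

LONG_TERM_MODEL = [
    ("conv", {"shared_with": "short_term"}),   # conv output of 2nd detector
    ("recurrent", {}),                         # recurrent layer: long memory
    ("fc", {}),                                # fully-connected layer
]
```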
In one embodiment, different endpoint detectors are dynamically selected to determine the endpoints of the audio bitstream according to the result of determining whether the audio bitstream is greater than the maximum detection length.
Step S103, according to the result of the endpoint detection, the audio bit stream is processed to obtain an audio sample for voice recognition.
In one embodiment, the corresponding silence descriptor detector, first endpoint detector based on the long-term network model, or second endpoint detector based on the short-term network model is selected for different audio bitstream lengths and performs endpoint detection to distinguish the speech information of the audio bitstream from its non-speech information. For example, fig. 1d is a schematic diagram of endpoint detection performed on an audio bitstream according to an embodiment of the present invention; as can be seen from fig. 1d, if the detected endpoints are endpoint 1 and endpoint 2, the speech-frame information between endpoint 1 and endpoint 2 is segmented out to obtain the audio sample to be recognized.
In another embodiment, if no endpoint is detected in the current audio bitstream, i.e. all of its information is speech information, the entire bitstream is cached as one audio sample for subsequent speech recognition.
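Both outcomes, cutting between two detected endpoints or caching the whole stream when none is found, can be sketched together as below; the representation of audio as a list of decoded frames and all names are hypothetical.

```python
# Hedged sketch of step S103: given decoded frames and the endpoint indices
# a detector returned, slice out the speech between endpoint 1 and endpoint 2
# (as in Fig. 1d); with no endpoint at all, cache everything as one sample.
def extract_sample(frames, endpoints):
    """Return (sample_for_recognition, remainder_kept_in_the_buffer)."""
    if not endpoints:
        return None, list(frames)          # all speech: cache, do not split
    if len(endpoints) == 1:
        start, end = endpoints[0], len(frames)
    else:
        start, end = endpoints[0], endpoints[1]
    return frames[start:end], frames[end:]
```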
According to this scheme, the corresponding endpoint detector is selected to perform endpoint detection on the audio bitstream based on whether the bitstream is greater than the maximum detection length, the endpoint detectors comprising a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model, and the audio bitstream is processed according to the endpoint detection result to obtain audio samples for speech recognition. This improves the flexibility of audio information segmentation and markedly improves the efficiency and accuracy of speech recognition.
Fig. 2 is a flowchart of another audio data processing method according to an embodiment of the present invention, which shows a specific method of selecting the corresponding endpoint detector for endpoint detection according to the result of determining whether the audio bitstream is greater than the maximum detection length. As shown in fig. 2, the technical solution is as follows:
step S201, when the audio bitstream stored in the buffer is greater than the target detection length, determining whether the audio bitstream is greater than the maximum detection length.
Step S202, if the audio bit stream is not larger than the maximum detection length, performing endpoint detection on the audio bit stream through a silence descriptor detector and a first endpoint detector based on a long-term network model.
In one embodiment, when it is determined that the audio bitstream is not greater than the maximum detection length, endpoint detection is performed using the silence descriptor detector, which detects reliably when descriptors are present, together with the first endpoint detector based on the long-term network model.
Specifically, it is first checked whether a silence identifier exists in the audio bitstream; if so, the endpoint is determined from the indication of the silence identifier, i.e. if SID = 0 is detected, the audio bitstream after the silence identifier is determined to be non-speech frames and the stream is segmented there.
If no silence identifier exists, or the detected SID value is 1, the audio bitstream is decoded, the decoded audio data are input to the first endpoint detector based on the long-term network model, and endpoint detection is performed by that detector. If a non-speech endpoint is detected, the audio information is segmented accordingly to obtain an audio sample for speech recognition; if no non-speech endpoint is detected, the audio information is cached and stored as an audio sample.
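A minimal sketch of the SID branch described above, assuming each frame carries an optional SID flag (0 meaning non-speech follows, 1 meaning speech resumes); the frame encoding and function name are invented for illustration.

```python
# Scan the bitstream's frames for a silence descriptor with SID = 0; its
# position marks the end point of the current speech segment. Frames are
# modeled here as (payload, sid) pairs, sid being None when no descriptor
# is attached to the frame.
def find_sid_endpoint(frames):
    for i, (_payload, sid) in enumerate(frames):
        if sid == 0:
            return i       # frames from index i onward are non-speech
    return None            # no SID = 0: fall back to the first detector
```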
Step S203, according to the result of the endpoint detection, processing the audio bitstream to obtain an audio sample for speech recognition.
According to this scheme, if the audio bitstream is not greater than the maximum detection length, the silence descriptor detector and the first endpoint detector based on the long-term network model perform endpoint detection on it so that non-speech frames are identified reliably and the audio information is segmented at them, avoiding the loss of recognition precision and efficiency caused by splitting continuous speech information into different parts for recognition.
Fig. 3 is a flowchart of another audio data processing method according to an embodiment of the present invention, which shows a specific method for performing endpoint detection on the audio bitstream by using a silence descriptor detector and a first endpoint detector based on a long-term network model. As shown in fig. 3, the technical solution is as follows:
step S301 detects the length of the audio bitstream stored in the buffer.
Step S301, judging whether the length of the audio bit stream is greater than the target detection length, if so, executing step S302, otherwise, continuing to execute step S301.
Step S303, judging whether the audio bit stream is larger than the maximum detection length, if not, executing step S304.
Step S304, detecting the silence descriptor in the audio bit stream.
Step S305, determining whether a silence descriptor is detected, if yes, executing step S306, otherwise executing step S307.
And step S306, determining the endpoint of the audio bit stream according to the position of the silence descriptor.
And step S307, performing endpoint detection on the audio bit stream by a first endpoint detector based on a long-term network model, and determining the detected endpoint as the endpoint of the audio bit stream.
In one embodiment, the audio bitstream is decoded and input to the first endpoint detector based on the long-term network model, which marks groups of audio frames of preset length: for example, a speech-information frame group is marked 1 and a non-speech-information frame group is marked 0, and the endpoint is determined from these group marks. Fig. 3a is a schematic diagram of the marking produced when the first endpoint detector based on the long-term network model performs endpoint detection on audio data according to an embodiment of the present invention; it can be seen that the first region is a non-speech frame region and the second region is a speech frame region, so the audio information is segmented at endpoint A to obtain the audio sample (the second region) used for subsequent speech recognition.
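The marking scheme of fig. 3a, frame groups labeled 1 for speech and 0 for non-speech with the 0-to-1 boundary taken as endpoint A, might be reduced to a transition search like the following sketch (all names hypothetical):

```python
# Find the first boundary where frame-group labels flip from 0 (non-speech)
# to 1 (speech); that group index plays the role of endpoint A in Fig. 3a.
def endpoint_from_labels(labels):
    for i in range(1, len(labels)):
        if labels[i - 1] == 0 and labels[i] == 1:
            return i
    return None            # no transition: no endpoint in this stream
```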
Step S308, according to the result of the endpoint detection, the audio bit stream is processed to obtain an audio sample for voice recognition.
According to this scheme, the first endpoint detector based on the long-term network model performs endpoint detection on the audio bitstream and effectively distinguishes speech information from non-speech information before the corresponding segmentation, avoiding the speech recognition errors caused by splitting speech information and improving recognition accuracy.
Fig. 4 is a flowchart of another audio data processing method for speech recognition according to an embodiment of the present invention, which shows a specific processing method for audio bitstreams larger than the maximum detection length. As shown in fig. 4, the technical solution is as follows:
step S401 detects the length of the audio bitstream stored in the buffer.
Step S401, judging whether the length of the audio bit stream is larger than the target detection length, if so, executing step S402, otherwise, continuing to execute step S401.
Step S403, determining whether the audio bitstream is greater than the maximum detection length, if not, performing step S404, and if so, performing step S411.
Step S404, detecting a silence descriptor in the audio bitstream.
Step S405, whether the silence descriptor is detected is determined, if yes, step S406 is executed, otherwise, step S407 is executed.
Step S406, determining the end point of the audio bit stream according to the position of the silence descriptor, and performing audio segmentation to obtain an audio sample.
Step S407, decoding the audio bitstream, and performing endpoint detection on the decoded audio information by using a first endpoint detector based on the long-term network model.
Step S408, whether an end point is detected or not is determined, if yes, step S409 is executed, and if not, step S410 is executed.
Step S409, determining the detected endpoint as the endpoint of the audio bitstream, and performing audio segmentation to obtain an audio sample.
And step S410, caching and storing the audio information.
And step S411, carrying out end point detection on the audio bit stream through a second end point detector based on a short-term network model.
In one embodiment, when it is determined that the audio bitstream is greater than the maximum detection length, endpoint detection is performed on it by the second endpoint detector based on the short-term network model, whose network characteristics improve the accuracy and efficiency of endpoint detection for such longer audio.
Step S412, determining whether an end point is detected, if so, executing step S409, otherwise, executing step S410.
According to this scheme, the received audio bitstream is segmented by a cascade of several endpoint detectors, with the appropriate detector selected according to the length of the input bitstream: if the audio bitstream is not greater than the maximum detection length, endpoint detection is performed by the silence descriptor detector and the first endpoint detector based on the long-term network model; if it is greater, endpoint detection is performed by the second endpoint detector based on the short-term network model, and the detected endpoint is taken as the endpoint of the audio bitstream. Reasonable audio segmentation is thus achieved, audio samples are generated, and the accuracy of subsequent speech recognition is improved.
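The whole cascade of Fig. 4 can be expressed as one dispatch function; the real detectors are neural models, stubbed here as callables, and every name is an assumption made for this sketch.

```python
# Hedged sketch of the Fig. 4 cascade: over the maximum length, use the
# short-term detector; otherwise prefer a SID marker when one was found,
# falling back to the long-term detector on the decoded audio.
def detect_endpoint(bitstream_seconds, sid_endpoint,
                    long_term_detect, short_term_detect, max_len=30.0):
    if bitstream_seconds > max_len:
        return short_term_detect()     # steps S411/S412
    if sid_endpoint is not None:
        return sid_endpoint            # steps S404-S406: SID marker wins
    return long_term_detect()          # steps S407/S408
```

Either branch may still return no endpoint, in which case the audio information is cached as in step S410.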
On the basis of the above technical scheme, if no endpoint is detected in the audio information decoded from the current audio bitstream, the information is not segmented but cached and stored. Fig. 4a is a schematic diagram of audio information caching according to an embodiment of the present invention. As shown in fig. 4a, audio information x is the decoded information of one input audio bitstream and audio information y is the decoded information of another; no endpoint is detected in either, i.e. both consist entirely of speech information. After x and y are cached in sequence, they are input to the speech recognizer for speech recognition. This processing mode ensures that continuous speech information is not segmented, improving speech recognition performance.
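The caching behaviour of fig. 4a, where x and y accumulate until an endpoint finally arrives and then feed the recognizer as one piece, could look like the following sketch; the byte-string representation of decoded audio and the function name are assumptions.

```python
# Append each decoded, endpoint-free segment to the cache; once a segment
# with an endpoint arrives, join the cache into a single sample for the
# speech recognizer, so continuous speech is never split.
def buffer_or_recognize(cache, decoded_segment, endpoint_found):
    cache.append(decoded_segment)
    if not endpoint_found:
        return None                    # keep accumulating (like x, then y)
    sample = b"".join(cache)
    cache.clear()
    return sample
```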
Fig. 5 is a block diagram of an audio data processing apparatus according to an embodiment of the present invention, where the apparatus is configured to execute the audio data processing method for speech recognition according to the embodiment of the present invention, and has corresponding functional modules and beneficial effects of the execution method. As shown in fig. 5, the apparatus specifically includes: an audio length detection module 101, an audio endpoint detection module 102, and an audio processing module 103, wherein,
an audio length detection module 101, configured to determine, when the audio bitstream stored in the buffer is greater than the target detection length, whether the audio bitstream is greater than the maximum detection length;
an audio endpoint detection module 102, configured to determine, according to a result of determining whether the audio bitstream is greater than a maximum detection length, that a corresponding endpoint detector performs endpoint detection on the audio bitstream, where the endpoint detector includes a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model;
and the audio processing module 103 is configured to process the audio bit stream according to the result of the endpoint detection to obtain an audio sample for speech recognition.
According to this scheme, when the audio bitstream stored in the buffer is greater than the target detection length, it is determined whether the bitstream is greater than the maximum detection length; according to the result, the corresponding endpoint detector is selected to perform endpoint detection on the audio bitstream, the endpoint detectors comprising a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model; and the audio bitstream is processed according to the endpoint detection result to obtain audio samples for speech recognition. The scheme avoids the degraded recognition performance caused by splitting a continuous speech segment into different parts for recognition, so that speech regions are not damaged and the efficiency and accuracy of speech recognition are markedly improved.
In a possible embodiment, the audio endpoint detection module 102 is specifically configured to:
performing end-point detection on the audio bitstream by a silence descriptor detector and a first end-point detector based on a long-term network model if the audio bitstream is not greater than the maximum detection length.
In a possible embodiment, the audio endpoint detection module 102 is specifically configured to:
and detecting a silence descriptor in the audio bit stream, and if the silence descriptor is detected, determining an endpoint of the audio bit stream according to the position of the silence descriptor.
In a possible embodiment, the audio endpoint detection module 102 is specifically configured to:
if the silence descriptor is not detected, performing endpoint detection on the audio bitstream based on a first endpoint detector of a long-term network model, and determining the detected endpoint as the endpoint of the audio bitstream.
In one possible embodiment, the long-term network model consists of the convolutional-layer output of the second endpoint detector, a recurrent layer, and a fully-connected layer.
In a possible embodiment, the audio endpoint detection module 102 is specifically configured to:
if the audio bitstream is greater than the maximum detection length, performing endpoint detection on the audio bitstream by the second endpoint detector based on the short-term network model, and determining the detected endpoint as the endpoint of the audio bitstream.
In one possible embodiment, the short-term network model consists of a convolutional layer and a fully-connected layer.
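The structural relationship between the two models can be sketched in plain NumPy. This is an illustrative sketch, not the patented networks: the layer sizes, the random weights, and the names (`conv_trunk`, `short_term_probs`, `long_term_probs`) are all invented here. It only demonstrates the structural point that the long-term model reuses the convolutional-layer output of the short-term (second) detector and inserts a recurrent layer before its fully-connected layer, giving it longer temporal context.

```python
import numpy as np

rng = np.random.default_rng(0)
F, H = 8, 4   # hypothetical per-frame feature size and hidden size

# Shared convolutional trunk over per-frame features (weights random here;
# a real system would train them).
Wc = rng.standard_normal((F, H)) * 0.1

def conv_trunk(frames):                        # frames: (T, F)
    return np.maximum(frames @ Wc, 0.0)        # ReLU, shape (T, H)

# Short-term model: convolutional layer + fully-connected layer.
Wf_s = rng.standard_normal((H, 1)) * 0.1
def short_term_probs(frames):
    h = conv_trunk(frames)
    return 1.0 / (1.0 + np.exp(-(h @ Wf_s)))   # per-frame speech probability

# Long-term model: reuses the conv-trunk output, then adds a recurrent
# layer and a fully-connected layer so each frame sees past context.
Wr = rng.standard_normal((H, H)) * 0.1
Wf_l = rng.standard_normal((H, 1)) * 0.1
def long_term_probs(frames):
    h = conv_trunk(frames)
    s, out = np.zeros(H), []
    for t in range(h.shape[0]):
        s = np.tanh(h[t] + s @ Wr)             # simple recurrence over frames
        out.append(1.0 / (1.0 + np.exp(-(s @ Wf_l))))
    return np.array(out)
```

An endpoint would then be placed where the per-frame speech probability stays below a threshold for long enough; the short-term model skips the recurrence, which is what makes it the faster choice for overlong streams.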
In a possible embodiment, the audio processing module 103 is specifically configured to:
if an endpoint is detected in the audio bitstream, performing segmentation and decoding according to the position of the endpoint to generate a plurality of audio samples for speech recognition.
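The segmentation step can be sketched as below. For simplicity the sketch splits already-decoded samples at the detected endpoint positions, whereas the scheme segments the bitstream at the endpoint and then decodes; the splitting logic is the same either way. `split_at_endpoints` is a hypothetical name introduced for illustration.

```python
def split_at_endpoints(pcm, endpoints):
    """Split a sequence of decoded samples at endpoint positions into
    segments suitable for feeding a speech recognizer one at a time."""
    segments, start = [], 0
    for ep in sorted(endpoints):
        if start < ep <= len(pcm):        # ignore out-of-range/duplicate endpoints
            segments.append(pcm[start:ep])
            start = ep
    if start < len(pcm):
        segments.append(pcm[start:])      # trailing audio becomes the last sample
    return segments

# e.g. endpoints at samples 4 and 7 yield three segments
segs = split_at_endpoints(list(range(10)), [4, 7])
```

Cutting only at detected endpoints (rather than at fixed lengths) is what keeps each continuous speech segment inside a single audio sample.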
In a possible embodiment, the audio processing module 103 is specifically configured to:
decoding and buffering the audio bitstream if an endpoint is not detected in the audio bitstream.
Fig. 6 is a schematic structural diagram of an audio data processing device according to an embodiment of the present invention. As shown in fig. 6, the device includes a processor 201, a memory 202, an input device 203, and an output device 204. The number of processors 201 in the device may be one or more; one processor 201 is taken as an example in fig. 6. The processor 201, the memory 202, the input device 203, and the output device 204 in the device may be connected by a bus or in other ways; a bus connection is taken as an example in fig. 6. The memory 202, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the audio data processing method for speech recognition in the embodiments of the present invention. The processor 201 executes the software programs, instructions, and modules stored in the memory 202 to perform the various functional applications and data processing of the device, i.e., to implement the above-described audio data processing method for speech recognition. The input device 203 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output device 204 may include a display device such as a display screen.
Embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a method of audio data processing for speech recognition, the method comprising:
when the audio bit stream stored in the buffer is larger than the target detection length, determining whether the audio bit stream is larger than the maximum detection length;
determining a corresponding endpoint detector to perform endpoint detection on the audio bit stream according to a result of determining whether the audio bit stream is greater than a maximum detection length, wherein the endpoint detector comprises a silence descriptor detector, a first endpoint detector based on a long-term network model and a second endpoint detector based on a short-term network model;
and processing the audio bit stream according to the result of the endpoint detection to obtain an audio sample for voice recognition.
From the above description of the embodiments, it will be clear to those skilled in the art that the embodiments of the present invention can be implemented by software together with necessary general-purpose hardware; they can certainly also be implemented entirely in hardware, but the former is the better implementation in many cases. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and which includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device) to perform the methods described in the embodiments of the present invention.
It should be noted that, in the above embodiment of the audio data processing apparatus for speech recognition, the units and modules included are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not intended to limit the protection scope of the embodiments of the present invention.
It should be noted that the foregoing is only a preferred embodiment of the present invention and an illustration of the technical principles applied. Those skilled in the art will appreciate that the embodiments of the present invention are not limited to the specific embodiments described herein, and that various obvious changes, adaptations, and substitutions are possible without departing from the scope of the embodiments of the present invention. Therefore, although the embodiments of the present invention have been described in detail through the above embodiments, they are not limited to those embodiments; many other equivalent embodiments may be included without departing from the concept of the embodiments of the present invention, and the scope of the embodiments of the present invention is determined by the scope of the appended claims.

Claims (12)

1. An audio data processing method for speech recognition, comprising:
when the audio bit stream stored in the buffer is larger than the target detection length, determining whether the audio bit stream is larger than the maximum detection length;
determining a corresponding endpoint detector to perform endpoint detection on the audio bit stream according to a result of determining whether the audio bit stream is greater than a maximum detection length, wherein the endpoint detector comprises a silence descriptor detector, a first endpoint detector based on a long-term network model and a second endpoint detector based on a short-term network model;
and processing the audio bit stream according to the result of the endpoint detection to obtain an audio sample for voice recognition.
2. The audio data processing method of claim 1, wherein the determining, according to the result of determining whether the audio bitstream is greater than the maximum detection length, a corresponding endpoint detector to perform endpoint detection on the audio bitstream comprises:
performing end-point detection on the audio bitstream by a silence descriptor detector and a first end-point detector based on a long-term network model if the audio bitstream is not greater than the maximum detection length.
3. The audio data processing method of claim 2, wherein the performing of the endpoint detection on the audio bitstream by the silence descriptor detector and the first endpoint detector based on the long-term network model comprises:
and detecting a silence descriptor in the audio bit stream, and if the silence descriptor is detected, determining an endpoint of the audio bit stream according to the position of the silence descriptor.
4. The audio data processing method according to claim 3, wherein if no silence descriptor is detected, endpoint detection is performed on the audio bitstream by the first endpoint detector based on the long-term network model, and the detected endpoint is determined as the endpoint of the audio bitstream.
5. The audio data processing method of claim 4, wherein the long-term network model is composed of the convolutional-layer output of the second endpoint detector, a recurrent layer, and a fully-connected layer.
6. The audio data processing method according to claim 1, wherein if the audio bitstream is larger than the maximum detection length, endpoint detection is performed on the audio bitstream by the second endpoint detector based on the short-term network model, and the detected endpoint is determined as the endpoint of the audio bitstream.
7. The audio data processing method of claim 6, wherein the short-term network model is composed of a convolutional layer and a fully-connected layer.
8. The audio data processing method according to any one of claims 1 to 7, wherein the processing the audio bitstream according to the result of the endpoint detection to obtain audio samples for speech recognition comprises:
and if the endpoint is detected in the audio bit stream, performing segmentation and decoding processing according to the position of the endpoint to generate a plurality of audio samples for speech recognition.
9. The audio data processing method according to claim 8, wherein if no endpoint is detected in the audio bitstream, the audio bitstream is decoded and buffered.
10. An audio data processing apparatus for speech recognition, comprising:
an audio length detection module, configured to determine, when the audio bitstream stored in the buffer is larger than the target detection length, whether the audio bitstream is larger than the maximum detection length;
an audio endpoint detection module, configured to determine, according to a result of determining whether the audio bitstream is greater than a maximum detection length, that a corresponding endpoint detector performs endpoint detection on the audio bitstream, where the endpoint detector includes a silence descriptor detector, a first endpoint detector based on a long-term network model, and a second endpoint detector based on a short-term network model;
and the audio processing module is used for processing the audio bit stream according to the result of the endpoint detection to obtain an audio sample for voice recognition.
11. An audio data processing device for speech recognition, the device comprising: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the audio data processing method for speech recognition according to any one of claims 1 to 9.
12. A storage medium containing computer-executable instructions for performing the audio data processing method for speech recognition according to any one of claims 1-9 when executed by a computer processor.
CN202011543521.3A 2020-12-24 2020-12-24 Audio data processing method and device for voice recognition Pending CN112634907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011543521.3A CN112634907A (en) 2020-12-24 2020-12-24 Audio data processing method and device for voice recognition

Publications (1)

Publication Number Publication Date
CN112634907A true CN112634907A (en) 2021-04-09

Family

ID=75322123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011543521.3A Pending CN112634907A (en) 2020-12-24 2020-12-24 Audio data processing method and device for voice recognition

Country Status (1)

Country Link
CN (1) CN112634907A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1522989A1 (en) * 2003-10-08 2005-04-13 Agfa Inc. System and method for synchronized text display and audio playback
US9437186B1 (en) * 2013-06-19 2016-09-06 Amazon Technologies, Inc. Enhanced endpoint detection for speech recognition
CN106782508A (en) * 2016-12-20 2017-05-31 美的集团股份有限公司 The cutting method of speech audio and the cutting device of speech audio
CN107331386A (en) * 2017-06-26 2017-11-07 上海智臻智能网络科技股份有限公司 End-point detecting method, device, processing system and the computer equipment of audio signal
US20180090127A1 (en) * 2016-09-27 2018-03-29 Intel Corporation Adaptive speech endpoint detector
CN109767792A (en) * 2019-03-18 2019-05-17 百度国际科技(深圳)有限公司 Sound end detecting method, device, terminal and storage medium
CN110232933A (en) * 2019-06-03 2019-09-13 Oppo广东移动通信有限公司 Audio-frequency detection, device, storage medium and electronic equipment
CN110459228A (en) * 2013-06-19 2019-11-15 杜比实验室特许公司 Audio treatment unit and method for being decoded to coded audio bitstream

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG WENJUN: "Endpoint detection based on MDL using subband speech satisfied auditory model", International Conference on Neural Networks and Signal Processing, 2003, Proceedings of the 2003 *
ZHU CHUNLI: "Research on voice endpoint detection method based on multi-feature fusion" (基于多特征融合的语音端点检测方法研究), China Master's Theses Full-text Database *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination