CN112614515A - Audio processing method and device, electronic equipment and storage medium - Google Patents

Info

Publication number: CN112614515A
Application number: CN202011508130.8A
Authority: CN (China)
Prior art keywords: audio, segment, target audio, count value, target
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN112614515B
Inventors: 曾耀武; 黄强; 谭安林
Current and original assignee: Guangzhou Huya Technology Co., Ltd.

Events: application filed by Guangzhou Huya Technology Co., Ltd.; publication of CN112614515A; application granted; publication of CN112614515B.

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/78: Detection of presence or absence of voice signals
    • G10L 25/87: Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The application provides an audio processing method and apparatus, an electronic device, and a storage medium, relating to the field of Internet technology. A target audio segment is determined in an audio file in chronological order using a preset sliding window, and the target audio segment is added to a pre-configured buffer when it is determined to be a voiced segment and the recorded turn mark indicates valid audio. In this way, when the audio segments stored in the buffer are used to label training samples, the proportion of low-quality training samples is reduced and the labeling yield is improved.

Description

Audio processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of artificial intelligence technology, deep learning models are increasingly applied in daily life and production, for example in human-computer interaction, speech recognition, and telecom fraud detection.
Before a deep learning model is trained, a large number of training samples with classification labels need to be obtained, so that the model can be trained on the corresponding labeled samples.
However, during labeling, the training samples may include many low-quality samples, resulting in a low labeling yield.
Disclosure of Invention
The application aims to provide an audio processing method and apparatus, an electronic device, and a storage medium that can improve the labeling yield.
To achieve this, the application adopts the following technical solutions:
in a first aspect, the present application provides an audio processing method, including:
determining a target audio segment in an audio file in chronological order using a preset sliding window;
if the target audio segment is determined to be a voiced segment and the recorded turn mark indicates valid audio, adding the target audio segment to a pre-configured buffer, where the turn mark indicates whether the currently processed audio segment is valid audio or invalid audio.
In a second aspect, the present application provides an audio processing apparatus, the apparatus comprising:
a sliding module, configured to determine a target audio segment in an audio file in chronological order using a preset sliding window;
a processing module, configured to add the target audio segment to a pre-configured buffer if the target audio segment is determined to be a voiced segment and the recorded turn mark indicates valid audio, where the turn mark indicates whether the currently processed audio segment is valid audio or invalid audio.
In a third aspect, the present application provides an electronic device comprising a memory for storing one or more programs; a processor; the one or more programs, when executed by the processor, implement the audio processing method described above.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the audio processing method described above.
According to the audio processing method and apparatus, the electronic device, and the storage medium, a target audio segment is determined in an audio file in chronological order using a preset sliding window, and the target audio segment is added to a pre-configured buffer when it is determined to be a voiced segment and the recorded turn mark indicates valid audio. In this way, when the audio segments stored in the buffer are used to label training samples, the proportion of low-quality training samples is reduced and the labeling yield is improved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To explain the technical solutions of the present application more clearly, the drawings needed for the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 shows a schematic structural block diagram of an electronic device provided by the present application.
Fig. 2 shows a schematic flow chart of an audio processing method provided by the present application.
Fig. 3 shows a scene schematic diagram of an audio processing method provided by the present application.
Fig. 4 shows a block flow diagram of an audio processing method provided by the present application.
Fig. 5 shows a schematic block diagram of an audio processing apparatus provided in the present application.
In the figure: 100-an electronic device; 101-a memory; 102-a processor; 103-a communication interface; 300-an audio processing device; 301-a sliding module; 302-processing module.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the present application will be clearly and completely described below with reference to the accompanying drawings in some embodiments of the present application, and it is obvious that the described embodiments are some, but not all embodiments of the present application. The components of the present application, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on a part of the embodiments in the present application without any creative effort belong to the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Taking the application of deep learning models to intelligent speech technology as an example: before a deep learning model is trained, audio or video data can be clipped to obtain a large number of training samples, which are then labeled, yielding a large set of training samples with classification labels for training the model.
However, the clipped training samples may include low-quality samples, such as silence samples, noise samples, and samples with a small proportion of valid audio. When labeling, many of these low-quality samples may be annotated, reducing the labeling yield of high-quality training samples.
Therefore, to at least partially overcome these drawbacks, some possible embodiments provided by the present application proceed as follows: a target audio segment is determined in an audio file in chronological order using a preset sliding window, and the target audio segment is added to a pre-configured buffer when it is determined to be a voiced segment and the recorded turn mark indicates valid audio. In this way, when the audio segments stored in the buffer are used to label training samples, the proportion of low-quality training samples is reduced and the labeling yield is improved.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 shows a schematic block diagram of an electronic device 100 provided in the present application, and in some embodiments, the electronic device 100 may include a memory 101, a processor 102, and a communication interface 103, and the memory 101, the processor 102, and the communication interface 103 are electrically connected to each other directly or indirectly to implement data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
The memory 101 may be used to store software programs and modules, such as program instructions/modules corresponding to the audio processing apparatus provided in the present application, and the processor 102 executes the software programs and modules stored in the memory 101 to execute various functional applications and data processing, thereby executing the steps of the audio processing method provided in the present application. The communication interface 103 may be used for communicating signaling or data with other node devices.
The memory 101 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 102 may be an integrated circuit chip with signal processing capabilities. The processor 102 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
It will be appreciated that the configuration shown in FIG. 1 is merely illustrative and that electronic device 100 may include more or fewer components than shown in FIG. 1 or may have a different configuration than shown in FIG. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
The following takes the electronic device 100 shown in fig. 1 as a schematic execution subject to exemplarily explain the audio processing method provided by the present application.
Referring to fig. 2, fig. 2 shows a schematic flow chart of an audio processing method provided by the present application, which may include the following steps in some embodiments:
step 201, according to the sequence of time, a preset sliding window is used to determine a target audio clip in an audio file.
In step 203, if it is determined that the target audio segment is a voiced segment and the recorded turn is marked as a valid audio, the target audio segment is added to a pre-configured buffer.
In some embodiments, the electronic device may cut the original audio file in chronological order using a tool such as VAD (Voice Activity Detection) combined with a preset sliding window (for example, 50 ms) to obtain a plurality of audio segments.
For example, as shown in fig. 3, for an audio file to be cut, the electronic device may cut out a first audio segment, a second audio segment, a third audio segment, a fourth audio segment, ..., a twelfth audio segment, and a thirteenth audio segment according to the preset sliding window.
In some possible embodiments, when executing step 201, the electronic device may first cut the entire audio file with the VAD tool and then take the resulting audio segments as target audio segments one by one in chronological order to execute step 203. Alternatively, as the sliding window moves during cutting, the electronic device may take each newly cut audio segment as the target audio segment in turn and execute step 203 with it until the cutting of the audio file is finished.
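As a hedged sketch of the chronological cutting described above (the patent does not prescribe an implementation): assuming 16 kHz mono PCM samples and a 50 ms window, the function name and parameters below are illustrative only.

```python
def cut_segments(samples, sample_rate=16000, window_ms=50):
    """Yield consecutive fixed-length audio segments in chronological order."""
    window = sample_rate * window_ms // 1000  # samples per window: 800 at 16 kHz / 50 ms
    for start in range(0, len(samples) - window + 1, window):
        yield samples[start:start + window]

# 3200 samples at 16 kHz = 200 ms of audio, which yields four 50 ms segments
segments = list(cut_segments(list(range(3200))))
```

Each yielded segment can then be treated as a target audio segment for step 203, either after cutting finishes or incrementally as the window advances.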
In addition, the electronic device records a turn mark, which indicates whether the audio segment currently being processed is valid audio or invalid audio. When the turn mark indicates invalid audio, the electronic device is currently processing lower-quality audio segments, such as silence or noise; when it indicates valid audio, the electronic device is currently processing higher-quality segments, i.e., segments with a larger proportion of valid audio.
On this basis, for a determined target audio segment, the electronic device may apply schemes such as audio energy detection or deep learning model recognition to decide whether the target audio segment is a voiced segment or a silent segment. If the detection result indicates that the target audio segment is a voiced segment and the recorded turn mark indicates valid audio, the electronic device may conclude that the target audio segment is of higher quality and add it to a pre-configured buffer, which is used to store higher-quality audio segments.
Therefore, with the audio scheme provided by the application, when the audio segments stored in the buffer are used to label training samples, the proportion of low-quality training samples is reduced and the labeling yield is improved.
In some possible scenarios, the electronic device may determine whether the target audio segment is a voiced segment using an audio energy detection scheme. For example, in some embodiments, the electronic device may calculate a target audio energy value for the target audio segment and compare it with a set audio energy threshold: if the target audio energy value reaches the threshold, the target audio segment is determined to be a voiced segment; conversely, if it does not reach the threshold, the target audio segment is determined to be a silent segment.
It should be noted that an audio waveform generally oscillates around zero (for example, like a sine wave), so over a period of time, directly summing the raw values of all audio points may yield a result close to 0. Therefore, to calculate a meaningful target audio energy value, the electronic device may use a sum of squares: square the value of each audio point in the target audio segment, sum the squares, and take the result as the target audio energy value.
Of course, this is only an example. In some other possible embodiments of the present application, a sum of absolute values may be used instead: the absolute values of all audio points in the target audio segment are summed, and the result is taken as the target audio energy value.
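A hedged illustration of the two energy measures described above; the function names and the threshold value are assumptions for illustration. The toy waveform shows why raw summation is avoided: a zero-centered signal cancels to 0, while the squared or absolute sums do not.

```python
def energy_sum_of_squares(segment):
    """Square each audio point's value and sum; the result is always non-negative."""
    return sum(x * x for x in segment)

def energy_sum_of_abs(segment):
    """Alternative measure: sum the absolute values of the audio points."""
    return sum(abs(x) for x in segment)

def is_voiced(segment, energy_threshold):
    """A segment is voiced if its energy value reaches the set threshold."""
    return energy_sum_of_squares(segment) >= energy_threshold

segment = [3, -3, 2, -2]                 # zero-centered toy waveform
raw_sum = sum(segment)                   # 0: raw summation is uninformative
energy = energy_sum_of_squares(segment)  # 9 + 9 + 4 + 4 = 26
```

With a threshold of 20, this toy segment would be classified as voiced; with a threshold of 30, as silent.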
In addition, in the above solution, when the target audio segment is determined to be a voiced segment and the recorded turn mark indicates valid audio, the target audio segment is judged to be of higher quality and is added to the buffer.
However, in some other possible scenarios, the electronic device may determine that the target audio segment is a voiced segment while the recorded turn mark indicates invalid audio, meaning that a voiced segment was cut while lower-quality audio was being processed. In this scenario, the electronic device may first check whether a recorded voiced count value is greater than a first threshold, where the voiced count value represents the number of consecutive voiced segments recorded while in the invalid-audio state.
If the voiced count value is less than or equal to the first threshold, the target audio segment may be a transient noise segment, so the electronic device may discard it and update the voiced count value, for example by incrementing it by one. Conversely, if the voiced count value is greater than the first threshold, enough consecutive voiced segments have been recorded that the target audio segment is likely valid, higher-quality audio; the electronic device may then add the target audio segment to the buffer, reset the voiced count value, and update the turn mark to valid audio, indicating that the electronic device has entered the state of acquiring valid audio.
Thus, the proportion of the noise samples stored in the buffer area can be reduced, and the output rate during marking can be further improved.
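The voiced-segment rule in the invalid-audio state might be sketched as follows; the function name, the dict-based state, and the string values for the turn mark are illustrative assumptions, not the patent's prescription:

```python
def handle_voiced_while_invalid(segment, state, buffer, first_threshold):
    """Apply the voiced-segment rule while the turn mark indicates invalid audio."""
    if state["voiced_count"] <= first_threshold:
        state["voiced_count"] += 1        # discard: possibly a transient noise segment
    else:
        buffer.append(segment)            # enough consecutive voiced segments seen
        state["voiced_count"] = 0         # reset the voiced count value
        state["turn_mark"] = "valid"      # enter the valid-audio state

state = {"voiced_count": 3, "turn_mark": "invalid"}
buffer = []
handle_voiced_while_invalid("seg_a", state, buffer, first_threshold=3)
# count 3 <= 3, so the segment is dropped and the count becomes 4
handle_voiced_while_invalid("seg_b", state, buffer, first_threshold=3)
# count 4 > 3, so the segment is buffered and the turn mark flips to "valid"
```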
It can be understood that the above describes the scenario in which the electronic device determines the target audio segment to be a voiced segment. In other possible scenarios, the target audio segment may be determined to be a silent segment. In that case, if the turn mark indicates invalid audio, the electronic device is likely still acquiring low-quality audio, and it may simply discard the target audio segment.
On the other hand, in some possible scenarios, the electronic device may determine that the target audio segment is a silent segment while the turn mark indicates valid audio. The electronic device may first update the recorded silence count value, for example by incrementing it by one, where the silence count value represents the number of consecutive silent segments recorded while in the valid-audio state.
Then, the electronic device may compare the updated silence count value with a second threshold. If the updated silence count value is less than or equal to the second threshold, the silent segment may be a normal pause between consecutive utterances, and the electronic device may add the target audio segment to the buffer. Conversely, if the updated silence count value is greater than the second threshold, the silent segment may be the gap after the last speech segment; the electronic device may discard the target audio segment, reset the silence count value, and update the turn mark to invalid audio, indicating that the electronic device has entered the state of acquiring invalid audio.
It should be noted that in the scheme above, the electronic device updates the recorded silence count value first and then compares the updated value with the second threshold. In some other possible scenarios, the electronic device may instead compare the silence count value with the second threshold first and update the recorded silence count value only when it is smaller than the second threshold; the present application does not limit the order of these steps.
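The silence-segment rule in the valid-audio state might be sketched in the same style as the voiced-segment rule; again, all names and the dict-based state are illustrative assumptions:

```python
def handle_silence_while_valid(segment, state, buffer, second_threshold):
    """Apply the silence-segment rule while the turn mark indicates valid audio."""
    state["silence_count"] += 1           # update the silence count value first
    if state["silence_count"] <= second_threshold:
        buffer.append(segment)            # likely a normal pause between utterances
    else:
        state["silence_count"] = 0        # discard: likely the gap after speech ends
        state["turn_mark"] = "invalid"    # enter the invalid-audio state

state = {"silence_count": 7, "turn_mark": "valid"}
buffer = []
handle_silence_while_valid("pause", state, buffer, second_threshold=8)
# count becomes 8 <= 8, so the segment is buffered
handle_silence_while_valid("gap", state, buffer, second_threshold=8)
# count becomes 9 > 8, so the segment is dropped and the turn mark flips to "invalid"
```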
In addition, since the buffer stores audio segments cut according to the preset sliding window, each segment is generally short, for example 50 ms. Therefore, in some possible scenarios, the electronic device may merge some of the audio segments in the buffer into an audio sequence, so that the merged sequence can be labeled as a training sample before the deep learning model is trained.
In some possible scenarios, when the electronic device has acquired many silent segments in a row (for example, in the scheme above, when the updated silence count value exceeds the second threshold, at least that many consecutive silent segments have been acquired), a long pause in speech has probably occurred in the audio file. At this point, the electronic device may compare the accumulated duration of all audio segments stored in the buffer with a set first duration threshold; if the accumulated duration reaches the first duration threshold, the electronic device may merge all audio segments in the buffer into one audio sequence and reset the buffer.
The first duration threshold indicates the minimum length of a merged audio sequence. It constrains the minimum duration of every merged audio sequence, avoiding large duration differences between sequences.
In addition, as described above, when the electronic device cuts voiced segments continuously, it may keep storing voiced segments into the buffer.
Therefore, to prevent the merged audio sequence from becoming too long, each time the electronic device adds an audio segment (whether voiced or silent) to the buffer, it may compare the accumulated duration of all audio segments in the buffer with a set second duration threshold. If the accumulated duration reaches the second duration threshold, the electronic device may merge all audio segments in the buffer into one audio sequence and reset the buffer to begin accumulating the next sequence. Constraining the duration of merged audio sequences with the second duration threshold likewise avoids excessive duration differences between sequences.
It is to be understood that, in some possible embodiments of the present application, the first duration threshold may be smaller than the second duration threshold.
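The duration-based merging might be sketched as below, assuming each buffered segment is 50 ms; the function name and threshold values are illustrative:

```python
def merge_if_long_enough(buffer, threshold_ms, segment_ms=50):
    """Merge and reset the buffer once its accumulated duration reaches the threshold."""
    if len(buffer) * segment_ms >= threshold_ms:
        sequence = [sample for segment in buffer for sample in segment]
        buffer.clear()                    # reset the buffer for the next sequence
        return sequence
    return None                           # not long enough yet: keep accumulating

buffer = [[1, 2], [3, 4], [5, 6]]         # three 50 ms segments = 150 ms total
merge_if_long_enough(buffer, threshold_ms=200)   # 150 ms < 200 ms: returns None
buffer.append([7, 8])                            # accumulated duration reaches 200 ms
merged = merge_if_long_enough(buffer, threshold_ms=200)
```

The same helper can serve both checks: called with the first duration threshold after a long pause, and with the second duration threshold after every addition to the buffer.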
In addition, in the deep-learning training scenario described in this application, the electronic device may also store a pre-trained audio classification model. After obtaining an audio sequence, the electronic device may input the sequence into the audio classification model and obtain the audio class label it outputs, for example speech, silence, background music (BGM), or singing.
The above audio processing scheme provided by the present application is again exemplified below with reference to fig. 3 and 4.
Assume the first threshold is 3, the second threshold is 8, the first duration threshold is 5 s, the second duration threshold is 8 s, the first, second, and third audio segments are all voiced segments, the turn mark indicates valid audio, the silence count value is 0, and the voiced count value is 3. Then:
When the electronic device determines that the fourth audio segment is a voiced segment and that the turn mark indicates valid audio, it adds the fourth audio segment to the buffer.
When the electronic device determines that the fifth audio segment is a silent segment and that the turn mark indicates valid audio, it updates the silence count value to 1; since the updated silence count value 1 is less than the second threshold 8, it adds the fifth audio segment to the buffer.
……
In addition, assume that the thirtieth audio segment is a voiced segment, the turn mark at this point indicates invalid audio, and the voiced count value is 4. Then, since the voiced count value 4 is greater than the first threshold 3, the electronic device adds the thirtieth audio segment to the buffer, resets the voiced count value to 0, and changes the turn mark to valid audio.
……
Further, assume that the fortieth audio segment is a silent segment, the turn mark indicates valid audio, and the silence count value is 8. The electronic device increments the silence count value to 9. Since the updated silence count value 9 is greater than the second threshold 8, the electronic device changes the turn mark to invalid audio, resets the silence count value to 0, and checks whether the accumulated duration of all audio segments in the buffer reaches 5 s; if so, it merges all audio segments in the buffer into one audio sequence and resets the buffer.
In addition, each time the electronic device adds an audio segment to the buffer, it checks whether the accumulated duration of all audio segments in the buffer reaches 8 s; if so, it merges all audio segments in the buffer into one audio sequence and resets the buffer.
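Putting the rules of the walkthrough together, a hedged end-to-end sketch might look like the following. All names, the dict-based state, and the behavior in cases the description leaves open (for example, whether the buffer is kept when a long pause occurs but the accumulated duration falls short of the first duration threshold) are assumptions, not the patent's prescription:

```python
def process(segments, first=3, second=8, min_ms=5000, max_ms=8000, segment_ms=50):
    """segments: iterable of (segment, is_voiced) pairs in chronological order."""
    state = {"turn": "valid", "voiced": 0, "silence": 0}
    buffer, sequences = [], []

    def flush():                                   # merge the buffer into one sequence
        sequences.append(list(buffer))
        buffer.clear()

    def add(segment):                              # add, then apply the second duration threshold
        buffer.append(segment)
        if len(buffer) * segment_ms >= max_ms:
            flush()

    for segment, voiced in segments:
        if voiced and state["turn"] == "valid":
            add(segment)                           # higher-quality segment: keep it
        elif voiced:                               # voiced while in the invalid-audio state
            if state["voiced"] <= first:
                state["voiced"] += 1               # drop: possibly transient noise
            else:
                add(segment)                       # enough consecutive voiced segments
                state["voiced"] = 0
                state["turn"] = "valid"
        elif state["turn"] == "valid":             # silent while in the valid-audio state
            state["silence"] += 1
            if state["silence"] <= second:
                add(segment)                       # short pause: keep it
            else:
                state["silence"] = 0
                state["turn"] = "invalid"          # long pause: flush if long enough
                if buffer and len(buffer) * segment_ms >= min_ms:
                    flush()
        # silent while in the invalid-audio state: drop the segment
    return sequences
```

For example, 160 consecutive voiced segments of 50 ms accumulate to 8 s and are merged into a single sequence by the second duration threshold.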
Referring to fig. 5, fig. 5 shows a schematic block diagram of an audio processing apparatus 300 provided in the present application, wherein the audio processing apparatus 300 may include a sliding module 301 and a processing module 302 in some embodiments.
The sliding module 301 is configured to determine a target audio clip in an audio file by using a preset sliding window according to a time sequence;
a processing module 302, configured to add a target audio segment to a pre-configured buffer if the target audio segment is determined to be a voiced segment and the recorded turn mark indicates valid audio, where the turn mark indicates whether the currently processed audio segment is valid audio or invalid audio.
Optionally, in some possible embodiments of the present application, the processing module 302 is further configured to, if the target audio segment is determined to be a voiced segment and the recorded turn mark indicates invalid audio, check whether the recorded voiced count value is greater than a first threshold, where the voiced count value represents the number of consecutive voiced segments recorded while in the invalid-audio state;
if the voiced count value is less than or equal to the first threshold, discard the target audio segment and update the voiced count value, the updated voiced count value being used the next time it is checked against the first threshold;
if the voiced count value is greater than the first threshold, add the target audio segment to the buffer, reset the voiced count value, and update the turn mark to valid audio.
Optionally, in some possible embodiments of the present application, the processing module 302 is further configured to discard the target audio segment if it is determined to be a silent segment and the turn mark indicates invalid audio.
Optionally, in some possible embodiments of the present application, the processing module 302 is further configured to update the recorded silence count value if the target audio segment is determined to be a silent segment and the turn mark indicates valid audio, where the silence count value represents the number of consecutive silent segments recorded while in the valid-audio state;
determine whether the updated silence count value is greater than a second threshold;
if the updated silence count value is less than or equal to the second threshold, add the target audio segment to the buffer;
if the updated silence count value is greater than the second threshold, discard the target audio segment, reset the silence count value, and update the turn mark to invalid audio.
Optionally, in some possible embodiments of the present application, if the updated silence count value is greater than the second threshold, the processing module 302 is further configured to merge all the audio segments stored in the buffer into an audio sequence and reset the buffer, provided that the cumulative duration of those segments reaches a set first duration threshold.
Optionally, in some possible embodiments of the present application, the processing module 302 is further configured to merge all the audio segments stored in the buffer into an audio sequence and reset the buffer if the cumulative duration of the stored segments reaches a set second duration threshold.
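Taken together, the modules above describe a small state machine: a turn mark, two debounce counters, and a buffer that is flushed into merged sequences. The sketch below is illustrative only; the threshold values, the window length, and the choice to keep an under-length buffer when a turn closes are assumptions of this sketch, not values taken from the application:

```python
class AudioSegmenter:
    """Sketch of the buffering state machine described above.

    turn_valid is the "turn mark" (True = valid audio); voiced_count and
    silence_count are the counters compared against the first and second
    thresholds; merged audio sequences accumulate in self.sequences.
    """

    def __init__(self, voiced_threshold=1, silence_threshold=1,
                 merge_duration=1.0, max_duration=30.0, window=0.5):
        self.voiced_threshold = voiced_threshold    # "first threshold" (assumed value)
        self.silence_threshold = silence_threshold  # "second threshold" (assumed value)
        self.merge_duration = merge_duration        # "first duration threshold", seconds
        self.max_duration = max_duration            # "second duration threshold", seconds
        self.window = window                        # sliding-window length, seconds (assumed)
        self.turn_valid = False
        self.voiced_count = 0
        self.silence_count = 0
        self.buffer = []
        self.buffered_time = 0.0
        self.sequences = []

    def _flush(self):
        # Merge all buffered segments into one audio sequence and reset the buffer.
        if self.buffer:
            self.sequences.append(list(self.buffer))
        self.buffer.clear()
        self.buffered_time = 0.0

    def _buffer_segment(self, segment):
        self.buffer.append(segment)
        self.buffered_time += self.window
        if self.buffered_time >= self.max_duration:
            self._flush()  # forced merge at the second duration threshold

    def process(self, segment, voiced):
        """Handle one sliding-window segment; voiced is the segment's energy verdict."""
        if voiced:
            if self.turn_valid:
                self._buffer_segment(segment)
            elif self.voiced_count > self.voiced_threshold:
                # Enough consecutive voiced segments: open a valid-audio turn.
                self._buffer_segment(segment)
                self.voiced_count = 0
                self.turn_valid = True
            else:
                self.voiced_count += 1  # discard the segment, update the count
        else:
            if not self.turn_valid:
                return  # silent segment outside a turn: discard
            self.silence_count += 1
            if self.silence_count <= self.silence_threshold:
                self._buffer_segment(segment)  # tolerate a short pause
            else:
                # Long pause: discard the segment and close the turn.
                self.silence_count = 0
                self.turn_valid = False
                if self.buffered_time >= self.merge_duration:
                    self._flush()
```

With the toy thresholds above, two leading voiced segments are debounced away, the third opens a turn, one short pause is kept inside it, and the second consecutive silent segment closes the turn and merges the buffer.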
Optionally, in some possible embodiments of the present application, the processing module 302 is further configured to input the audio sequence to a pre-trained audio classification model, and obtain an audio class label output by the audio classification model.
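The classification step above is model-agnostic: the application only states that a merged sequence is fed to a pre-trained model that outputs a class label. A minimal sketch, where `classify` is a hypothetical callable standing in for the trained model (the `toy_classifier` below is purely for illustration):

```python
from typing import Callable, List, Sequence

def label_sequences(sequences: List[Sequence[float]],
                    classify: Callable[[Sequence[float]], str]) -> List[str]:
    """Feed each merged audio sequence to the classifier and collect its labels."""
    return [classify(seq) for seq in sequences]

def toy_classifier(seq: Sequence[float]) -> str:
    # Stand-in model: label by average sample magnitude.
    avg = sum(abs(x) for x in seq) / max(len(seq), 1)
    return "speech" if avg > 0.1 else "noise"
```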
Optionally, in some possible embodiments of the present application, when determining whether the target audio segment is a voiced segment, the processing module 302 is specifically configured to:
calculate a target audio energy value corresponding to the target audio segment;
and if the target audio energy value reaches a set audio energy threshold, determine that the target audio segment is a voiced segment.
Optionally, in some possible embodiments of the present application, when calculating the target audio energy value corresponding to the target audio segment, the processing module 302 is specifically configured to:
square the value of each sample point in the target audio segment, sum the squares, and take the summation result as the target audio energy value corresponding to the target audio segment.
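This energy measure is a plain sum of squares over the segment's sample points; a short sketch (NumPy is used here only for convenience, and the threshold value is the caller's choice):

```python
import numpy as np

def segment_energy(samples) -> float:
    """Square each sample point in the segment and sum the squares."""
    samples = np.asarray(samples, dtype=float)
    return float(np.sum(samples ** 2))

def is_voiced(samples, energy_threshold: float) -> bool:
    """A segment counts as voiced when its energy reaches the set threshold."""
    return segment_energy(samples) >= energy_threshold
```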
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to some embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in some embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to some embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.
The above description is only a few examples of the present application and is not intended to limit the present application, and those skilled in the art will appreciate that various modifications and variations can be made in the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (12)

1. A method of audio processing, the method comprising:
determining a target audio segment in an audio file with a preset sliding window, following the time order of the audio file;
if the target audio segment is determined to be a voiced segment and the recorded turn mark is valid audio, adding the target audio segment to a pre-configured buffer; wherein the turn mark is used for indicating whether the currently processed audio segment is valid audio or invalid audio.
2. The method of claim 1, wherein the method further comprises:
if the target audio segment is determined to be a voiced segment and the recorded turn mark is invalid audio, judging whether a recorded voiced count value is greater than a first threshold; wherein the voiced count value characterizes the number of voiced segments recorded consecutively in the invalid-audio state;
if the voiced count value is less than or equal to the first threshold, discarding the target audio segment and updating the voiced count value; wherein the updated voiced count value is used in the next execution of the step of judging whether the recorded voiced count value is greater than the first threshold;
if the voiced count value is greater than the first threshold, adding the target audio segment to the buffer, resetting the voiced count value, and updating the turn mark to valid audio.
3. The method of claim 1, wherein the method further comprises:
if the target audio segment is determined to be a silent segment and the turn mark is invalid audio, discarding the target audio segment.
4. The method of claim 1, wherein the method further comprises:
if the target audio segment is determined to be a silent segment and the turn mark is valid audio, updating a recorded silence count value; wherein the silence count value characterizes the number of silent segments recorded consecutively in the valid-audio state;
judging whether the updated silence count value is greater than a second threshold;
if the updated silence count value is less than or equal to the second threshold, adding the target audio segment to the buffer;
if the updated silence count value is greater than the second threshold, discarding the target audio segment, resetting the silence count value, and updating the turn mark to invalid audio.
5. The method of claim 4, wherein if the updated silence count value is greater than the second threshold, the method further comprises:
if the cumulative duration of all the audio segments stored in the buffer reaches a set first duration threshold, merging all the audio segments stored in the buffer into an audio sequence, and resetting the buffer.
6. The method of any one of claims 1-4, further comprising:
if the cumulative duration of all the audio segments stored in the buffer reaches a set second duration threshold, merging all the audio segments stored in the buffer into an audio sequence, and resetting the buffer.
7. The method of claim 6, wherein the method further comprises:
inputting the audio sequence into a pre-trained audio classification model, and obtaining an audio class label output by the audio classification model.
8. The method of claim 1, wherein the determining that the target audio segment is a voiced segment comprises:
calculating a target audio energy value corresponding to the target audio segment;
if the target audio energy value reaches a set audio energy threshold, determining that the target audio segment is a voiced segment.
9. The method of claim 8, wherein the calculating the target audio energy value corresponding to the target audio segment comprises:
squaring the value of each sample point in the target audio segment, summing the squares, and taking the summation result as the target audio energy value corresponding to the target audio segment.
10. An audio processing apparatus, characterized in that the apparatus comprises:
the sliding module is used for determining a target audio segment in the audio file with a preset sliding window, following the time order of the file;
the processing module is used for adding the target audio segment to a pre-configured buffer if the target audio segment is determined to be a voiced segment and the recorded turn mark is valid audio; wherein the turn mark is used for indicating whether the currently processed audio segment is valid audio or invalid audio.
11. An electronic device, comprising:
a memory for storing one or more programs;
a processor;
the one or more programs, when executed by the processor, implement the method of any of claims 1-9.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-9.
CN202011508130.8A 2020-12-18 2020-12-18 Audio processing method, device, electronic equipment and storage medium Active CN112614515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011508130.8A CN112614515B (en) 2020-12-18 2020-12-18 Audio processing method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112614515A true CN112614515A (en) 2021-04-06
CN112614515B CN112614515B (en) 2023-11-21

Family

ID=75240728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011508130.8A Active CN112614515B (en) 2020-12-18 2020-12-18 Audio processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112614515B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050171768A1 (en) * 2004-02-02 2005-08-04 Applied Voice & Speech Technologies, Inc. Detection of voice inactivity within a sound stream
CN1716380A (en) * 2005-07-26 2006-01-04 浙江大学 Audio frequency splitting method for changing detection based on decision tree and speaking person
CN102956230A (en) * 2011-08-19 2013-03-06 杜比实验室特许公司 Method and device for song detection of audio signal
CN108847217A (en) * 2018-05-31 2018-11-20 平安科技(深圳)有限公司 A kind of phonetic segmentation method, apparatus, computer equipment and storage medium
CN110648687A (en) * 2019-09-26 2020-01-03 广州三人行壹佰教育科技有限公司 Activity voice detection method and system
CN110910863A (en) * 2019-11-29 2020-03-24 上海依图信息技术有限公司 Method, device and equipment for extracting audio segment from audio file and storage medium
CN111510765A (en) * 2020-04-30 2020-08-07 浙江蓝鸽科技有限公司 Audio label intelligent labeling method and device based on teaching video
US20200258535A1 (en) * 2019-02-08 2020-08-13 Samsung Electronics Co., Ltd. System and method for continuous privacy-preserved audio collection
CN111540342A (en) * 2020-04-16 2020-08-14 浙江大华技术股份有限公司 Energy threshold adjusting method, device, equipment and medium


Also Published As

Publication number Publication date
CN112614515B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
US11670325B2 (en) Voice activity detection using a soft decision mechanism
CN109545190B (en) Speech recognition method based on keywords
CN105513589B (en) Speech recognition method and device
US9875739B2 (en) Speaker separation in diarization
US10573307B2 (en) Voice interaction apparatus and voice interaction method
CN108682420B (en) Audio and video call dialect recognition method and terminal equipment
JP5299436B2 (en) Voice detection device, voice detection program, and parameter adjustment method
CN111797632B (en) Information processing method and device and electronic equipment
US11120802B2 (en) Diarization driven by the ASR based segmentation
CN110264999B (en) Audio processing method, equipment and computer readable medium
CN110853648A (en) Bad voice detection method and device, electronic equipment and storage medium
CN112951211B (en) Voice awakening method and device
US11133022B2 (en) Method and device for audio recognition using sample audio and a voting matrix
CN111832308A (en) Method and device for processing consistency of voice recognition text
CN109994126A (en) Audio message segmentation method, device, storage medium and electronic equipment
US20170206904A1 (en) Classifying signals using feature trajectories
CN112420020A (en) Information processing apparatus and information processing method
CN109300474B (en) Voice signal processing method and device
US10468031B2 (en) Diarization driven by meta-information identified in discussion content
CN112614515A (en) Audio processing method and device, electronic equipment and storage medium
CN114758665B (en) Audio data enhancement method and device, electronic equipment and storage medium
CN114038487A (en) Audio extraction method, device, equipment and readable storage medium
CN111866289A (en) Outbound number state detection method and device and intelligent outbound method and system
CN112735392A (en) Voice processing method, device, equipment and storage medium
CN109817205B (en) Text confirmation method and device based on semantic analysis and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant