CN114827756B - Audio data processing method, device, equipment and storage medium - Google Patents

Audio data processing method, device, equipment and storage medium

Info

Publication number
CN114827756B
CN114827756B
Authority
CN
China
Prior art keywords
audio
processed
segment
audio data
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210471428.9A
Other languages
Chinese (zh)
Other versions
CN114827756A (en)
Inventor
曹启云
杨咏臻
黄佳维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210471428.9A
Publication of CN114827756A
Application granted
Publication of CN114827756B

Classifications

    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/2187: Live feed (source of audio or video content)
    • H04N 21/233: Processing of audio elementary streams (server side)
    • H04N 21/439: Processing of audio elementary streams (client side)
    • H04N 21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments

Abstract

The present disclosure provides an audio data processing method, apparatus, device and storage medium, relating to the technical field of artificial intelligence, and in particular to the fields of intelligent media and video streaming. The specific implementation scheme is as follows: acquiring audio data to be processed; performing voice activity detection (VAD) on the audio data to be processed to obtain a VAD result; searching for a first target audio segment in the audio data to be processed according to the VAD result; and segmenting the audio data to be processed according to the first target audio segment. In the embodiments of the present disclosure, a first target audio segment used for segmentation can be located in the audio data to be processed according to the VAD result, and the audio data to be processed can be segmented more flexibly according to the found first target audio segment, yielding a more accurate segmentation result.

Description

Audio data processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to the field of intelligent media and video streaming.
Background
With the popularization of fifth-generation (5G) communication technology, real-time video streaming has developed rapidly and is widely applied, and more and more users are engaging in live broadcasting. However, the content, duration and other aspects of many live broadcasts cannot be controlled in advance, and a broadcaster may touch on various sensitive topics during a live broadcast in order to increase traffic. The platform therefore needs effective means to monitor live content in real time, such as auditing the voice in a video stream, to prevent such behavior.
Disclosure of Invention
The present disclosure provides an audio data processing method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided an audio data processing method including:
acquiring audio data to be processed;
performing voice activity detection VAD on the audio data to be processed to obtain VAD results;
searching a first target audio segment in the audio data to be processed according to the VAD result;
and segmenting the audio data to be processed according to the first target audio segment.
According to another aspect of the present disclosure, there is provided an audio data processing apparatus including:
the acquisition module is used for acquiring audio data to be processed;
the VAD module is used for carrying out VAD on the audio data to be processed to obtain VAD results;
the searching module is used for searching a first target audio segment in the audio data to be processed according to the VAD result;
and the segmentation module is used for segmenting the audio data to be processed according to the first target audio segment.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above.
According to yet another aspect of the disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method according to any of the above.
According to yet another aspect of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method according to any of the above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow chart diagram of an audio data processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram of an audio data processing method according to another embodiment of the present disclosure;
FIG. 3 is a schematic flow chart diagram of an audio data processing method according to another embodiment of the present disclosure;
FIG. 4 is a schematic flow chart diagram of an audio data processing method according to another embodiment of the present disclosure;
FIG. 5 is a schematic flow chart diagram of an audio data processing method according to another embodiment of the present disclosure;
FIG. 6 is a schematic flow chart diagram of an audio data processing method according to another embodiment of the present disclosure;
FIG. 7 is a schematic flow chart diagram of an audio data processing method according to another embodiment of the present disclosure;
FIG. 8 is a schematic flow chart diagram of an audio data processing method according to another embodiment of the present disclosure;
fig. 9 is a schematic configuration diagram of an audio data processing apparatus according to an embodiment of the present disclosure;
fig. 10 is a schematic configuration diagram of an audio data processing apparatus according to another embodiment of the present disclosure;
fig. 11 is a schematic configuration diagram of an audio data processing apparatus according to another embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of a system to which the audio data processing method of the embodiment of the present disclosure is applied;
FIG. 13 is a schematic diagram of various caches;
fig. 14 is a schematic diagram of an exemplary flow of an audio data processing method according to an embodiment of the present disclosure;
fig. 15 is a block diagram of an electronic device for implementing an audio data processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic flow diagram of an audio data processing method according to an embodiment of the present disclosure. The method can comprise the following steps:
s101, acquiring audio data to be processed;
s102, carrying out Voice Activity Detection (VAD) on the audio data to be processed to obtain a VAD result;
s103, searching a first target audio segment in the audio data to be processed according to the VAD result;
and S104, segmenting the audio data to be processed according to the first target audio segment.
In the disclosed embodiment, the original audio data may contain analog signals such as voice and non-voice. By decoding the original audio data, the analog signal can be converted into a digital signal, such as a binary code, to obtain the audio data to be processed. Subsequent detection, searching, segmentation and other processing are then performed on the audio data to be processed. In one example, for data that may carry a voice signal, such as a media stream, the original audio data (which may also be referred to as audio stream data) can be extracted from the media stream data, and the original audio data can then be decoded to obtain the audio data to be processed. The media stream data may include live data generated in real time.
In embodiments of the present disclosure, VAD may also be referred to as voice endpoint detection, voice boundary detection, and the like. With VAD, it can be detected whether speech is in a silent state or an active state. The silent state may also be called a mute state, and the active state a non-mute state. By performing VAD on the audio data to be processed, the data in a mute state and/or the data in a non-mute state can be identified. That is, the VAD result may indicate which data in the audio data to be processed is in a mute state and which is in a non-mute state.
In the embodiment of the disclosure, a first target audio segment used for segmentation can be located in the audio data to be processed according to the VAD result, and the audio data to be processed can be segmented more flexibly according to the found first target audio segment, yielding a more accurate segmentation result. The segmented pieces therefore carry complete semantics. For example, in a scene requiring voice auditing, the audio data processing method of the embodiments of the present disclosure can reduce the probability of splitting a single sentence across different segments, thereby improving the accuracy of the subsequent voice audit performed on the resulting segments and reducing false reports. For another example, in a segmentation scene for very long videos, the method can likewise avoid, as far as possible, splitting the same sentence into different segments.
Fig. 2 is a flowchart illustrating an audio data processing method according to another embodiment of the present disclosure. The method of this embodiment includes one or more features of the audio data processing method embodiments described above. In a possible embodiment, the audio data to be processed is stored in a first storage unit and the VAD result is stored in a second storage unit. In the embodiment of the present disclosure, the storage unit may be a buffer, or may be a unit with a storage function, such as a queue.
In one possible implementation manner, in S103, searching for a first target audio segment in the audio data to be processed according to the VAD result includes:
s201, under the condition that the audio time corresponding to the audio data to be processed reaches a first time length according to the first storage unit and/or the second storage unit, searching a first target audio segment in the audio data to be processed according to the VAD result.
In the embodiment of the present disclosure, the audio duration of the audio data to be processed may gradually increase over time. For example, audio data to be processed is extracted from real-time media stream data, and when the extracted audio reaches a certain time threshold, for example 60 seconds (s), subsequent detection, searching and other processing may be performed on those 60 s of audio data to be processed. The first storage unit and/or the second storage unit may be a cache or a queue. The audio duration of the audio data to be processed, which accumulates continuously in the cache or queue, may be monitored to determine whether it reaches the set first duration.
In the disclosed embodiments, the first duration may represent a threshold value of the audio duration. The specific value of the first duration can be flexibly set according to requirements. For example, the first duration may be 60s, 120s, or the like, or may be other values, which is not limited in the embodiment of the present disclosure.
In the embodiment of the present disclosure, the audio data to be processed may be stored in the first storage unit; VAD is then performed on the audio data to be processed in the first storage unit, and the VAD result is stored in the second storage unit. Each time audio data to be processed is extracted, it is stored in the first storage unit, VAD is performed on it, and the VAD result is stored in the second storage unit. The data on which VAD is performed at one time may be regarded as one detection segment of the audio data to be processed. The length of the detection segment can be set as required; for example, its duration may be 10 milliseconds (ms), 20 ms, 30 ms, and so on. The VAD results in the second storage unit thus correspond to the detection segments of the audio data to be processed in the first storage unit.
In the embodiment of the present disclosure, the first storage unit may be detected to determine whether the audio duration of the audio data to be processed reaches the first duration, and the second storage unit may also be detected to determine whether the audio duration of the audio data to be processed reaches the first duration.
For example, the first storage unit is a first cache, the second storage unit is a second cache, and the storage upper limit of the first cache and/or the second cache is set to be a first duration. If it is detected that the first buffer and/or the second buffer reach the upper storage limit, it may indicate that the audio duration of the audio data to be processed reaches the first duration.
For another example, the first storage unit is a first queue, the second storage unit is a second queue, and the upper storage limit of the first queue and/or the second queue is set to be the first duration. If it is detected that the first queue and/or the second queue reaches the upper storage limit, it may indicate that the audio duration of the audio data to be processed reaches the first duration.
For another example, a time difference between data stored at the current time of the first storage unit and data stored at the start of the first storage unit is detected. If the time difference is detected to be a first time period, such as 60s, it can indicate that the audio time period of the audio data to be processed reaches the first time period.
For another example, the data amount from the VAD result stored at the current time in the second storage unit back to the VAD result stored at the starting point of the second storage unit is detected. If the detected amount of data reaches the amount that would be stored for a first duration, such as 60 s, it may indicate that the audio duration of the audio data to be processed has reached the first duration.
In the embodiment of the present disclosure, the to-be-processed audio data and the VAD result are respectively stored in the plurality of storage units, and the time duration of the to-be-processed audio data processed each time can be controlled according to a certain time duration, for example, the first time duration, so that the processing speed can be controlled, and the segmentation can be performed more flexibly.
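As a concrete illustration of the duration check, the following minimal Python sketch models the first storage unit as a byte buffer. All names are hypothetical; the patent leaves the storage unit open as a cache or a queue, and the PCM format here (16-bit mono at 16 kHz) is an assumption:

    FIRST_DURATION_S = 60               # first duration: audio to accumulate per pass
    SAMPLE_RATE = 16000                 # assumed PCM format: 16 kHz, 16-bit, mono
    BYTES_PER_SECOND = SAMPLE_RATE * 2  # 2 bytes per 16-bit sample

    class PcmBuffer:
        """Minimal stand-in for the first storage unit (a cache or a queue)."""

        def __init__(self) -> None:
            self.data = bytearray()

        def append(self, chunk: bytes) -> None:
            """Accumulate newly extracted audio data to be processed."""
            self.data += chunk

        def duration_s(self) -> float:
            """Audio duration currently accumulated, derived from the byte count."""
            return len(self.data) / BYTES_PER_SECOND

        def ready(self) -> bool:
            """True once the accumulated audio reaches the first duration."""
            return self.duration_s() >= FIRST_DURATION_S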
In the embodiment of the present disclosure, S201 may be a process of loop processing, and each time the audio duration of the audio data to be processed reaches the first duration, the first target audio segment is searched for in the audio data to be processed of the first duration according to the VAD result. And then performing segmentation by using the first target audio segment. Therefore, the method can be more suitable for scenes with increasing audio data to be processed, such as scenes for continuously extracting the audio data to be processed from media stream data such as live broadcast data.
In one possible implementation, the audio data to be processed is Pulse Code Modulation (PCM) data. The PCM process may include: sampling an analog signal, such as voice or image, at regular intervals to discretize it; rounding and quantizing the sampled values in hierarchical units; and representing the amplitude of each sampled pulse with a group of binary codes. In the disclosed embodiment, the PCM data may be obtained by performing PCM on the original audio data. The original audio data may also be decoded in other manners similar to PCM to obtain the data to be processed, which is not limited in this disclosure.
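As an illustration of how PCM data might be obtained from original audio data, the sketch below shells out to the FFmpeg command-line tool. FFmpeg is assumed to be installed, and the patent does not prescribe any particular decoder or parameters:

    import subprocess

    def decode_to_pcm(media_path: str, sample_rate: int = 16000) -> bytes:
        """Decode a media file's audio track to raw 16-bit mono PCM via ffmpeg."""
        cmd = [
            "ffmpeg", "-i", media_path,
            "-f", "s16le",            # raw signed 16-bit little-endian samples
            "-acodec", "pcm_s16le",   # PCM codec
            "-ac", "1",               # downmix to mono
            "-ar", str(sample_rate),  # resample, e.g. to 16 kHz
            "-",                      # write the raw stream to stdout
        ]
        return subprocess.run(cmd, capture_output=True, check=True).stdout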
In one possible implementation, as shown in fig. 3, in S102, performing voice activity detection VAD on the audio data to be processed to obtain a VAD result, including:
s301, reading a detection segment with a second duration from the audio data to be processed in the first storage unit each time, and performing VAD to obtain a silence flag of the detection segment, wherein the silence flag is used for indicating that the detection segment is in a silence state or a non-silence state;
s302, storing the mute mark of the detection segment into the second storage unit to obtain the VAD result.
In the disclosed embodiment, for example, the VAD result includes a mute flag for each detected segment of PCM data, the mute flag indicating whether the detected segment of PCM data is in a mute state or in a non-mute state. The second time period for detecting the segment per reading may be set in advance. For example, the second duration of the detection segment may be 10 milliseconds (ms), 20ms, 30ms, and so on. If the total time length of the audio data to be processed is 60s and the time length of the detection segment read each time is 30ms, the number of the detection segments is 2000. If VAD is performed for each detection segment resulting in 1 mute flag, 2000 mute flags may be included in the second storage unit.
In the embodiment of the present disclosure, S301 and S302 may be a process of loop processing, and for the audio data to be processed of the first duration, the detection segments of the second duration may be processed each time until the audio data to be processed of the first duration is completely processed.
In the disclosed embodiment, a mute flag may be used to indicate whether a detected segment is in a mute state or in a non-mute state. For example, the mute flag of the detection segment L1 is 1, which indicates that L1 is in a mute state, and the mute flag of the detection segment L1 is 0, which indicates that L1 is in a non-mute state. For another example, a mute flag of detection segment L1 is 0 indicating that L1 is in a mute state, and a mute flag of detection segment L1 is 1 indicating that L1 is in a non-mute state. The specific value and the corresponding meaning of the mute flag can be flexibly set according to the requirement, and are not limited in the embodiment of the disclosure.
In the embodiment of the disclosure, the audio data to be processed is detected in a segmented manner according to a certain time length, so that an accurate VAD result can be obtained, the detection precision, accuracy, processing speed and the like can be adjusted by flexibly setting the time length of the detection segment, and the method and the device can be suitable for richer application scenarios.
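A sketch of S301 and S302 under the stated frame lengths, using the open-source webrtcvad package as a stand-in detector; the patent does not mandate any particular VAD algorithm, and the sample rate and 30 ms detection segment are assumptions:

    import webrtcvad  # pip install webrtcvad; one possible detector, not mandated

    FRAME_MS = 30        # second duration: length of one detection segment
    SAMPLE_RATE = 16000  # assumed 16-bit mono PCM at 16 kHz

    def vad_flags(pcm: bytes, aggressiveness: int = 2) -> list[int]:
        """Return one mute flag per detection segment: 1 = mute, 0 = non-mute."""
        vad = webrtcvad.Vad(aggressiveness)               # 0 (lenient) .. 3 (strict)
        frame_bytes = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per sample
        flags = []
        for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
            frame = pcm[i:i + frame_bytes]                # one detection segment
            flags.append(0 if vad.is_speech(frame, SAMPLE_RATE) else 1)
        return flags                                      # the stored VAD result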
In one possible implementation, as shown in fig. 4, in S103 or S201, searching for a first target audio segment in the audio data to be processed according to the VAD result includes:
s401, according to the silence flag of the detection segment in the audio data to be processed included in the VAD result, searching a first target audio segment in the audio data to be processed, where the first target audio segment includes a detection segment in a silence state or a plurality of detection segments in consecutive silence states in the audio data to be processed.
In the embodiment of the present disclosure, if the audio data to be processed includes a plurality of detection segments, performing VAD on each detection segment yields a mute flag for each of them. For example, the audio data to be processed includes N detection segments, of which M are in a mute state, where M is a positive integer less than or equal to N. If only one detection segment in a mute state is found, i.e. M equals 1, that detection segment can serve as the found first target audio segment. If detection segments in a mute state are found at multiple places, i.e. M is greater than 1, the first target audio segment may be determined based on those M detection segments. If the M mute detection segments are consecutive, they can together form one first target audio segment. If they are not consecutive, the first target audio segment can be determined according to the lengths of the consecutive runs among them. For example, the M detection segments fall into two consecutive runs of mute detection segments, one of M1 segments and one of M2 segments, where M1 and M2 are positive integers greater than or equal to 1 and M equals the sum of M1 and M2. If the duration of the M2 run is greater than that of the M1 run, the M2 run may be taken as the first target audio segment.
In the embodiment of the disclosure, according to the mute flag of one or more detection segments in the audio data to be processed, the determined first target audio segment may include one detection segment in a mute state or multiple detection segments in a continuous mute state, and the segmentation may be performed based on the first target audio segment in the mute state in the audio data to be processed, so as to obtain a more accurate segmentation result, and reduce redundancy.
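Under the flag convention above (1 = mute), the search in S401 reduces to finding runs of consecutive mute flags; a minimal sketch, with hypothetical names:

    def longest_silent_run(flags: list[int]) -> tuple[int, int] | None:
        """Return (start, end) detection-segment indices (end exclusive) of the
        longest run of consecutive mute flags, or None if nothing is mute."""
        best = None
        run_start = None
        for i, flag in enumerate(list(flags) + [0]):  # trailing 0 closes a final run
            if flag == 1 and run_start is None:
                run_start = i                          # a mute run begins
            elif flag == 0 and run_start is not None:
                if best is None or i - run_start > best[1] - best[0]:
                    best = (run_start, i)              # longest run so far
                run_start = None
        return best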
In one possible implementation, as shown in fig. 5, in S401, searching for a first target audio segment in the to-be-processed audio data according to a silence flag of a detected segment in the to-be-processed audio data included in the VAD result includes:
s501, a silence segment is searched in a first audio time range corresponding to the audio data to be processed to serve as the first target audio segment, the silence segment comprises a detection segment in a silence state or a plurality of continuous detection segments in the silence state, and the audio time length corresponding to the first audio time range is smaller than the audio time length corresponding to the audio data to be processed.
In the embodiment of the present disclosure, when the audio data to be processed in the first storage unit reaches the first duration, the first target audio segment may be searched for from the starting point of the audio data, or from a certain set time point. The time range that is preferentially searched, i.e. the first audio time range, may be determined based on the time point at which the search starts. The first audio time range may be a portion of the first duration of audio data to be processed, so the audio duration corresponding to the first audio time range is smaller than the first duration corresponding to the audio data to be processed. For example, the first audio time range may lie in the second half, the first half, or the middle of the audio data to be processed. If no satisfactory first target audio segment is found within the first audio time range, the search can continue in other time ranges of the audio data to be processed. Preferentially searching within a certain time range allows the position of the first target audio segment to be determined quickly, increasing the search speed and the overall data processing speed.
In a possible implementation, if a first audio time point is taken as the time point at which the search starts, the first audio time range may run from the first audio time point of the audio data to be processed to its end point. For example, if the total duration to be processed, i.e. the first duration, is 60 s and the first audio time point is 40 s, the first audio time range may be 40 s to 60 s. For another example, if the first duration is 120 s and the first audio time point is 60 s, the first audio time range may be 60 s to 120 s. Setting the preferentially searched first audio time range to run from a certain time point to the end of the audio data to be processed makes it possible to quickly find a first target audio segment near the end of the audio data. When the audio data to be processed is handled in a loop, this reduces repeated detection and finds the first target audio segment faster and more reasonably. For example, after the first segmentation, the starting point of the audio data remaining in the first storage unit is non-mute data; at the second segmentation there is then no need to search again from the starting point of the audio data to be processed, which reduces repeated detection and improves detection efficiency.
In a possible implementation manner, in S501, searching a silence segment as the first target audio segment in a first audio time range corresponding to the audio data to be processed includes: and under the condition that a plurality of mute sections exist in the first audio time range, determining the corresponding mute section with the longest audio time length as the first target audio section. For example, if the first duration of the audio data to be processed is 60s, the plurality of silence segments M1, M2, and M3 are found at 40s to 60s in the first audio time range. M1 has a duration of 3s, M2 has a duration of 10s, and M3 has a duration of 1s, so that M2 having the longest duration can be reserved as the subsequent first target audio piece. The mute segment with the longest duration in the first audio time range is reserved, and subsequently reasonable segmentation can be performed on the basis of the mute segment with the longest duration in the first audio time range, so that redundant data can be reduced to a greater extent.
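A usage sketch of the M1/M2/M3 example, reusing longest_silent_run from the previous block; with the assumed 30 ms detection segments, a 60 s pass yields 2000 flags and the 40 s to 60 s range maps roughly to indices 1333 to 2000:

    flags = [0] * 2000             # 60 s of 30 ms detection segments
    flags[100:200] = [1] * 100     # an M1-like 3 s silence outside the range
    flags[1400:1733] = [1] * 333   # an M2-like ~10 s silence inside 40 s..60 s
    flags[1900:1933] = [1] * 33    # an M3-like 1 s silence inside 40 s..60 s

    lo, hi = 1333, 2000            # first audio time range, as segment indices
    run = longest_silent_run(flags[lo:hi])
    if run is not None:
        start, end = run[0] + lo, run[1] + lo  # absolute indices; M2 wins here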
In one possible implementation manner, in S401, searching for a first target audio segment in the audio data to be processed according to a silence flag of a segment detected in the audio data to be processed included in the VAD result includes:
s502, under the condition that no mute segment exists in a first audio time range corresponding to the audio data to be processed, searching the first target audio segment in a second audio time range corresponding to the audio data to be processed, wherein the mute segment comprises a detection segment in a mute state or a plurality of continuous detection segments in the mute state.
In the embodiment of the present disclosure, the audio data to be processed may include a first audio time range, and may also include a second audio time range. The first audio time range and the second audio time range are different and may or may not have boundary points. The second audio time range may be lower in lookup priority than the first audio time range.
In the embodiment of the present disclosure, the first audio time range and the second audio time range may be divided in various ways. In one mode, a certain time point in the first duration of audio data to be processed serves as the dividing position: one part is the first audio time range and the other part is the second audio time range. For example, the first duration is 60 s and the division is at 40 s, so 0 s to 40 s is the second audio time range and 40 s to 60 s is the first audio time range. In another mode, the first and second audio time ranges include neither the end points of the audio data to be processed nor a common boundary point. For example, the first duration is 60 s, the 10th to 30th s form the second audio time range, and the 35th to 50th s form the first audio time range. These divisions are merely examples and may be chosen according to the requirements of the actual application scenario. If no suitable first target audio segment exists in the first audio time range, the search range can be expanded to the second audio time range to obtain a more suitable first target audio segment.
In one possible embodiment, the first audio time range and the second audio time range may have the first audio time point as a dividing point. The first audio time range includes a time range from a first audio time point of the audio data to be processed to an end point of the audio data to be processed. The second audio time range includes a time range from a first audio time point of the to-be-processed audio data to a start point of the to-be-processed audio data. For example, the first duration of the audio data to be processed that needs to be processed is 60s, the first audio time point is 40s, the first audio time range may be 40s to 60s, and the second audio time range may be 0s to 40s. For another example, the first duration of the audio data to be processed is 120s, the first audio time point is 60s, the first audio time range may be 60s to 120s, and the second audio time range may be 0s to 60s. The time range of the second audio frequency searched for in a supplementing mode is set to be from a certain time point to the starting point of the audio frequency data to be processed, under the condition that the searching result in the first audio frequency time range is not appropriate, the other time ranges of the audio frequency data to be processed are used for searching in an assisting mode, the more appropriate first target audio frequency fragment is obtained, and then subsequent segmentation is achieved.
In a possible implementation manner, in S502, searching for the first target audio segment within the second audio time range corresponding to the audio data to be processed includes: within the second audio time range, taking the first silence segment found forward (toward the start) from the first audio time point as the first target audio segment, where the first audio time range is the time range from the first audio time point to the end point of the audio data to be processed, and the second audio time range is the time range from the first audio time point to the starting point of the audio data to be processed. This improves search efficiency and retains more of the first duration of audio data to be processed. Alternatively, the detection may proceed backward from the beginning of the second audio time range to find all silence segments within it and keep the one with the longest duration, which helps reduce redundancy to the greatest extent.
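A sketch of the S502 fallback under the same flag convention: starting at the first audio time point and moving toward the start of the data (the "forward" direction in the patent's wording), return the first silence run encountered; names are hypothetical:

    def first_silence_before(flags: list[int], point_idx: int) -> tuple[int, int] | None:
        """Scan from point_idx toward index 0 and return the first run of mute
        detection segments encountered, as (start, end) with end exclusive."""
        i = point_idx - 1
        while i >= 0:
            if flags[i] == 1:            # entered a mute run; find its edges
                end = i + 1
                while i >= 0 and flags[i] == 1:
                    i -= 1
                return (i + 1, end)
            i -= 1                       # still in non-mute data; keep going
        return None                      # no silence in the second audio time range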
In a possible implementation manner, in S401, searching for a first target audio segment in the audio data to be processed according to a silence flag of a segment detected in the audio data to be processed included in the VAD result includes:
s503, under the condition that a detection segment to which a second audio time point corresponding to the audio data to be processed belongs is in a mute state, searching for an end point of the first target audio segment in a third audio time range corresponding to the audio data to be processed, searching for a start point of the first target audio segment in a fourth audio time range corresponding to the audio data to be processed, and determining the first target audio segment based on the end point and the start point, wherein the third audio time range is a time range from the second audio time point to the end point of the audio data to be processed, and the fourth audio time range is a time range from the second audio time point to the start point of the audio data to be processed.
In the embodiment of the present disclosure, S503 may be executed in combination with S501 and/or S502, and S503 may also be executed independently from S501 and/or S502. For example, S501 is executed first, and then S503 is executed. For another example, S502 is executed first, and then S503 is executed. For another example, S501 and S502 are executed first, and then S503 is executed. For another example, only S503 is performed, and S501 and S502 are not performed. The second audio time point may be the same as or different from the first audio time point. The third audio time range may be the same as or different from the first audio time range. The fourth audio time range may be the same as or different from the second audio time range.
In the embodiment of the present disclosure, if the third audio time range and the fourth audio time range are divided with the second audio time point as the dividing point, and the detection segment containing the second audio time point is in a mute state, then detection segments in a mute state contiguous with that segment may exist on both sides of the second audio time point. In this case, searching forward from the second audio time point for the starting point of the first target audio segment and backward for its end point can extend the time range of the first target audio segment, reducing data redundancy to a greater extent. For example, the first duration of the audio data to be processed is 60 s, the second audio time point is 40 s, the third audio time range may be 40 s to 60 s, and the fourth audio time range may be 0 s to 40 s. If the detection segment containing the 40th s is in a mute state, searching forward from the 40th s within 0 s to 40 s may find non-mute data at the 20th s, and searching backward from the 40th s within 40 s to 60 s may find non-mute data at the 50th s. In this case, the first target audio segment may consist of the consecutive mute detection segments from the 20th s to the 50th s.
In one possible implementation manner, in S503, the end point of the first target audio segment is found in the third audio time range of the audio data to be processed, which includes one of the following:
in the third audio time range, taking the starting point of the first detection segment in a non-mute state found backward from the second audio time point as the end point of the first target audio segment;
and taking the end point of the third audio time range as the end point of the first target audio segment when the detection segment in the non-mute state is not found in the third audio time range.
For example, the first time duration of the audio data to be processed that needs to be processed is 60s, the second audio time point is 40s, and the third audio time range may be 40s to 60s. Looking up backward from the 40 th s of the second audio time point, if the starting point of the first detected section of the non-mute state found first in the 40 th to 60 th s is 45 th s, the 45 th s can be taken as the end point of the first target audio section. If the detected segment in the non-mute state is not found in the 40 th to 60 th s, which indicates that the detected segments in the 40 th to 60 th s are all in the mute state, the 60 th s can be taken as the end point of the first target audio segment.
In the embodiment of the present disclosure, the starting point of the detected segment in the first non-mute state found backward from the second audio time point in the third audio time range is used as the end point of the first target audio segment, and the end point of the first target audio segment can be quickly determined. Further, if all the detected segments of the third audio time range are in a mute state, the end of the third audio time range is taken as the end of the first target audio segment, and the range of the first target audio segment can be maximized.
In one possible implementation manner, in S503, the searching for the starting point of the first target audio segment in the fourth audio time range of the audio data to be processed includes one of the following:
in the fourth audio time range, taking the end point of the first detection segment in a non-mute state found forward from the second audio time point as the starting point of the first target audio segment;
and taking the starting point of the fourth audio time range as the starting point of the first target audio segment when the detected segment in the non-mute state is not found in the fourth audio time range.
In the above example, the fourth audio time range may be 0 s to 40 s. Searching forward from the second audio time point at the 40th s, if the end point of the first detection segment in a non-mute state found in the 0th s to the 40th s is the 25th s, the 25th s can be taken as the starting point of the first target audio segment. If no detection segment in a non-mute state is found in the 0th s to the 40th s, indicating that the detection segments there are all in a mute state, the 0th s can be taken as the starting point of the first target audio segment.
There is no fixed execution order among S501, S502, and S503; it can be set flexibly according to requirements.
In the embodiment of the present disclosure, taking the end point of the first non-mute detection segment found in the fourth audio time range, searching forward from the second audio time point, as the starting point of the first target audio segment allows the starting point to be determined quickly. Further, if all detection segments in the fourth audio time range are in a mute state, taking the start of the fourth audio time range as the start of the first target audio segment maximizes the range of the first target audio segment.
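The two searches of S503 can be sketched as one expansion around the detection segment containing the second audio time point, under the same flag convention; when no non-mute segment bounds the run, the expansion naturally stops at the range ends, matching the boundary cases above:

    def expand_silence_around(flags: list[int], point_idx: int) -> tuple[int, int] | None:
        """If the segment at point_idx is mute, grow the first target audio
        segment both ways; returns (start, end) with end exclusive, else None."""
        n = len(flags)
        if point_idx >= n or flags[point_idx] != 1:
            return None
        start = point_idx
        while start > 0 and flags[start - 1] == 1:
            start -= 1      # search toward the start for the starting point
        end = point_idx + 1
        while end < n and flags[end] == 1:
            end += 1        # search toward the end for the end point
        return (start, end)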
In one possible implementation manner, in S104, segmenting the audio data to be processed according to the first target audio segment includes: and segmenting the audio data to be processed based on the starting point and the end point of the first target audio segment.
For example, if the duration of the first target audio segment determined according to any of the above manners is less than the first duration, the audio data to be processed of the first duration may be segmented based on the start point and the end point of the first target audio segment. Therefore, the segmented second target audio fragment has complete semantics, and redundancy is reduced to a greater extent.
In one possible implementation, as shown in fig. 6, the segmenting the audio data to be processed based on the start point and the end point of the first target audio segment includes:
s601, segmenting the audio data to be processed in the first storage unit at the starting point and the ending point of the first target audio segment;
s602, storing data before the starting point of the first target audio clip as a second target audio clip;
s603, deleting the first target audio clip from the first storage unit;
s604, continuing to accumulate the audio duration of the audio data to be processed in the first storage unit from the end point of the first target audio clip.
There is no fixed execution order among S602, S603, and S604; it can be set flexibly according to requirements.
In the embodiment of the present disclosure, the audio data to be processed of the first duration in the first storage unit may be segmented by using the start point and the end point of the first target audio segment. The data preceding the start of the first target audio piece is saved to a set storage location as a second target audio piece, e.g., a PCM piece. And deleting the first target audio segment, and keeping the data after the end point of the first target audio segment in the first storage unit to accumulate the next audio data to be processed.
For example, if the first duration is 60 s and the first target audio segment spans 25 s to 45 s, the data from 0 s to 25 s is saved as a second target audio segment, and the data from 45 s to 60 s is kept in the first cache or first queue for further accumulation. The 45th second then becomes the 0th second, i.e. the starting point of the audio data to be processed in the next pass; once another first duration of 60 s has accumulated, the first target audio segment is searched for in the first cache or first queue again and segmentation is performed. Segmenting the audio data to be processed in the first storage unit at the start and end points of the first target audio segment, and saving the data before the starting point as a second target audio segment, gives the second target audio segment more complete semantics, while deleting the first target audio segment greatly reduces redundancy.
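A sketch of S601 to S604 on a byte buffer, assuming 16-bit mono PCM and the detection-segment indices produced by the VAD sketches above (frame_bytes is the byte length of one detection segment):

    def split_buffer(pcm: bytearray, seg: tuple[int, int],
                     frame_bytes: int) -> tuple[bytes, bytearray]:
        """Cut the first storage unit at the first target audio segment seg,
        given as (start, end) detection-segment indices with end exclusive.
        Returns (second_target_audio, remaining_buffer)."""
        start_b, end_b = seg[0] * frame_bytes, seg[1] * frame_bytes
        second_target = bytes(pcm[:start_b])  # S602: data before the silence
        remainder = bytearray(pcm[end_b:])    # S604: keep accumulating after it
        # S603: the silence itself, pcm[start_b:end_b], is simply dropped
        return second_target, remainder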
In one possible embodiment, the method further comprises one of:
deleting the audio data to be processed in a first storage unit under the condition that all detection segments of the audio data to be processed are in a mute state;
and under the condition that all the detected sections of the audio data to be processed are in a non-mute state, storing the audio data to be processed in the first storage unit as a second target audio section.
For example, if the duration of the first target audio segment determined in any of the above manners equals the first duration, e.g. 60 s, then all detection segments of the first duration of audio data to be processed are in a mute state, and that audio data may be deleted from the first storage unit, such as the first cache or first queue. Deleting audio data that is entirely in a mute state from the first storage unit removes redundant data to a large extent.
For another example, if no first target audio segment can be found in any of the above manners, all detection segments of the first duration of audio data to be processed are in a non-mute state, and the data may be segmented directly at its end point. The first duration of audio data to be processed is extracted from the first storage unit, such as the first cache or first queue, and stored as a second target audio segment in a set storage location, such as a database. After the second target audio segment is saved, the first duration of audio data may be deleted from the first storage unit, and accumulation of the next batch of audio data to be processed resumes. Saving audio data that is entirely in a non-mute state as a second target audio segment preserves the integrity of the speech to a greater extent, which helps improve the accuracy of subsequent voice auditing.
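The two all-mute / all-non-mute cases can be handled before the normal search; a sketch under the same conventions (1 = mute), with hypothetical names:

    def handle_uniform_pass(pcm: bytearray, flags: list[int]) -> bytes | None:
        """Handle a full first-duration pass whose detection segments are
        uniform. Returns the saved second target audio segment, or None."""
        if flags and all(f == 1 for f in flags):
            pcm.clear()        # all mute: delete, nothing worth auditing
            return None
        if flags and all(f == 0 for f in flags):
            whole = bytes(pcm) # all non-mute: cut at the end point, keep it whole
            pcm.clear()
            return whole
        return None            # mixed: fall through to the search-and-split path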
In the embodiment of the present disclosure, after the second target audio segment is extracted, the data corresponding to the second target audio segment and to the first target audio segment may be deleted from the storage units. The deletion operations on the first storage unit and the second storage unit can be performed in linkage. For example, if the first duration of audio data to be processed is deleted from the first storage unit, the corresponding VAD result in the second storage unit is also deleted. For another example, if the first target audio segment and the data before it are deleted from the first storage unit, the corresponding VAD results in the second storage unit are also deleted.
In one possible embodiment, as shown in fig. 7, the method further comprises:
s701, obtaining playable audio data corresponding to a second target audio clip of the audio data to be processed from the original audio data stored in the third storage unit.
In the disclosed embodiments, decoding media stream data may result in raw video data (which may be referred to as video stream data) and raw audio data (which may be referred to as audio stream data). Wherein the audio stream data comprises playable audio data. In the storage stage, the audio data to be processed converted from the original audio data may be stored in the first storage unit, the VAD result of the audio data to be processed may be stored in the second storage unit, and the original audio data may be copied to the third storage unit. For example, playable audio data corresponding to the audio data to be processed in the first buffer may be stored in the third buffer. As another example, playable audio data corresponding to the pending audio data in the first queue may be saved in the third queue. Therefore, richer data resources can be saved, the playable audio data can be conveniently utilized for auxiliary audit, and a more accurate audit result is obtained.
In the embodiment of the present disclosure, the actions of the first storage unit, the second storage unit, and the third storage unit may be performed in linkage. For example, if the second target audio segment includes the 25 th s to 45 th s of the audio data to be processed of the first duration in the first storage unit, the 25 th s to 45 th s of the original audio data of the first duration in the third storage unit may be sliced into playable audio data correspondingly. Then, the playable audio data obtained by segmentation can be stored in the set storage position corresponding to the second target audio segment.
In the embodiment of the present disclosure, the playable audio data corresponding to the second target audio segment is extracted for later use as auxiliary data in auditing. After the second target audio segment and the playable audio data have been extracted, the data already stored in the storage units can be deleted, reducing occupation of storage space such as memory and increasing processing speed. For example, if the first duration of audio data to be processed is deleted from the first storage unit, the corresponding VAD result in the second storage unit and the corresponding original audio data in the third storage unit are deleted as well. For another example, if the first target audio segment and the data before it are deleted from the first storage unit, the corresponding VAD results in the second storage unit and the corresponding original audio data in the third storage unit are also deleted.
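One way to keep the three storage units aligned for linked deletion is to index all of them by detection segment; this is purely illustrative, since the patent does not fix the internal layout of the third storage unit (encoded audio is not, in general, fixed-size per segment):

    class LinkedStores:
        """Sketch of linked deletion across the three storage units."""

        def __init__(self) -> None:
            self.pcm_frames: list[bytes] = []  # first storage unit: decoded PCM
            self.flags: list[int] = []         # second storage unit: VAD result
            self.original: list[bytes] = []    # third storage unit: playable audio

        def drop_first(self, n: int) -> None:
            """Delete the first n detection segments from every store in lockstep."""
            del self.pcm_frames[:n]
            del self.flags[:n]
            del self.original[:n]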
In one possible implementation, as shown in fig. 8, S101 includes:
s801, extracting the audio data to be processed from the media stream data through a process, wherein different media stream data are processed through different processes.
First, original audio data may be extracted from media stream data by a process, and different media stream data may be processed by different processes. For example, the original audio data is extracted from the media stream using a multimedia video processing tool such as FFmpeg ("fast forward" MPEG, where MPEG refers to the Moving Picture Experts Group) via its API (Application Programming Interface).
Then, the original audio data can be decoded by the process to obtain the audio data to be processed, so that audio data to be processed is extracted from different media stream data by different processes. In this way, the concurrent processing capability of processes can be used to handle multiple media streams in parallel, reducing the occupation of computing resources and increasing the speed of processing media stream data. Further, the number of real-time media streams processed concurrently may be increased.
In one possible implementation, the process includes a plurality of threads, such as a first thread, a second thread, and a third thread. Different threads exchange data through the storage units and may be responsible for different functions. For example, the first thread may be primarily responsible for codec processing, the second thread for VAD, and the third thread for segmentation. The three threads can exchange data through storage units such as caches or queues, for example through the first storage unit, the second storage unit, and the third storage unit.
In a possible implementation manner, the first thread is configured to extract original audio data from media stream data, perform decoding processing on the original audio data, store the audio data to be processed obtained after decoding in the first storage unit, and store the original audio data in the third storage unit. For example, the first thread may perform encoding and decoding processing on the media stream data, store the encoded original audio data in the third storage unit, and store the audio data to be processed, such as PCM data, obtained by decoding the original audio data in the first storage unit.
In a possible embodiment, the second thread is configured to read the audio data to be processed from the first storage unit for VAD, and store the VAD result in the second storage unit.
In a possible implementation manner, the third thread is configured to, when it is detected that the audio duration of the audio data to be processed stored in the first storage unit reaches a first duration, search a first target audio segment in the audio data to be processed according to the VAD result, and segment the audio data to be processed according to the first target audio segment.
Referring to the above steps of fig. 1, a first thread may perform S101, a second thread may perform S102, and a third thread may perform S103 and S104. Referring to the above-described step of fig. 2, the third thread may perform S201. Referring to the steps of fig. 3 described above, the second thread may perform S301 and S302. Referring to the above-described step of fig. 4, the third thread may perform S401. Referring to the steps of fig. 5 described above, the third thread may perform at least one of S501, S502, and S503. Referring to the steps of fig. 6 described above, the third thread may perform S601 to S604. Referring to the steps of fig. 7 described above, the first thread may perform S701, and the third thread may perform S702. Referring to the above-described step of fig. 8, the first thread may perform S801.
In addition, the process can also comprise a main thread, which is used for detecting whether the three threads are alive and rebuilding any thread that is not.
In the embodiment of the disclosure, the resource utilization rate can be improved and the blocking can be reduced by a multithreading mode in the process.
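For illustration only, the in-process pipeline might be organized as sketched below; decode_stream, vad_flag, and try_slice are assumed placeholder helpers standing in for the codec, VAD, and slicing logic, and the buffer layout is one possible arrangement of the three storage units.

    import queue
    import threading
    import time

    intake_q = queue.Queue()   # hands decoded segments from thread 1 to thread 2
    raw_q = queue.Queue()      # third storage unit: original (playable) audio
    lock = threading.Lock()
    pcm_buf = []               # first storage unit: decoded PCM detection segments
    flag_buf = []              # second storage unit: one silence flag per segment

    def first_thread(stream_url):               # codec: media stream -> PCM
        for raw_packet, pcm_segment in decode_stream(stream_url):
            raw_q.put(raw_packet)
            intake_q.put(pcm_segment)

    def second_thread():                        # VAD: PCM -> silence flags
        while True:
            seg = intake_q.get()
            flag = vad_flag(seg)                # 0 = non-mute, 1 = mute
            with lock:
                pcm_buf.append(seg)
                flag_buf.append(flag)

    def third_thread(first_duration_segments):  # slicer: cut at first duration
        while True:
            with lock:
                if len(flag_buf) >= first_duration_segments:
                    try_slice(pcm_buf, flag_buf, raw_q)
            time.sleep(0.05)

    # 3000 detection segments of 20 ms each correspond to a 60 s first duration.
    for fn, args in [(first_thread, ("rtmp://example/live",)),
                     (second_thread, ()),
                     (third_thread, (3000,))]:
        threading.Thread(target=fn, args=args, daemon=True).start()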
Fig. 9 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present disclosure, which may include:
an obtaining module 901, configured to obtain audio data to be processed;
a VAD module 902, configured to perform VAD on the audio data to be processed to obtain a VAD result;
a searching module 903, configured to search a first target audio segment in the audio data to be processed according to the VAD result;
a segmentation module 904, configured to segment the audio data to be processed according to the first target audio segment.
In the embodiment of the disclosure, the first target audio segment for segmentation can be searched for the audio data to be processed according to the VAD result, and the audio data to be processed can be more flexibly segmented according to the searched first target audio segment, so that a more accurate segmentation result is obtained.
Fig. 10 is a schematic structural diagram of an audio data processing apparatus according to another embodiment of the present disclosure. The apparatus of this embodiment comprises one or more features of the embodiments of the audio data processing apparatus described above. In a possible embodiment, the audio data to be processed is stored in a first storage unit, and the VAD result is stored in a second storage unit;
the searching module 903 is configured to search a first target audio segment in the audio data to be processed according to the VAD result when it is determined that the audio time corresponding to the audio data to be processed reaches a first time length according to the first storage unit and/or the second storage unit.
In the embodiment of the present disclosure, the to-be-processed audio data and the VAD result are respectively stored in the plurality of storage units, and the time duration of the to-be-processed audio data processed each time can be controlled according to a certain time duration, for example, the first time duration, so that the processing speed can be controlled, and the segmentation can be performed more flexibly.
In a possible implementation manner, the VAD module 902 is configured to read a detection segment of a second duration from the audio data to be processed in the first storage unit each time to perform VAD, so as to obtain a silence flag of the detection segment, where the silence flag is used to indicate that the detection segment is in a silence state or a non-silence state; and storing the mute mark of the detection segment into the second storage unit to obtain the VAD result.
In the embodiment of the disclosure, the audio data to be processed is detected in a segmented manner according to a certain time length, so that an accurate VAD result can be obtained, the detection precision, accuracy, processing speed and the like can be adjusted by flexibly setting the time length of the detection segment, and the method and the device can be suitable for richer application scenarios.
In a possible implementation manner, the searching module 903 is configured to search for a first target audio segment in the audio data to be processed according to the silence flags, included in the VAD result, of the detected segments in the audio data to be processed, where the first target audio segment includes one detected segment in a silence state or a plurality of detected segments in consecutive silence states in the audio data to be processed.
In the embodiment of the disclosure, according to the mute flag of one or more detection segments in the audio data to be processed, the determined first target audio segment may include one detection segment in a mute state or multiple detection segments in a continuous mute state, and the segmentation may be performed based on the first target audio segment in the mute state in the audio data to be processed, so as to obtain a more accurate segmentation result, and reduce redundancy.
In one possible implementation, as shown in fig. 10, the searching module 903 includes: the first searching submodule 9031 is configured to search for a mute segment within a first audio time range corresponding to the audio data to be processed as the first target audio segment, where the mute segment includes a detection segment in a mute state or multiple detection segments in continuous mute states, and the audio duration corresponding to the first audio time range is smaller than the audio duration corresponding to the audio data to be processed. Therefore, the first target audio segment is preferentially searched for within a certain time range, which helps to quickly determine its position, speeding up the search and the overall data processing.
In a possible implementation manner, the first searching submodule 9031 is configured to, in a case that multiple silence segments exist in the first audio time range, determine, as the first target audio segment, a corresponding silence segment with the longest audio duration. Therefore, the mute segment with the longest duration in the first audio time range is reserved, and subsequently reasonable segmentation can be carried out on the basis of the mute segment with the longest duration in the first audio time range, and redundant data can be reduced to a greater extent.
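As a minimal sketch of this search, assuming the VAD result is a flat list of per-segment silence flags (1 for mute) and the first audio time range is expressed in detection-segment indices (with 20 ms segments, the 40th to 60th second corresponds to indices 2000 to 3000):

    # Find the longest run of consecutive mute detection segments inside
    # [range_start, range_end) of the flag list (1 = mute). Returns a
    # (start, end) pair of segment indices, or None if there is no silence.
    def longest_silence(flags, range_start, range_end):
        best = None
        run_start = None
        for i in range(range_start, range_end + 1):
            silent = i < range_end and flags[i] == 1
            if silent and run_start is None:
                run_start = i                       # a mute run begins here
            elif not silent and run_start is not None:
                if best is None or i - run_start > best[1] - best[0]:
                    best = (run_start, i)           # keep the longest run so far
                run_start = None
        return best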
In a possible implementation, the searching module 903 further includes: the second searching submodule 9032 is configured to, when no mute segment exists in the first audio time range corresponding to the audio data to be processed, search for the first target audio segment within a second audio time range corresponding to the audio data to be processed, where a mute segment includes a detection segment in a mute state or multiple detection segments in continuous mute states. Therefore, if the first target audio segment does not exist in the first audio time range of the audio data to be processed, the search range can be expanded, and searching within the second audio time range of the audio data to be processed facilitates obtaining the first target audio segment.
In a possible implementation manner, the second searching submodule 9032 is configured to use a first silence segment searched forward from a first audio time point corresponding to the audio data to be processed as the first target audio segment in the second audio time range, where the first audio time range is a time range from the first audio time point to an end point of the audio data to be processed, and the second audio time range is a time range from the first audio time point to a start point of the audio data to be processed. This is advantageous for improving the search efficiency and for retaining longer data in the first time period of the audio data to be processed.
In a possible implementation, the searching module 903 further includes: a third searching submodule 9033, configured to, when the detection segment to which a second audio time point corresponding to the audio data to be processed belongs is in a mute state, search for an end point of the first target audio segment within a third audio time range corresponding to the audio data to be processed, search for a starting point of the first target audio segment within a fourth audio time range corresponding to the audio data to be processed, and determine the first target audio segment based on the end point and the starting point, where the third audio time range is the time range from the second audio time point to the end point of the audio data to be processed, and the fourth audio time range is the time range from the second audio time point to the starting point of the audio data to be processed. Therefore, the starting point of the first target audio segment is searched for forward from the second audio time point and its end point backward, so that the time range of the first target audio segment can be expanded, and data redundancy can be reduced to a greater extent.
In a possible implementation, the third searching submodule 9033 is configured to search for an end point of the first target audio segment within a third audio time range of the audio data to be processed, where the end point of the first target audio segment is one of:
in the third audio time range, taking the starting point of the detected segment in the first non-silent state found backwards from the second audio time point as the end point of the first target audio segment;
and taking the end point of the third audio time range as the end point of the first target audio segment when the detection segment in the non-mute state is not found in the third audio time range.
In the embodiment of the present disclosure, the starting point of the first detected segment in a non-mute state found backward from the second audio time point within the third audio time range is used as the end point of the first target audio segment, so the end point can be determined quickly. Further, if all detected segments of the third audio time range are in a mute state, the end point of the third audio time range is taken as the end point of the first target audio segment, which maximizes the range of the first target audio segment.
In a possible implementation, the third searching submodule 9033 is configured to search, in a fourth audio time range of the audio data to be processed, a starting point of the first target audio segment, where the starting point of the first target audio segment is one of:
in the fourth audio time range, taking the end point of the detected segment in the first non-silent state, which is found forward from the second audio time point, as the starting point of the first target audio segment;
and under the condition that the detection segment in the non-mute state is not found in the fourth audio time range, taking the starting point of the fourth audio time range as the starting point of the first target audio segment.
In the embodiment of the present disclosure, the end point of the first detected segment in a non-mute state found forward from the second audio time point within the fourth audio time range is used as the starting point of the first target audio segment, so the starting point can be determined quickly. Further, if all detected segments of the fourth audio time range are in a mute state, taking the start of the fourth audio time range as the starting point of the first target audio segment maximizes the range of the first target audio segment.
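The end-point and start-point rules above can be sketched as follows, again over a list of per-segment silence flags; t is a segment index known to be mute, and range_start/range_end delimit the fourth and third audio time ranges respectively (an assumption of this sketch, not a prescribed interface):

    # Grow a silence segment around a cut point t (a segment index whose flag
    # is 1): scan later segments up to range_end for the end point (third
    # audio time range) and earlier segments down to range_start for the
    # starting point (fourth audio time range).
    def expand_silence(flags, t, range_start, range_end):
        end = range_end                        # default: whole third range mute
        for i in range(t + 1, range_end):
            if flags[i] == 0:                  # first non-mute segment after t
                end = i                        # its starting point ends the silence
                break
        start = range_start                    # default: whole fourth range mute
        for i in range(t - 1, range_start - 1, -1):
            if flags[i] == 0:                  # first non-mute segment before t
                start = i + 1                  # silence begins just after it
                break
        return start, end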
In a possible implementation manner, the slicing module 904 is configured to slice the audio data to be processed based on a start point and an end point of the first target audio segment. Therefore, the second target audio fragment obtained by segmentation has complete semantics, and redundancy is reduced to a greater extent.
In one possible embodiment, as shown in fig. 10, the apparatus further comprises one of:
a deleting module 1001, configured to delete the to-be-processed audio data in a first storage unit when all detected segments of the to-be-processed audio data are in a mute state;
a saving module 1002, configured to save the to-be-processed audio data in the first storage unit as a second target audio segment when all detected segments of the to-be-processed audio data are in a non-mute state.
In the embodiment of the present disclosure, deleting the audio data to be processed that is in the mute state from the first storage unit reduces redundant data to a greater extent. Saving the audio data to be processed in the first storage unit that is in the non-mute state as a second target audio segment preserves the integrity of the speech to a greater extent and can improve the accuracy of subsequent voice auditing. Segmenting the audio data to be processed of the first duration at the first target audio segment makes the segmented second target audio segment semantically more complete and further reduces redundancy.
In a possible implementation, the segmentation module 904 is configured to segment the audio data to be processed based on a start point and an end point of the first target audio segment, which includes:
segmenting the audio data to be processed in the first storage unit at the starting point and the ending point of the first target audio segment;
saving data before the starting point of the first target audio segment as a second target audio segment;
deleting the first target audio piece from the first storage unit;
and continuing to accumulate the audio duration of the audio data to be processed in the first storage unit from the end point of the first target audio clip.
In the embodiment of the disclosure, the audio data to be processed in the first storage unit is segmented by using the start point and the end point of the first target audio segment, and the data before the start point of the first target audio segment is stored as the second target audio segment, so that the second target audio segment has more complete semantics, and the redundancy can be greatly reduced by deleting the first target audio segment.
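A minimal sketch of this slicing step over a segment-level PCM buffer; save_segment is an assumed helper, and the buffers are plain Python lists of detection segments:

    # Cut the buffered PCM at a silence segment [cut_start, cut_end), given as
    # detection-segment indices: everything before the silence is saved as a
    # second target audio segment, the silence itself is dropped, and the
    # remainder stays in the buffer so the first duration keeps accumulating.
    def slice_buffers(pcm_buf, flag_buf, cut_start, cut_end):
        save_segment(pcm_buf[:cut_start])      # semantics-complete slice
        del pcm_buf[:cut_end]                  # drop the slice and the silence
        del flag_buf[:cut_end]                 # keep flags aligned with the PCM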
In one possible implementation, as shown in fig. 10, the obtaining module 901 is configured to extract the audio data to be processed from media stream data by processes, where different media stream data are processed by different processes. Therefore, the concurrent processing capability of processes can be utilized to process data of a plurality of media streams concurrently, reducing idle waiting of computing resources and improving the speed of processing the media stream data.
In one possible implementation, the process comprises a first thread, a second thread and a third thread, and data are interacted among different threads through a storage unit; wherein:
the first thread is used for extracting original audio data from media stream data, decoding the original audio data, storing the audio data to be processed obtained after decoding into a first storage unit, and storing the original audio data into a third storage unit;
the second thread is used for reading the audio data to be processed from the first storage unit for VAD, and storing VAD results into the second storage unit;
the third thread is configured to, when it is detected that the audio duration of the to-be-processed audio data stored in the first storage unit reaches a first duration, search a first target audio segment in the to-be-processed audio data according to the VAD result, and segment the to-be-processed audio data according to the first target audio segment.
In the embodiment of the disclosure, the resource utilization rate can be improved and the blocking can be reduced by a multithreading mode in the process.
In a possible embodiment, as shown in fig. 11, the apparatus further comprises: a storage module 1101, configured to store the original audio data in a third storage unit. The storage module 1101 is further configured to store the to-be-processed audio data obtained by converting the original audio data into the first storage unit. The storage module 1101 is further configured to store the VAD result obtained by performing VAD on the audio data to be processed in the second storage unit. Therefore, richer data resources can be saved, and the playable audio data can conveniently be used to assist auditing, yielding a more accurate audit result.
In a possible implementation manner, the segmentation module 904 is further configured to obtain playable audio data corresponding to the second target audio segment of the audio data to be processed from the original audio data stored in the third storage unit. In the embodiment of the present disclosure, the segmentation module extracts the playable audio data corresponding to the second target audio segment for later use as auxiliary data in auditing. After the second target audio segment and the playable audio data are extracted, the data stored in the storage units can be deleted, reducing the occupation of storage space such as memory and improving operation speed.
For a description of specific functions and examples of each module and each sub-module of the audio data processing apparatus in the embodiment of the present disclosure, reference may be made to the related description of the corresponding step in the embodiment of the audio data processing method, and details are not repeated here.
In a media stream scenario, the main ways of monitoring a real-time media stream, such as live voice content, may include manual review, rule-based review, Artificial Intelligence (AI) capability review, and the like. Manual review is inefficient, prone to fatigue-induced misjudgment and missed judgment, and unsuitable for platforms with a large volume of live broadcasts. Rule-based review relies on experience summarized from manual review to form an expert system, which must be continuously supplemented and corrected and is still prone to missed judgment. AI capability review segments the audio at a fixed duration, so a sentence is very likely to be split into two segments, which may prevent a subsequent AI model from correctly identifying whether the speech is risky, causing misjudgment and missed judgment.
The real-time media stream voice auditing system based on the VAD algorithm provided by the embodiments of the present disclosure can use any audio data processing method of the embodiments of the present disclosure. The system combines streaming media technology with the VAD (Voice Activity Detection) algorithm, saving computing resources and improving processing efficiency. The streaming media technology is used to separate and extract audio data from the media stream in real time and convert it into the required data through codec technology. The VAD algorithm is used to accurately locate the start and end points of speech in noisy audio, so that two sentences are separated as far as possible and the segmented audio meets the input requirements of the AI algorithm model.
To ensure real-time performance and reasonable voice segmentation, the real-time media stream voice auditing system based on the VAD algorithm may include the following parts; the specific system components and their interactions are shown in fig. 12:
(1) The media stream audio extraction and codec module 1201: the original audio data is extracted from the media stream using audio/video codec technology and a multimedia processing tool such as the FFmpeg API. Each piece of original audio data (a packet for short) may be decoded into audio data to be processed, such as raw PCM data (PCM data for short), to facilitate subsequent processing. While decoding, the original audio data can also be encoded into audio data that can be played (playable audio data for short), which can be used for secondary review or retention. The PCM data can be stored in the PCM buffer, and the playable audio data can be stored in the original audio data buffer. The buffer in this example may also be a queue; the following description takes the buffer as an example, and the principle of a queue is similar, so reference may be made to the buffer example.
(2) The VAD module 1202: for example, the VAD module in Web Real-Time Communication (WebRTC) can be extracted separately and integrated into the system. The input to the VAD module may be raw PCM audio data whose duration is fixed at, e.g., 10 ms, 20 ms, or 30 ms, i.e., a very short segment of audio, and the output of the VAD module may be a mute flag such as 0 or 1, representing non-mute or mute. Before detection with the VAD module, Noise Suppression (NS) and other processing may be performed on the audio data to reduce noise signals. The mute flags output by the VAD module may be stored in a mute flag buffer.
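As a hedged illustration, the standalone py-webrtcvad bindings expose this detector as webrtcvad.Vad; the snippet below flags each 20 ms frame of 16 kHz, 16-bit mono PCM as mute (1) or non-mute (0). The aggressiveness mode and frame parameters are illustrative choices, not values fixed by the system.

    # Flag each 20 ms frame of 16 kHz, 16-bit mono PCM as mute (1) or
    # non-mute (0) with the WebRTC detector.
    import webrtcvad

    SAMPLE_RATE = 16000
    FRAME_MS = 20
    FRAME_BYTES = SAMPLE_RATE * 2 * FRAME_MS // 1000   # 640 bytes per frame

    vad = webrtcvad.Vad(2)   # aggressiveness 0-3; 2 is an illustrative choice

    def silence_flags(pcm):
        flags = []
        for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
            frame = pcm[off:off + FRAME_BYTES]
            flags.append(0 if vad.is_speech(frame, SAMPLE_RATE) else 1)
        return flags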
(3) Audio slicing module 1203: because a media stream is generated continuously, has no fixed end time, and cannot wait to be processed until it goes off the air, and because the VAD module only accepts very short pieces of audio data, audio segments, e.g., PCM segments (an example of the second target audio segment), can be cut in real time in the audio slicing module according to the output of the VAD module. Subsequently, each sliced audio segment can be input to an AI model for review to determine whether the media stream is compliant.
In a normal case, the speed of processing the media stream through the FFmpeg API is much faster than the speed at which streaming media data is transferred over the network, so the Central Processing Unit (CPU) may wait idle. Therefore, a plurality of processes can be created to process a plurality of media streams simultaneously; this high-concurrency mode avoids idle hardware waiting, so the hardware can run at maximum efficiency. When the decoded audio data is segmented, the corresponding encoded (playable) audio segment data can be retained through the encoding step, which facilitates review.
The system may adopt the following method to find segmentation points:
1. Finding reasonable cut points
If the audio is segmented at a fixed duration, for example, 60 s (assuming the AI model input supports at most 60 s), a large number of sentences in the audio will be split into two audio segments, which easily makes the AI model's identification inaccurate. The scheme of the embodiments of the present disclosure can cut more reasonably.
In the solution of the embodiments of the present disclosure, the VAD technique from WebRTC is introduced. It detects whether a short piece of audio (e.g., fixed at 10 ms, 20 ms, or 30 ms) is mute, and the identified results (e.g., 0 for non-mute and 1 for mute) are stored in a buffer such as a mute flag buffer. The decoded audio data to be processed, e.g., PCM data, is stored in order in a buffer such as a PCM buffer, and the original audio data is copied into a buffer such as an original data buffer. The original audio data may be audio data obtained by encoding the media stream data, or by encoding the media stream data after processing such as NS. Fig. 13 shows an example of the three buffers' contents: the values 0 to 7 in the PCM data represent the indices of the detection segments of the PCM data; the specific content of a detection segment may be binary codes, on which VAD is performed to determine whether the detection segment is in a mute state or a non-mute state. Since 0 represents the non-mute state and 1 represents the mute state in the mute flag buffer, detection segments 0 to 4 in the PCM buffer are in the non-mute state and segments 5 to 7 are in the mute state.
Referring to fig. 14, a specific flow of an audio data processing method may include:
S1401, a packet, i.e., audio data extracted from media stream data, is received.
S1402, judge whether the duration in the cache, such as the PCM cache, is greater than or equal to a set threshold, such as 60 s. If so, execute S1403; otherwise, return to S1401.
S1403, determine whether a mute segment exists. For example, the PCM buffer is searched for mute segments in the PCM data according to the mute flags in the VAD buffer. If a mute segment exists, S1404 is performed; otherwise, S1407 is performed. Each mute segment may include one detection segment in a mute state or a plurality of detection segments in continuous mute states.
S1404, determine whether there is a mute segment in the first audio time range of the buffered duration, e.g., between the 40th s and the 60th s. If so, S1405 is performed; otherwise, S1406 is performed.
S1405, select the longest continuous mute segment in the first audio time range, e.g., from the 40th s to the 60th s, as the first target audio segment. The first target audio segment may also be referred to as a sliced segment, slicing point, etc. Then S1408 is executed.
S1406, select the longest continuous mute segment in the second audio time range, e.g., between the 0th s and the 40th s, as the first target audio segment. Then S1408 is executed.
S1407, slice directly at the end point, e.g., at the 60th s, the end of the total length of the PCM data. Then S1408 is executed.
S1408, saving the second target audio segment, e.g., the PCM segment, and playable audio data, e.g., data corresponding to the second target audio segment in the original audio data.
S1409, clear the caches up to the split point. For example, in the PCM buffer, the first target audio segment and the second target audio segment buffered before it are deleted. In the original data cache, the original data corresponding to the first target audio segment and to the second target audio segment buffered before it is deleted. In the VAD buffer, the VAD results corresponding to the first target audio segment and to the second target audio segment buffered before it are deleted.
Referring to fig. 14, assuming that the maximum length accepted by the AI model is 60 s, the data in the cache can be sliced each time its duration exceeds a set threshold such as 60 s (which can be configured by the user).
Referring to fig. 14, it can be determined whether the 60 s of buffered audio is entirely mute or entirely non-mute. If it is entirely mute, the buffer can be emptied directly. If it is entirely non-mute, the 60 s of PCM data and the original audio data are directly sliced and saved.
If the data in the buffer includes both mute and non-mute segments, a cut point may be searched for between the 40th s and the 60th s. The 40th s, i.e., two thirds of the total duration, is chosen as the starting point of the search so that the cut audio segment is not too short, which improves system utilization and avoids producing short noise-only segments. The longest continuous mute segment between the 40th s and the 60th s is found, and its front and back boundaries are used as segmentation points, so that the segmented audio is not too short and a sentence is, as far as possible, prevented from being split into two segments of audio.
If no mute segment is found between the 40th s and the 60th s, a mute segment can be searched for from the 40th s toward the start. In this case, the longest mute segment need not be detected, because the 20 s of audio after the 40th s is already long enough to be separated.
After each segmentation, the data of the three caches in the corresponding time period can be cleared, so that the cached data volume is always at most 60 s. Thus, memory usage does not keep growing while real-time performance is ensured. After slicing, PCM data and original data of 60 s or less are produced, and excessively short audio segments are avoided as far as possible.
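Putting the pieces together, the decision flow of fig. 14 might be sketched as below, reusing longest_silence and slice_buffers from the earlier sketches; the 60 s and 40 s thresholds are expressed as counts of 20 ms detection segments, and the fallback search reuses longest_silence for brevity even though, as noted above, the first mute segment found below the 40 s mark would suffice.

    # Decision flow of fig. 14 over segment indices: with 20 ms detection
    # segments, 60 s is 3000 flags and 40 s is 2000.
    SEGS_60S, SEGS_40S = 3000, 2000

    def maybe_slice(pcm_buf, flag_buf):
        if len(flag_buf) < SEGS_60S:
            return                                    # S1402: keep accumulating
        window = flag_buf[:SEGS_60S]
        if all(f == 1 for f in window):               # entirely mute: discard
            del pcm_buf[:SEGS_60S]
            del flag_buf[:SEGS_60S]
            return
        if all(f == 0 for f in window):               # entirely voiced: cut at 60 s
            slice_buffers(pcm_buf, flag_buf, SEGS_60S, SEGS_60S)
            return
        cut = longest_silence(window, SEGS_40S, SEGS_60S)   # S1404/S1405
        if cut is None:
            cut = longest_silence(window, 0, SEGS_40S)      # S1406 fallback
        slice_buffers(pcm_buf, flag_buf, cut[0], cut[1])    # S1408/S1409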
2. Concurrency
Under normal conditions, the speed of audio/video encoding and decoding and of writing the results to disk is far higher than the speed of acquiring stream data from the network, which leaves the CPU idle and resource utilization low. In addition, machine load may cause problems such as disk writes occasionally slowing down due to other workloads. The scheme of this example provides a concurrency architecture to address machine resource utilization and sub-module running speed, and increases the number of media streams a single machine can process concurrently.
1) Multi-process processing
To avoid CPU idle waiting, the user can customize the number of started processes according to the machine load; each process independently pulls one media stream, and each media stream is pulled by only one process. The media stream data is processed inside the process, which finally outputs the original audio data and the PCM data segmented using the VAD algorithm.
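A minimal multi-process sketch, where process_stream is an assumed per-stream entry point running the codec/VAD/slicing pipeline and the URLs are placeholders:

    # One worker process per media stream; process_stream is an assumed
    # per-stream entry point, so several streams are handled concurrently.
    import multiprocessing

    def main(stream_urls):
        workers = []
        for url in stream_urls:
            p = multiprocessing.Process(target=process_stream, args=(url,))
            p.start()                 # each process pulls exactly one stream
            workers.append(p)
        for p in workers:
            p.join()

    if __name__ == "__main__":
        main(["rtmp://example/live/1", "rtmp://example/live/2"])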
2) Multi-thread processing
The operation flow inside a process is shown in fig. 12. If the modules run serially and one module runs slowly due to an environmental problem after receiving a media data packet, the processing of the whole media stream may slow down, which reduces real-time performance and easily loses part of the media stream data. Common environmental problems include disk Input/Output (I/O) being affected by other programs, saving audio files to network storage being affected by network fluctuations, and so on. To increase processing speed, the modules in the system may interact through local caches (or queues), and each module may be executed by a separate thread. For example, thread 1 executes the function of the codec module: it decodes to obtain PCM data, encodes to obtain playable audio data, and stores them in the corresponding caches (or queues). Thread 2 executes the function of the VAD detection module: each time it reads a segment of PCM data of a certain duration, e.g., 20 ms, the VAD identifies the mute flag of the segment and stores it in the corresponding cache (or queue). Thread 3 executes the function of the slicing module: if the corresponding cache (or queue) is found to have reached the set duration, slicing is executed. The main thread can also detect whether the three threads are alive and rebuild any thread that is not. The three threads interact through the caches (or queues) described above. In this way, occasional jamming of a single module does not affect the smoothness of the overall operation of the system and does not cause loss of part of the streaming media data.
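The watchdog behavior of the main thread might be sketched as follows, with make_worker as an assumed factory that builds a fresh thread for a given role:

    import threading
    import time

    def watchdog(make_worker, roles=("codec", "vad", "slicer")):
        workers = {role: make_worker(role) for role in roles}
        for t in workers.values():
            t.start()
        while True:
            for role, t in workers.items():
                if not t.is_alive():               # rebuild any dead thread
                    workers[role] = make_worker(role)
                    workers[role].start()
            time.sleep(1)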
The scheme provided by the embodiments of the present disclosure can not only solve the problems of false reports, missed reports, and wasted manpower in traditional real-time media stream voice auditing, but also solve the inaccurate identification caused when a sentence in the audio data audited by the AI model is split into two segments of audio. In addition, through multiple processes per machine and multiple threads per process, resource utilization can be effectively improved, problems caused by blocking can be avoided, and the number of concurrently processed real-time media streams can be increased. The scheme can be applied to auditing and segmentation of very long videos, preventing the same sentence from being split into different segments. For example, when applied to a number of real online systems, it can reduce the manpower wasted on manual review and greatly improve the accuracy of AI model identification.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 15 shows a schematic block diagram of an example electronic device 1500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 15, the apparatus 1500 includes a computing unit 1501 which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 1502 or a computer program loaded from a storage unit 1508 into a Random Access Memory (RAM) 1503. In the RAM 1503, various programs and data necessary for the operation of the device 1500 can also be stored. The calculation unit 1501, the ROM 1502, and the RAM 1503 are connected to each other by a bus 1504. An input/output (I/O) interface 1505 is also connected to bus 1504.
A number of components in the device 1500 are connected to the I/O interface 1505, including: an input unit 1506 such as a keyboard, a mouse, and the like; an output unit 1507 such as various types of displays, speakers, and the like; a storage unit 1508, such as a magnetic disk, optical disk, or the like; and a communication unit 1509 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1509 allows the device 1500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1501 may be various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computation unit 1501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computation chips, various computation units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 1501 executes the respective methods and processes described above, such as an audio data processing method. For example, in some embodiments, the audio data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1500 via the ROM 1502 and/or the communication unit 1509. When the computer program is loaded into the RAM 1503 and executed by the computing unit 1501, one or more steps of the audio data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1501 may be configured to perform the audio data processing method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (29)

1. An audio data processing method, comprising:
acquiring audio data to be processed; the audio data to be processed is obtained by extracting based on media stream data generated in real time, and is stored in a first storage unit;
reading the detection segment with the second duration from the audio data to be processed of the first storage unit each time to perform voice activity detection VAD to obtain a VAD result; the VAD result is stored in a second storage unit;
searching a first target audio segment in the audio data to be processed according to the VAD result;
segmenting the audio data to be processed according to the first target audio segment, including: segmenting the audio data to be processed in the first storage unit at the starting point and the ending point of the first target audio segment; saving data before the starting point of the first target audio segment as a second target audio segment; deleting the first target audio piece from the first storage unit; and continuing to accumulate the audio duration of the audio data to be processed in the first storage unit from the end point of the first target audio clip.
2. The method of claim 1, wherein finding a first target audio segment in the audio data to be processed according to the VAD result comprises:
and under the condition that the audio time corresponding to the audio data to be processed reaches a first time length according to the first storage unit and/or the second storage unit, searching a first target audio segment in the audio data to be processed according to the VAD result.
3. The method according to claim 2, wherein reading the detection segment of the second duration from the audio data to be processed in the first storage unit each time to perform VAD to obtain VAD results comprises:
obtaining a mute mark of the detection segment, wherein the mute mark is used for indicating that the detection segment is in a mute state or a non-mute state;
and storing the mute mark of the detection segment into the second storage unit to obtain the VAD result.
4. The method according to any of claims 1 to 3, wherein finding a first target audio segment in the audio data to be processed according to the VAD result comprises:
and searching for a first target audio segment in the audio data to be processed according to a silence mark, included in the VAD result, of a detection segment in the audio data to be processed, wherein the first target audio segment comprises a detection segment of a silence state or a plurality of detection segments of continuous silence states in the audio data to be processed.
5. The method according to claim 4, wherein searching for a first target audio segment in the audio data to be processed according to the silence flag of the detected segment in the audio data to be processed included in the VAD result comprises:
and searching a mute section in a first audio time range corresponding to the audio data to be processed as the first target audio section, wherein the mute section comprises a detection section in a mute state or a plurality of continuous detection sections in the mute state, and the audio duration corresponding to the first audio time range is less than the audio duration corresponding to the audio data to be processed.
6. The method of claim 5, wherein searching for a mute section as the first target audio section within a first audio time range corresponding to the audio data to be processed comprises:
and under the condition that a plurality of mute sections exist in the first audio time range, determining the corresponding mute section with the longest audio time length as the first target audio section.
7. The method according to claim 4, wherein searching for a first target audio segment in the audio data to be processed according to the silence flag of the detected segment in the audio data to be processed included in the VAD result comprises:
and under the condition that no mute segment exists in a first audio time range corresponding to the audio data to be processed, searching the first target audio segment in a second audio time range corresponding to the audio data to be processed, wherein the mute segment comprises a detection segment in a mute state or a plurality of continuous detection segments in the mute state.
8. The method of claim 7, wherein searching for the first target audio segment within a second audio time range corresponding to the audio data to be processed comprises:
and in the second audio time range, taking a first mute segment found forward from a first audio time point corresponding to the audio data to be processed as the first target audio segment, where the first audio time range is a time range from the first audio time point to an end point of the audio data to be processed, and the second audio time range is a time range from the first audio time point to a start point of the audio data to be processed.
9. The method according to claim 4, wherein searching for a first target audio segment in the audio data to be processed according to the silence flag of the detected segment in the audio data to be processed included in the VAD result comprises:
under the condition that a detection segment to which a second audio time point corresponding to the audio data to be processed belongs is in a mute state, searching an end point of the first target audio segment in a third audio time range corresponding to the audio data to be processed, searching a starting point of the first target audio segment in a fourth audio time range corresponding to the audio data to be processed, and determining the first target audio segment based on the end point and the starting point, wherein the third audio time range is a time range from the second audio time point to the end point of the audio data to be processed, and the fourth audio time range is a time range from the second audio time point to the starting point of the audio data to be processed.
10. The method of claim 9, wherein finding the end point of the first target audio segment within a third audio time range of the audio data to be processed comprises one of:
in the third audio time range, taking the starting point of the detected segment in the first non-silent state found backwards from the second audio time point as the end point of the first target audio segment;
and taking the end point of the third audio time range as the end point of the first target audio segment when the detection segment in the non-mute state is not found in the third audio time range.
11. The method of claim 9 or 10, wherein finding the start of the first target audio segment within a fourth audio time range of the audio data to be processed comprises one of:
in the fourth audio time range, taking the end point of the detected segment in the first non-mute state, which is found forward from the second audio time point, as the starting point of the first target audio segment;
and taking the starting point of the fourth audio time range as the starting point of the first target audio segment when the detected segment in the non-mute state is not found in the fourth audio time range.
12. The method of any of claims 1-3, or 5-10, wherein segmenting the audio data to be processed according to the first target audio segment comprises:
and segmenting the audio data to be processed based on the starting point and the end point of the first target audio segment.
13. The method of claim 3, further comprising one of:
deleting the audio data to be processed in a first storage unit under the condition that all detection segments of the audio data to be processed are in a mute state;
and under the condition that all the detected sections of the audio data to be processed are in a non-mute state, storing the audio data to be processed in the first storage unit as a second target audio section.
14. The method of any of claims 1 to 3, or 5 to 10, or 13, wherein obtaining audio data to be processed comprises:
and extracting the audio data to be processed from the media stream data through the processes, wherein different media stream data are processed through different processes.
15. The method of claim 14, wherein the process comprises a first thread, a second thread, and a third thread, data being exchanged between different threads through a memory location; wherein:
the first thread is used for extracting original audio data from media stream data, decoding the original audio data, storing the audio data to be processed obtained after decoding into a first storage unit, and storing the original audio data into a third storage unit;
the second thread is used for reading the audio data to be processed from the first storage unit to perform VAD, and storing VAD results into the second storage unit;
the third thread is configured to, when it is detected that the audio duration of the to-be-processed audio data stored in the first storage unit reaches a first duration, search a first target audio segment in the to-be-processed audio data according to the VAD result, and segment the to-be-processed audio data according to the first target audio segment.
16. The method of any of claims 1-3, or 5-10, or 13, or 15, further comprising:
and acquiring playable audio data corresponding to a second target audio clip of the audio data to be processed from the original audio data stored in the third storage unit.
17. An audio data processing apparatus comprising:
the acquisition module is used for acquiring audio data to be processed; the audio data to be processed is obtained by extracting based on media stream data generated in real time, and the audio data to be processed is stored in a first storage unit;
the VAD module is used for reading the detection segment with the second duration from the audio data to be processed of the first storage unit each time to perform voice activity detection VAD so as to obtain a VAD result; the VAD result is stored in a second storage unit;
the searching module is used for searching a first target audio segment in the audio data to be processed according to the VAD result;
a segmentation module, configured to segment the audio data to be processed according to the first target audio segment, including: segmenting the audio data to be processed in the first storage unit at the starting point and the ending point of the first target audio segment; saving data before the starting point of the first target audio segment as a second target audio segment; deleting the first target audio piece from the first storage unit; and continuing to accumulate the audio duration of the audio data to be processed in the first storage unit from the end point of the first target audio clip.
18. The apparatus of claim 17, wherein,
the searching module is used for searching a first target audio segment in the audio data to be processed according to the VAD result under the condition that the audio time corresponding to the audio data to be processed is determined to reach a first time length according to the first storage unit and/or the second storage unit.
19. The apparatus according to claim 18, wherein the VAD module is configured to obtain a silence flag of the detected segment, the silence flag indicating whether the detected segment is in a silence state or in a non-silence state; and storing the mute mark of the detection segment into the second storage unit to obtain the VAD result.
20. The apparatus according to any of claims 17 to 19, wherein the searching module is configured to search the to-be-processed audio data for a first target audio segment according to a silence flag of a detected segment in the to-be-processed audio data included in the VAD result, where the first target audio segment includes a detected segment of a silence state or a plurality of detected segments of consecutive silence states in the to-be-processed audio data.
21. The apparatus of claim 20, wherein the searching module comprises:
the first searching submodule is configured to search a mute segment in a first audio time range corresponding to the audio data to be processed as the first target audio segment, where the mute segment includes a detection segment in a mute state or multiple detection segments in a continuous mute state, and an audio duration corresponding to the first audio time range is smaller than an audio duration corresponding to the audio data to be processed.
22. The apparatus of claim 21, wherein the first lookup submodule is configured to determine, as the first target audio segment, a corresponding silence segment with a longest audio duration if multiple silence segments exist within the first audio time range.
23. The apparatus of claim 20, wherein the searching module further comprises:
and the second searching submodule is used for searching the first target audio clip in a second audio time range corresponding to the audio data to be processed under the condition that no mute clip exists in a first audio time range corresponding to the audio data to be processed, wherein the mute clip comprises a detection clip in a mute state or a plurality of continuous detection clips in the mute state.
24. The apparatus according to claim 23, wherein the second search submodule is configured to use a first silence segment searched forward from a first audio time point corresponding to the audio data to be processed as the first target audio segment in the second audio time range, where the first audio time range is a time range from the first audio time point to an end point of the audio data to be processed, and the second audio time range is a time range from the first audio time point to a start point of the audio data to be processed.
25. The apparatus of claim 20, wherein the searching module further comprises:
a third searching submodule, configured to search an end point of the first target audio segment within a third audio time range corresponding to the audio data to be processed, search a start point of the first target audio segment within a fourth audio time range corresponding to the audio data to be processed, and determine the first target audio segment based on the end point and the start point, where the third audio time range is a time range from the second audio time point to the end point of the audio data to be processed, and the fourth audio time range is a time range from the second audio time point to the start point of the audio data to be processed, when a detected segment to which a second audio time point corresponding to the audio data to be processed belongs is in a mute state.
26. The apparatus of claim 25, wherein the third finding sub-module is configured to find an end point of the first target audio segment within a third audio time range of the audio data to be processed, including one of:
in the third audio time range, taking the starting point of the detected segment in the first non-silent state found backwards from the second audio time point as the end point of the first target audio segment;
and taking the end point of the third audio time range as the end point of the first target audio segment when the detection segment in the non-mute state is not found in the third audio time range.
27. The apparatus according to claim 25 or 26, wherein the third finding sub-module is configured to find the start of the first target audio segment within a fourth audio time range of the audio data to be processed, and comprises one of:
in the fourth audio time range, taking the end point of the detected segment in the first non-silent state, which is found forward from the second audio time point, as the starting point of the first target audio segment;
and taking the starting point of the fourth audio time range as the starting point of the first target audio segment when the detected segment in the non-mute state is not found in the fourth audio time range.
28. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-16.
29. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-16.
CN202210471428.9A 2022-04-28 2022-04-28 Audio data processing method, device, equipment and storage medium Active CN114827756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210471428.9A CN114827756B (en) 2022-04-28 2022-04-28 Audio data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210471428.9A CN114827756B (en) 2022-04-28 2022-04-28 Audio data processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114827756A (en) 2022-07-29
CN114827756B (en) 2023-03-21

Family

ID=82511886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210471428.9A Active CN114827756B (en) 2022-04-28 2022-04-28 Audio data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114827756B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170309298A1 (en) * 2016-04-20 2017-10-26 Gracenote, Inc. Digital fingerprint indexing
CN107623860A (en) * 2017-08-09 2018-01-23 北京奇艺世纪科技有限公司 Multi-medium data dividing method and device
CN112331188A (en) * 2019-07-31 2021-02-05 武汉Tcl集团工业研究院有限公司 Voice data processing method, system and terminal equipment
CN113392234A (en) * 2021-02-01 2021-09-14 腾讯科技(北京)有限公司 Multimedia file processing method, device, equipment and medium
CN113345473B (en) * 2021-06-24 2024-02-13 中国科学技术大学 Voice endpoint detection method, device, electronic equipment and storage medium
CN114005469A (en) * 2021-10-20 2022-02-01 广州市网星信息技术有限公司 Audio playing method and system capable of automatically skipping mute segment
CN114093392A (en) * 2021-11-04 2022-02-25 北京百度网讯科技有限公司 Audio labeling method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114827756A (en) 2022-07-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant