WO2023050994A1

WO2023050994A1 - Audio control method and apparatus, device, and computer readable storage medium

Info

Publication number: WO2023050994A1
Application number: PCT/CN2022/108334
Authority: WO
Inventors: 肖源; 罗厅
Original assignee: 中兴通讯股份有限公司
Priority date: 2021-09-28
Filing date: 2022-07-27
Publication date: 2023-04-06
Also published as: CN115883527A

Abstract

An audio control method and apparatus, a device, and a computer readable storage medium. The method comprises: obtaining network feature data in a network processing process and audio feature data in an audio processing process (S100); inputting the network feature data and the audio feature data into a classifier model to obtain multiple audio control instructions (S200); and controlling a corresponding audio processing operation in a corresponding audio processing process according to the audio control instructions (S300).

Description

Audio control method and apparatus, device, and computer-readable storage medium

Cross References to Related Applications

This application is based on a Chinese patent application with application number 202111145232.2 and a filing date of September 28, 2021, and claims the priority of this Chinese patent application. The entire content of this Chinese patent application is hereby incorporated by reference into this application.

Technical Field

The embodiments of the present application relate to, but are not limited to, the field of network communication technologies, and in particular to an audio control method, apparatus, device, and computer-readable storage medium.

Background

Audio communication is often part of network communication systems. In some cases, a variety of methods are used to optimize the network, such as optimizing the network topology or sending redundant packets at the sending end, in order to ensure the audio quality at the playback end. However, optimizing the network topology has poor operability, because it lacks controllability and is strongly influenced by objective factors. Sending redundant packets requires the support of the sending end and, in the case of congestion, makes the network situation worse, thereby affecting the audio effect.

Summary of the Invention

The following is an overview of the topics described in detail in this article. This summary is not intended to limit the scope of the claims.

Embodiments of the present application provide an audio control method, apparatus, device, and computer-readable storage medium.

In a first aspect, an embodiment of the present application provides an audio control method, including: acquiring network feature data in a network processing process and audio feature data in an audio processing process; inputting the network feature data and the audio feature data into a classifier model to obtain several audio control instructions; and controlling, according to the audio control instructions, the corresponding audio processing operation in the corresponding audio processing process.

In a second aspect, an embodiment of the present application further provides an audio control apparatus, configured to execute the audio control method described in the first aspect above.

In a third aspect, an embodiment of the present application further provides an audio control device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the audio control method described in the first aspect above.

In a fourth aspect, the embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, and the computer-executable instructions are used to execute the audio control method as described in the first aspect above.

Additional features and advantages of the application will be set forth in the description which follows, and, in part, will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Brief Description of the Drawings

The accompanying drawings provide a further understanding of the technical solution of the present application and constitute a part of the description. Together with the embodiments of the present application, they are used to explain the technical solution of the present application and do not constitute a limitation on it.

FIG. 1 is a schematic flowchart of an audio control method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of obtaining audio control instructions provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of obtaining an audio stretching control instruction or an audio compression control instruction provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of obtaining a stop audio decoding control instruction or an audio decoding average speed control instruction provided by an embodiment of the present application;

FIG. 5 is a schematic flowchart of obtaining audio stretching data provided by an embodiment of the present application;

FIG. 6 is a schematic flowchart of obtaining audio compression data provided by an embodiment of the present application;

FIG. 7 is a schematic flowchart of obtaining a stop audio decoding control instruction or an audio decoding average speed control instruction provided by another embodiment of the present application;

FIG. 8 is a schematic structural diagram of an audio control system provided by an embodiment of the present application.

Detailed Description

In order to make the purpose, technical solution and advantages of the present application clearer, the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, not to limit the present application.

It should be noted that although functional modules are divided in the schematic diagram of the apparatus and a logical sequence is shown in the flowchart, in some cases the steps shown or described may be executed with a module division different from that in the apparatus, or in an order different from that in the flowchart. The terms "first", "second" and the like in the specification, the claims, and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

In some cases, a variety of methods are generally used to optimize the network, for example with respect to network jitter and network packet loss. Specifically, the network can be optimized by optimizing the network topology or by sending redundant packets at the sending end, so as to ensure the audio quality at the playback end. However, optimizing the network topology has poor operability, because it lacks controllability and is strongly influenced by objective factors. Sending redundant packets requires the support of the sending end and, in the case of congestion, makes the network situation worse, thereby affecting the audio effect.

Based on this, embodiments of the present application provide an audio control method, apparatus, device, and computer-readable storage medium, which can effectively ensure the audio effect and have good operability.

It can be understood that the embodiment of the present application specifically relates to real-time audio reception and playback, including but not limited to application scenarios such as video conferencing, video playback, real-time voice chat, or co-hosted live streaming (Lianmai).

The embodiments of the present application will be further described below in conjunction with the accompanying drawings.

The embodiment of the first aspect of the present application specifically provides an audio control method, as shown in FIG. 1 , which is a schematic flowchart of the audio control method provided by an embodiment of the present application.

The audio control method of the embodiment of the present application includes but is not limited to the following steps:

Step S100, acquiring network feature data during network processing and audio feature data during audio processing;

Step S200, inputting network feature data and audio feature data into the classifier model to obtain several audio control instructions;

Step S300, according to the audio control instruction, control the corresponding audio processing operation in the corresponding audio processing process.

It can be understood that the embodiment of the present application extracts the corresponding feature data from the network processing process and the audio processing process, and the classifier model processes this feature data to obtain classification results, that is, several audio control instructions. The audio control instructions obtained by classification are then issued adaptively to control the corresponding audio processing operations in the corresponding audio processing processes, which effectively ensures the audio effect and provides good operability.
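As a rough illustration of how steps S100 to S300 fit together, the following minimal sketch wires three stand-in callables into one control pass. The function name `audio_control_step`, the instruction name `"stretch"`, and the handler dictionary are hypothetical and are not taken from the application.

```python
def audio_control_step(collect_features, classify, handlers):
    """Run one pass of steps S100 to S300 and return the instructions applied."""
    features = collect_features()        # S100: network + audio feature data
    instructions = classify(features)    # S200: classifier model output
    applied = []
    for name in instructions:            # S300: control the matching operation
        handler = handlers.get(name)
        if handler is not None:          # unknown instructions are skipped (assumption)
            handler()
            applied.append(name)
    return applied


# Toy wiring with stand-in callables and made-up feature values.
log = []
applied = audio_control_step(
    collect_features=lambda: [0.4, 0.2, 25.0, 0.9, 0.8, 1.0],
    classify=lambda features: ["stretch"],
    handlers={"stretch": lambda: log.append("stretching audio")},
)
print(applied, log)
```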

It can be understood that, the classifier model in the embodiment of the present application may adopt a support vector machine (Support Vector Machine, SVM). The support vector machine can perform binary classification on feature data (such as network feature data and audio feature data) in a supervised learning manner, and it is a generalized linear classifier.

In the embodiment of the present application, a support vector machine is used as a classifier model to classify network feature data and audio feature data with supervised learning. The support vector machine of this embodiment usually has three kinds of kernel functions: linear kernel, polynomial kernel and Gaussian kernel. Since the linear kernel has the advantages of few parameters and fast iteration, it has a good effect on the classification of supervised learning. Therefore, this embodiment adopts the support vector machine of the linear kernel as the classifier model, and its training time is shorter and the accuracy is higher.
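To make the idea of a linear-kernel support vector machine concrete, the sketch below trains a binary linear classifier by plain sub-gradient descent on the regularized hinge loss. This is only one common way to fit a linear SVM and is not stated to be the training procedure of the application; the function names, hyperparameters, and toy data are all assumptions made for illustration.

```python
def train_linear_svm(samples, labels, epochs=20, lr=0.1, lam=0.01):
    """Fit w, b by sub-gradient descent on the regularized hinge loss.

    labels must be +1 or -1.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks w on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Sample is inside the margin: move the boundary away from it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy, linearly separable data (made up for illustration).
xs = [[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]]
ys = [1, 1, -1, -1]
w, b = train_linear_svm(xs, ys)
print([predict(w, b, x) for x in xs])
```

With only the weight vector and the offset to learn, each training step costs one dot product, which matches the few-parameters, fast-iteration motivation given for the linear kernel.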

It can be understood that different classifiers can also be used to replace the support vector machine in the embodiment of the present application, including but not limited to: deep belief network, linear classifier, random forest and other classifiers.

It can be understood that, in the embodiment of the present application, the classifier model needs to be constructed before the network feature data and the audio feature data are input into the classifier model.

The classifier model can be constructed by means of offline training and online training.

For example, for the offline training method: feature training data, such as network feature training data and audio feature training data, is collected from different platforms to construct training samples, and the training samples are stored. The classifier model is then trained by inputting the training samples into a high-performance machine. In offline training, unsupervised clustering of the feature training vectors corresponding to the training samples is not required.

For the online training method: On the basis of the offline training classifier model, iteratively fine-tuning the classifier model according to the network processing conditions and audio processing conditions of different platforms, so as to make the classifier model more robust. In online training, a sub-thread needs to be specially set up for iterative fine-tuning, and the time interval between two online trainings should not be too short.

It can be understood that in some cases, in audio communication, in order to ensure the audio quality of the playback end, the method of designing a static buffer queue at the playback end can be used to optimize the network, such as optimizing network jitter and packet loss. Through the method of designing a static buffer queue on the playback side, although it can resist large jitters, if the length of the buffer queue is set unreasonably, it will increase the delay, resulting in poor real-time performance, thereby affecting the audio effect.

Based on this, the network feature data in this embodiment of the present application includes network cache queue data, network jitter data, and network packet loss rate data, and the network processing process includes a network cache queue process and a network prediction process.

Among them, the network cache queue data is obtained from the network real-time transport protocol (RTP) packet data through the network cache operation in the network cache queue process. The network jitter data and the network packet loss rate data are both obtained from the RTP packet data through the network cache operation in the network cache queue process followed by the network prediction operation in the network prediction process.

In the embodiment of the present application, audio encoding data is packaged to form network real-time transport protocol (RTP) packet data. The RTP packet data can be used as the initial input data of the audio control method of this embodiment. When the RTP packet data is transmitted to the network cache queue, the network cache queue process performs the corresponding network cache operation on it, so as to cache the initial input data, that is, the RTP packet data. For example, in the network cache queue process, the received RTP packets can be sorted according to the time at which each packet enters the network cache queue process, so that the serial number data corresponding to each RTP packet is obtained. This facilitates the further calculation of the network jitter data and the network packet loss rate data.

The network cache queue data may be network cache queue length data; for example, the size of the network cache queue length data may be expressed in bytes. The network cache queue data reflects the number of network real-time transport protocol (RTP) packets buffered in the network cache queue process.
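The following minimal sketch shows one way a receiver might track the network cache queue length. Because the text mentions both a length in bytes and a number of buffered packets, the hypothetical class `NetworkCacheQueue` exposes both; its name, methods, and the 160-byte payloads are assumptions for illustration only.

```python
from collections import deque


class NetworkCacheQueue:
    """FIFO of (arrival_ms, payload) pairs kept in arrival order."""

    def __init__(self):
        self._packets = deque()

    def push(self, arrival_ms, payload):
        """Buffer one received packet together with its arrival time."""
        self._packets.append((arrival_ms, payload))

    def pop(self):
        """Remove and return the oldest packet, or None if the queue is empty."""
        return self._packets.popleft() if self._packets else None

    def packet_count(self):
        """Number of packets currently buffered."""
        return len(self._packets)

    def length_bytes(self):
        """Total payload size of the buffered packets, in bytes."""
        return sum(len(payload) for _, payload in self._packets)


queue = NetworkCacheQueue()
queue.push(0.0, b"\x00" * 160)
queue.push(21.5, b"\x00" * 160)
print(queue.packet_count(), queue.length_bytes())  # 2 320
```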

As the network real-time transport protocol (RTP) packet data enters the network cache queue process, the corresponding network cache operation is performed to obtain the network cache queue data. After this network cache operation, the RTP packet data is taken out of the queue and enters the network prediction process, where the network jitter data and the network packet loss rate data are obtained through the corresponding network prediction operation. By setting a dynamic buffer queue, the delay can be effectively reduced and the real-time performance can be improved.

It can be understood that the audio feature data in this embodiment of the present application includes audio decoding rate data, audio buffer queue data and audio consumption rate data, and the audio processing process includes audio decoding process, audio buffering process and audio consumption process.

Among them, the audio decoding rate data is obtained from the network real-time transport protocol (RTP) packet data through the network cache operation in the network cache queue process followed by the audio decoding operation in the audio decoding process. The audio buffer queue data is obtained as follows: the RTP packet data yields audio decoding data through the audio decoding operation in the audio decoding process, and the audio decoding data then undergoes the audio buffering operation in the audio buffering process. The audio consumption rate data is obtained from the audio decoding data through the audio buffering operation in the audio buffering process followed by the audio consumption operation in the audio consumption process.

In the embodiment of the present application, after the network cache operation in the network cache queue process, the network real-time transport protocol (RTP) packet data is passed to the audio decoding process, where the corresponding audio decoding operation yields the audio decoding rate data and the audio decoding data. The audio buffer queue data is then obtained from the audio decoding data through the corresponding audio buffering operation in the audio buffering process. After that audio buffering operation, the audio consumption rate data is obtained through the audio consumption operation in the audio consumption process.

It can be understood that the audio consumption in this embodiment of the present application may be audio playback.

The audio buffer queue data may be audio buffer queue length data, for example, the size of the audio buffer queue length data may be expressed in bytes. The audio buffer queue data is the number of buffers corresponding to the audio decoding data in the audio buffering process.

It can be understood that the network cache queue data is denoted $L_{net}$ and the audio buffer queue data is denoted $L_{audio}$.

In one embodiment, the audio codec generally uses audio data corresponding to a unit time of 20 ms as its basic processing unit. For example, audio data corresponding to a unit time of 20 ms is processed by the audio encoding operation in the audio encoding process, and the resulting audio encoding data is packaged to form network real-time transport protocol (RTP) packet data, which enters the network cache queue process as the initial input data during network processing. The RTP packet data in this embodiment undergoes the network cache operation in the network cache queue process and then the audio decoding operation in the audio decoding process to obtain audio decoding data. It can be understood that the audio decoding data may also be audio data corresponding to a unit time of 20 ms.

After the corresponding network cache operation in the network cache queue process, the network real-time transport protocol (RTP) packet data undergoes the network prediction operation in the network prediction process. For example, the network prediction operation may use the time difference between two adjacent RTP packets entering the network cache queue, the number of RTP packets buffered in the network cache queue process, and the serial number data of each RTP packet in the network cache queue process to calculate the network cache queue data, network jitter data, and network packet loss rate data. These data are then fed back to the audio control process. It can be understood that the audio control process in this embodiment executes steps S200 and S300.

Define the time difference between two adjacent network real-time transport protocol (RTP) packets entering the network cache queue as $Net_{jitter}$; that is, $Net_{jitter}$ is the network jitter data corresponding to the RTP packet data, where $Net_{jitter} = T_{current} - T_{last}$. Here $T_{current}$ represents the time at which the current RTP packet enters the network cache queue process, and $T_{last}$ represents the time at which the previous RTP packet entered the network cache queue process.
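The jitter definition above can be illustrated with a short sketch that computes $T_{current} - T_{last}$ for every pair of adjacent arrivals. The function name `network_jitter` and the arrival times are made up for illustration.

```python
def network_jitter(arrival_times_ms):
    """Return Net_jitter = T_current - T_last for each adjacent packet pair."""
    return [
        current - last
        for last, current in zip(arrival_times_ms, arrival_times_ms[1:])
    ]


# Four packets arriving at 0, 20, 45 and 60 ms.
arrivals = [0, 20, 45, 60]
print(network_jitter(arrivals))  # [20, 25, 15]
```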

It can be understood that in the process of the network cache queue, the data of each network real-time transmission protocol packet received can be sorted, that is, sorted according to the corresponding time when each network real-time transmission protocol packet data enters the network cache queue in sequence, The serial number data corresponding to each network real-time transmission protocol packet data is obtained.

Define the network packet loss rate data as $Net_{lost}$. It is obtained by dividing the number of network real-time transport protocol (RTP) packets buffered in the network cache queue process by the difference between the maximum and minimum serial numbers of those packets, that is, $Net_{lost} = num(rtp) / (index(rtp)_{max} - index(rtp)_{min})$, where $rtp$ represents the RTP packet data; $num(rtp)$ represents the number of RTP packets buffered in the network cache queue process; $index(rtp)_{max}$ represents the maximum serial number of the RTP packets in the network cache queue process; and $index(rtp)_{min}$ represents the minimum serial number of the RTP packets in the network cache queue process.
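The formula for $Net_{lost}$ can be evaluated directly from the serial numbers of the buffered packets, as in the sketch below. The function name `net_lost` is hypothetical, and two behaviours are assumptions not covered by the text: the function returns `None` when fewer than two distinct serial numbers are buffered (the denominator would be zero), and serial number wrap-around is ignored.

```python
def net_lost(serial_numbers):
    """Net_lost = num(rtp) / (index(rtp)_max - index(rtp)_min), as defined in the text.

    Returns None when the denominator would be zero (assumption).
    """
    if not serial_numbers:
        return None
    span = max(serial_numbers) - min(serial_numbers)
    if span == 0:
        return None
    return len(serial_numbers) / span


# Packets 101 and 103 are missing from the buffered range 100..104.
print(net_lost([100, 102, 104]))  # 0.75
```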

It can be understood that the audio feature data in this embodiment includes audio decoding rate data, audio buffer queue data, and audio consumption rate data. Define the audio decoding rate data as $V_{decode}$ and the audio consumption rate data as $V_{play}$.

$V_{decode} = T_{audio1} / (T_2 - T_1)$, where $T_{audio1}$ represents the decoding time, in ms, corresponding to the audio decoding data within the preset first calculation period during the audio decoding process; $T_2$ represents the end time of the first calculation period; and $T_1$ represents the start time of the first calculation period. It can be understood that $audio1$ represents the data length of the audio decoding data within the preset first calculation period during the audio decoding process.

$V_{play} = T_{audio2} / (T_4 - T_3)$, where $T_{audio2}$ represents the consumption time, in ms, corresponding to the audio consumption data within the preset second calculation period during the audio consumption process; $T_4$ represents the end time of the second calculation period; and $T_3$ represents the start time of the second calculation period. It can be understood that $audio2$ represents the data length of the audio consumption data within the preset second calculation period during the audio consumption process, where the audio consumption data may be audio stretching data, audio compression data, or audio decoding data. It can be understood that the first calculation period and the second calculation period may be the same or different, which is not limited herein. For example, if the second calculation period is 100 ms and the consumption time corresponding to the audio consumption data of a certain data length is 80 ms, then the corresponding audio consumption rate data is 0.8. In other embodiments, it can also be expressed as 80%.
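Both rates are the ratio of a processed audio duration to the length of a calculation period, so a single helper covers them. The sketch below reproduces the 80 ms in 100 ms example from the text; the function name `rate` and the guard against an empty period are assumptions.

```python
def rate(processed_ms, period_start_ms, period_end_ms):
    """Return processed_ms / (period_end_ms - period_start_ms).

    Used for both V_decode (T_audio1, T_1, T_2) and V_play (T_audio2, T_3, T_4).
    """
    period = period_end_ms - period_start_ms
    if period <= 0:
        raise ValueError("calculation period must be positive")
    return processed_ms / period


# Example from the text: 80 ms of audio consumed in a 100 ms period.
v_play = rate(80, 0, 100)
print(v_play)  # 0.8
```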

It can be understood that the network cache queue data, network jitter data, network packet loss rate data, audio decoding rate data, audio buffer queue data, and audio consumption rate data are all instantaneous data. A feature vector $F$ is constructed by acquiring the aforementioned network feature data and audio feature data within a preset calculation period: $F[N] = [L_{net} \cdots L_{audio} \cdots Net_{jitter} \cdots Net_{lost} \cdots V_{decode} \cdots V_{play}]$. The feature vector is fed back to the audio control process; that is, the network feature data and the audio feature data are input into the classifier model to obtain several audio control instructions, and the corresponding audio processing operation in the corresponding audio processing process is controlled according to the audio control instructions.
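One possible way to assemble such a feature vector is sketched below. It assumes that several instantaneous samples are taken within the calculation period and that the vector lists all values of one feature before moving on to the next; this layout, the class `FeatureSample`, and the function `build_feature_vector` are assumptions for illustration and are not specified by the text.

```python
from dataclasses import dataclass, astuple


@dataclass
class FeatureSample:
    """One instantaneous reading of the six features."""
    l_net: float
    l_audio: float
    net_jitter: float
    net_lost: float
    v_decode: float
    v_play: float


def build_feature_vector(samples):
    """Flatten samples feature by feature: all L_net values, then all L_audio, ..."""
    columns = zip(*(astuple(sample) for sample in samples))
    return [value for column in columns for value in column]


# Two made-up samples taken within one calculation period.
samples = [
    FeatureSample(1, 2, 3, 4, 5, 6),
    FeatureSample(10, 20, 30, 40, 50, 60),
]
print(build_feature_vector(samples))
# [1, 10, 2, 20, 3, 30, 4, 40, 5, 50, 6, 60]
```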

It can be understood that the trained classifier model contains $w$ and $b$. By inputting the feature vector $F$ into the classifier model, the category label $C$ is obtained as the output, that is, the audio control instruction is obtained: $C = \mathrm{sigmoid}(w \cdot F + b)$, where $w$ represents the feature matrix, $b$ represents the offset, $F$ represents the feature vector, sigmoid represents the activation function, and $C$ represents the category label. With the feature vector, the audio control instruction to be issued can be predicted, which reduces delay and ensures the audio effect.
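The inference step $C = \mathrm{sigmoid}(w \cdot F + b)$ can be sketched as follows. The text does not say how the output is mapped to the individual instructions, so this example assumes one row of $w$ per instruction and picks the highest score; the label order, the instruction names, and all numeric parameters are made up for illustration.

```python
import math

# Hypothetical label order; the text names the instructions but not an encoding.
INSTRUCTIONS = ["stretch", "compress", "stop_decode", "average_speed_decode"]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def classify(w, b, f):
    """Return (instruction, scores) where scores[i] = sigmoid(w[i] . f + b[i])."""
    scores = [
        sigmoid(sum(wij * fj for wij, fj in zip(row, f)) + bi)
        for row, bi in zip(w, b)
    ]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return INSTRUCTIONS[best], scores


# Made-up parameters: one row of w per instruction, six features per row.
w = [
    [-1.0, -2.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, -1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
b = [0.0, -1.0, 0.0, 0.0]
f = [0.1, 2.0, 0.2, 0.9, 1.0, 1.0]  # order: L_net, L_audio, jitter, lost, V_decode, V_play
instruction, scores = classify(w, b, f)
print(instruction)  # compress
```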

Referring to FIG. 2, it can be understood that step S200 includes but is not limited to:

Step S210, input the network cache queue data, network jitter data, network packet loss rate data, audio decoding rate data, audio buffer queue data, and audio consumption rate data into the classifier model to obtain an audio decoding control instruction and an audio pitch-shifting and speed-changing control instruction.

It can be understood that the network feature data includes network cache queue data, network jitter data, and network packet loss rate data; the audio feature data includes audio decoding rate data, audio buffer queue data, and audio consumption rate data.

The embodiment of the present application models the network feature data in the network processing process and the audio feature data in the audio processing process; that is, by inputting the network feature data and the audio feature data into the trained classifier model, several audio control instructions are obtained, namely the audio decoding control instruction and the audio pitch-shifting and speed-changing control instruction. Without affecting the audio effect, such as the audio consumption or playback effect, the corresponding audio processing operation in the corresponding audio processing process is controlled according to the audio control instructions. The embodiments of the present application can adapt to different network conditions, ensure the audio effect, and further ensure the effectiveness and real-time performance of video communication.

It can be understood that, in the embodiment of the present application, the network cache queue data, network jitter data, network packet loss rate data, audio decoding rate data, audio buffer queue data, and audio consumption rate data are obtained by extracting the feature data corresponding to each stage of the network processing process and the audio processing process. The embodiment of the present application avoids manually setting thresholds and formulating rules; instead, it uses a classifier model to perform supervised classification of the feature data, which facilitates adaptive delivery of audio control instructions and reduces delay.

It can be understood that, in the embodiment of the present application, controlling the corresponding audio processing operation in the corresponding audio processing process according to the audio control instruction includes but is not limited to:

According to the audio decoding control instruction, control the corresponding audio decoding operation in the corresponding audio decoding process; and/or,

According to the audio pitch shifting and speed changing control instruction, the corresponding audio pitch shifting and speed changing operation during the corresponding audio pitch shifting and speed changing process is controlled.

It can be understood that the process of audio pitch shifting and speed change may include audio stretching and audio compression.

It can be understood that the embodiment of the present application can reasonably utilize the feature data corresponding to each link of the network processing process and the audio processing process. In this process, there is no need to set a threshold, and the feature data is processed and analyzed by the classifier model. The corresponding audio processing operation in the corresponding audio processing process is controlled through the adaptive control method, and the operability is good.

Referring to FIG. 3, in some embodiments, the audio pitch-shifting and speed-changing control instruction can be obtained by at least one of the following steps:

Step S201, when the data length corresponding to the audio buffer queue data is less than the first audio consumption data length corresponding to the first audio consumption rate, the audio pitch-shifting and speed-changing control instruction is an audio stretching control instruction; or

Step S202, when the data length corresponding to the audio buffer queue data is greater than the first audio data length corresponding to the first time interval threshold and less than the second audio data length corresponding to the second time interval threshold, or when the data length corresponding to the audio buffer queue data is greater than the second audio consumption data length corresponding to the second audio consumption rate and less than the third audio consumption data length corresponding to the third audio consumption rate, the audio pitch-shifting and speed-changing control instruction is an audio compression control instruction, wherein the second audio consumption data length is greater than the first audio consumption data length.

In this embodiment, the network feature data and the audio feature data are input into the classifier model, and the corresponding audio control instruction is obtained according to the classification conditions. Depending on the classification conditions, the audio pitch-shifting and speed-changing control instruction can be an audio stretching control instruction or an audio compression control instruction.

It can be understood that the above step S201 and step S202 are only one set of classification conditions for the audio stretching control instruction or the audio compression control instruction. In other embodiments, the audio stretching control instruction or the audio compression control instruction can also be obtained according to other feature data, which will not be repeated here.
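The conditions of steps S201 and S202 can be written as a small rule function, which may be useful, for example, when reasoning about category labels. This is only one reading of the conditions; the class `Thresholds`, its field names, the return values, and the byte values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Data lengths in bytes; every value used below is made up for illustration."""
    consume_len_1: int   # first audio consumption data length
    consume_len_2: int   # second audio consumption data length (> consume_len_1)
    consume_len_3: int   # third audio consumption data length
    interval_len_1: int  # first audio data length (first time interval threshold)
    interval_len_2: int  # second audio data length (second time interval threshold)


def pitch_speed_instruction(l_audio, t):
    """Return "stretch" (S201), "compress" (S202) or None if neither applies."""
    if l_audio < t.consume_len_1:
        return "stretch"
    in_interval_band = t.interval_len_1 < l_audio < t.interval_len_2
    in_consume_band = t.consume_len_2 < l_audio < t.consume_len_3
    if in_interval_band or in_consume_band:
        return "compress"
    return None


t = Thresholds(640, 1920, 3840, 1600, 3200)
print(pitch_speed_instruction(320, t))   # stretch
print(pitch_speed_instruction(2000, t))  # compress
print(pitch_speed_instruction(1000, t))  # None
```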

Referring to Figure 4, in some embodiments, the audio decoding control instruction can be obtained by at least one of the following steps:

Step S203, when the data length corresponding to the audio buffer queue data is greater than the second audio data length corresponding to the second time interval threshold, or greater than the third audio consumption data length corresponding to the third audio consumption rate, the audio decoding control instruction is a stop audio decoding control instruction; or

Step S204, obtain the first total number of executions corresponding to the audio stretching control instruction and the second total number of executions corresponding to the audio compression control instruction within a preset time; when there is no stop audio decoding control instruction within the preset time and the absolute value of the difference between the first total number of executions and the second total number of executions is smaller than a preset threshold, the audio decoding control instruction is an audio decoding average speed control instruction.

In this embodiment, the network feature data and audio feature data are input into the classifier model, and the corresponding audio control instruction is obtained according to the classification conditions. Depending on which condition is met, the audio decoding control instruction is either a stop audio decoding control instruction or an audio decoding average speed control instruction.

It can be understood that steps S203 and S204 above are only one possible classification condition for the stop audio decoding control instruction or the audio decoding average speed control instruction. In other embodiments, the stop audio decoding control instruction or the audio decoding average speed control instruction can also be obtained according to other feature data, which will not be described in detail here.

It can be understood that the audio control instruction in this embodiment includes an audio stretch control instruction, an audio compression control instruction, a stop audio decoding control instruction, and an audio decoding average speed control instruction.

For the sending of audio control instructions, a sending time interval threshold T_control may be preset; that is, an audio control instruction is sent adaptively each time one sending time interval threshold T_control elapses. It can be understood that T_control may correspond to the data length of one RTP packet, that is, T_control may be expressed as the transmission time of an RTP packet of a preset data length.
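The relationship between a time interval and a data length can be illustrated with a short calculation. This is only a sketch: the PCM format (16 kHz, mono, 16-bit) and the 20 ms packet duration are assumptions chosen for illustration and do not come from the application, and `PcmFormat` and `data_length_for_interval` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PcmFormat:
    """Hypothetical decoded-audio format; these values are not from the source."""
    sample_rate_hz: int = 16000
    channels: int = 1
    bytes_per_sample: int = 2


def data_length_for_interval(fmt: PcmFormat, interval_ms: int) -> int:
    """Bytes of decoded PCM audio that cover `interval_ms` milliseconds."""
    bytes_per_second = fmt.sample_rate_hz * fmt.channels * fmt.bytes_per_sample
    return bytes_per_second * interval_ms // 1000


# Assumption: one RTP packet carries 20 ms of audio, so T_control = 20 ms.
T_CONTROL_MS = 20
fmt = PcmFormat()
packet_length = data_length_for_interval(fmt, T_CONTROL_MS)
```

Under these assumed values, one T_control interval corresponds to 640 bytes of decoded audio, and any multiple of T_control scales the data length proportionally.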

For the classifier model of this embodiment, in addition to the feature training vector, each item of feature training data, such as network feature training data and audio feature training data, needs a corresponding category label so that a better classifier model can be trained. After the classifier model is built, the feature vector F is input into the classifier model and a category label is output. The category label corresponds to one of the four categories of audio control instructions above (i.e., the audio stretching control instruction, the audio compression control instruction, the stop audio decoding control instruction and the audio decoding average speed control instruction), so that the control effect is achieved with good operability.
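As an illustration of training on labeled feature vectors and then mapping a feature vector F to a category label, the following minimal sketch uses a nearest-centroid classifier. The application does not state which classifier type is used, so this choice is a stand-in; the label strings, the function names and the toy training values are all hypothetical. The six features are assumed to be, in order: network buffer queue length, network jitter, packet loss rate, audio decoding rate, audio buffer queue length and audio consumption rate.

```python
from collections import defaultdict
from math import dist

# Hypothetical label names for the four categories of audio control instruction.
LABELS = ("stretch", "compress", "stop_decoding", "average_speed_decoding")


def train_centroids(samples):
    """samples: iterable of (feature_vector, label). Returns {label: centroid}."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Return the label whose centroid is closest to the feature vector F."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Made-up training data, for illustration only.
training = [
    ((2, 0.30, 0.10, 40, 1, 50), "stretch"),
    ((2, 0.25, 0.08, 42, 1, 50), "stretch"),
    ((5, 0.05, 0.01, 60, 5, 50), "compress"),
    ((5, 0.04, 0.01, 62, 5, 50), "compress"),
    ((6, 0.02, 0.00, 80, 8, 50), "stop_decoding"),
    ((3, 0.02, 0.00, 50, 3, 50), "average_speed_decoding"),
]
centroids = train_centroids(training)
label = classify(centroids, (2, 0.28, 0.09, 41, 1, 50))
```

In practice the features would likely need scaling before distances are compared, since rates and ratios live on very different ranges; that step is omitted here for brevity.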

When the data length corresponding to the audio buffer queue data is greater than the second audio data length corresponding to the second time interval threshold 6*T_control, or greater than the third audio consumption data length corresponding to the third audio consumption rate 6*V_play', the audio decoding control instruction corresponding to the category label is the stop audio decoding control instruction.

When the data length corresponding to the audio buffer queue data is smaller than the first audio consumption data length corresponding to the first audio consumption rate V_play', the audio pitch and speed change control instruction corresponding to the category label is an audio stretching control instruction.

When the data length corresponding to the audio buffer queue data is greater than the first audio data length corresponding to the first time interval threshold 4*T_control and less than the second audio data length corresponding to the second time interval threshold 6*T_control, or the data length corresponding to the audio buffer queue data is greater than the second audio consumption data length corresponding to the second audio consumption rate 4*V_play' and smaller than the third audio consumption data length corresponding to the third audio consumption rate 6*V_play', the audio pitch and speed change control instruction corresponding to the category label is an audio compression control instruction, wherein the second audio consumption data length is greater than the first audio consumption data length.

Obtain the first total number of executions n_draw corresponding to the audio stretching control instruction and the second total number of executions n_compress corresponding to the audio compression control instruction within the preset time. When there is no stop audio decoding control instruction within the preset time (for example, among the current n_control audio control instructions), and the absolute value of the difference between the first total number of executions and the second total number of executions is less than the preset threshold Δn, that is, abs(n_draw - n_compress) < Δn, the audio decoding control instruction corresponding to the category label is the audio decoding average speed control instruction. This indicates that the total numbers of executions of the audio stretching control instruction and the audio compression control instruction are roughly equal within the preset time.
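The condition abs(n_draw - n_compress) < Δn can be checked over a window of recently issued instructions. The sketch below assumes the window is available as a list of label strings; the label names and the function name `decoding_instruction` are hypothetical.

```python
from collections import Counter


def decoding_instruction(recent_instructions, delta_n):
    """Return 'average_speed_decoding' when the window qualifies, else None.

    recent_instructions: labels issued within the preset time (assumed input).
    delta_n: the preset threshold on |n_draw - n_compress|.
    """
    counts = Counter(recent_instructions)
    # Any stop-decoding instruction in the window rules out average-speed decoding.
    if counts["stop_decoding"] > 0:
        return None
    n_draw = counts["stretch"]
    n_compress = counts["compress"]
    if abs(n_draw - n_compress) < delta_n:
        return "average_speed_decoding"
    return None
```

For example, a window of two stretch instructions and one compression instruction with Δn = 2 qualifies, while the same window with one stop instruction added does not.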

It can be understood that, since T_control may correspond to the data length of one RTP packet, the first time interval threshold 4*T_control may correspond to the data length of 4 RTP packets, that is, the first audio data length; the second time interval threshold 6*T_control may correspond to the data length of 6 RTP packets, that is, the second audio data length.

It can be understood that the first audio consumption rate V_play' represents the rate at which the first audio consumption data is consumed within a preset unit time during the audio consumption process, where the first audio consumption data has the first audio consumption data length. Accordingly, the second audio consumption rate 4*V_play' represents the rate at which 4 times the first audio consumption data is consumed within the preset unit time; this data has 4 times the first audio consumption data length, that is, the second audio consumption data length. Likewise, the third audio consumption rate 6*V_play' represents the rate at which 6 times the first audio consumption data is consumed within the preset unit time; this data has 6 times the first audio consumption data length, that is, the third audio consumption data length.
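The threshold rules above can be collected into one labeling function. This is a sketch under stated assumptions: `packet_len` stands for the data length of one RTP packet (one T_control), `consume_len` stands for the first audio consumption data length (data consumed per unit time at V_play'), and the label strings are hypothetical. The source does not say which rule wins when several conditions hold at once, so the order used here (stop, then compress, then stretch) is an assumption.

```python
from typing import Optional


def label_from_buffer(buffer_len: int, packet_len: int, consume_len: int) -> Optional[str]:
    """Map the audio buffer queue data length to a category label, or None."""
    first_audio_len = 4 * packet_len      # corresponds to 4*T_control
    second_audio_len = 6 * packet_len     # corresponds to 6*T_control
    first_consume_len = consume_len       # corresponds to V_play'
    second_consume_len = 4 * consume_len  # corresponds to 4*V_play'
    third_consume_len = 6 * consume_len   # corresponds to 6*V_play'

    # Assumed priority: an overfull buffer stops decoding before anything else.
    if buffer_len > second_audio_len or buffer_len > third_consume_len:
        return "stop_decoding"
    if (first_audio_len < buffer_len < second_audio_len
            or second_consume_len < buffer_len < third_consume_len):
        return "compress"
    if buffer_len < first_consume_len:
        return "stretch"
    # No rule matched; the source leaves this region to other conditions.
    return None
```

With `packet_len = consume_len = 640`, a buffer of 300 bytes yields a stretch label, 3200 bytes a compress label, and 4000 bytes a stop label; values between 640 and 2560 match none of these three rules.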

It can be understood that when the audio decoding average speed control command is issued, there are no audio stretch control commands, audio compression control commands, and audio decoding stop control commands.

It can be understood that the above category labels, that is, the audio control instructions, are assigned according to rules/classification conditions. In some embodiments, the samples within a category are highly repetitive. In order to reduce the amount of feature data and the data processing difficulty of the classifier model, unsupervised clustering, such as the k-means clustering algorithm, is performed on the input samples within the same category, which facilitates the extraction of feature data. Through unsupervised clustering, the number of input samples corresponding to each of the four category labels is balanced, and the input samples are more representative.
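One way to picture this per-category reduction is to run k-means inside each category and keep only the resulting centroids as representative samples. The following is a minimal pure-Python sketch of Lloyd's k-means, not the application's implementation; the seeding with the first k points, the iteration count, the function names and the toy data are all assumptions.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


def balance_by_class(samples_by_label, k):
    """Replace each category's samples with at most k representative centroids."""
    return {
        label: kmeans(points, min(k, len(points)))
        for label, points in samples_by_label.items()
    }


# Made-up two-dimensional samples: one category is over-represented.
samples = {
    "stretch": [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.9, 1.1), (5.1, 5.1)],
    "compress": [(9.0, 9.0), (9.2, 8.8)],
}
balanced = balance_by_class(samples, 2)
```

After this step each category contributes the same number of samples (two here), so a frequently triggered label no longer dominates the training set.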

It can be understood that when the audio consumption rate data is greater than the audio decoding rate data, an audio stretch control command is issued.

In one embodiment, in the network prediction process, when network jitter and packet loss are pronounced, for example when the network jitter data and the network packet loss rate data are greater than their set values, the audio decoding data in the audio buffering process will soon be insufficient for consumption in the audio consumption process, that is, insufficient to perform the audio consumption operation, and an audio stretching control instruction is issued.

In one embodiment, when the audio consumption rate data fed back during the audio consumption process is less than the audio decoding rate data, an audio compression control command is issued.

In one embodiment, when the network condition is good and the audio decoding rate data is fast, a control instruction to stop audio decoding or an audio compression control instruction is issued.
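The rate-based and network-based conditions in the preceding embodiments can be sketched as a small decision function. The parameter names, label strings and the order of the checks are assumptions. The last embodiment (good network and fast decoding) is deliberately not encoded, because the source does not say how to choose between stopping decoding and compressing.

```python
def rate_based_instruction(consume_rate, decode_rate, jitter, loss_rate,
                           jitter_limit, loss_limit):
    """Return 'stretch', 'compress', or None from rate and network features."""
    # Pronounced jitter and loss: decoded audio will soon run short, so stretch.
    if jitter > jitter_limit and loss_rate > loss_limit:
        return "stretch"
    # Playback drains the buffer faster than decoding fills it.
    if consume_rate > decode_rate:
        return "stretch"
    # Decoding fills the buffer faster than playback drains it.
    if consume_rate < decode_rate:
        return "compress"
    return None
```

A call such as `rate_based_instruction(50, 40, 0.0, 0.0, 30.0, 0.05)` returns the stretch label because consumption outpaces decoding.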

Referring to FIG. 5, it can be understood that the audio processing process includes an audio stretching process; step S300 includes, but is not limited to:

Step S310, when the audio pitch and speed change control instruction is an audio stretching control instruction, according to the audio stretching control instruction, control the audio decoding data to undergo the audio stretching operation in the audio stretching process to obtain audio stretching data;

Step S320, control the audio stretching data to go through the audio consumption operation in the audio consumption process after the audio buffering operation in the audio buffering process.

It can be understood that the audio stretching control instruction controls the audio decoding data to undergo the audio stretching operation in the audio stretching process to obtain the audio stretching data, so that the audio stretching data undergoes the audio buffering operation in the audio buffering process and then the audio consumption operation in the audio consumption process.

Referring to FIG. 6, it can be understood that the audio processing process includes an audio compression process; step S300 includes, but is not limited to:

Step S330, when the audio pitch and speed change control instruction is an audio compression control instruction, according to the audio compression control instruction, control the audio decoding data to undergo the audio compression operation in the audio compression process to obtain audio compression data;

Step S340, control the audio compression data to go through the audio consuming operation in the audio consuming process after the audio buffering operation in the audio buffering process.

It can be understood that the audio compression control instruction controls the audio decoding data to undergo the audio compression operation in the audio compression process to obtain the audio compression data, and the audio compression data then undergoes the audio buffering operation in the audio buffering process, followed by the audio consumption operation in the audio consumption process.

It can be understood that, in some embodiments, the audio decoding data in the embodiment of the present application includes silent data, voiced data and unprocessed decoded data. The audio pitch and speed change process includes the audio stretching process and the audio compression process.

The audio stretching control instruction controls the silent data and the voiced data to undergo the audio stretching operation in the audio stretching process to obtain audio stretching data (i.e., silent stretching data and voiced stretching data). After that, the audio stretching data and the unprocessed decoded data undergo the audio buffering operation in the audio buffering process and then the audio consumption operation in the audio consumption process.

Alternatively, the audio compression control instruction controls the silent data and the voiced data to undergo the audio compression operation in the audio compression process to obtain audio compression data (i.e., silent compressed data and voiced compressed data). After that, the audio compression data and the unprocessed decoded data undergo the audio buffering operation in the audio buffering process and then the audio consumption operation in the audio consumption process.

It can be understood that, in the embodiment of the present application, audio pitch and speed change control instructions (such as audio stretching control instructions or audio compression control instructions) are issued in a controllable manner according to the network processing situation and the audio processing situation, so that the audio decoding data undergoes the corresponding audio pitch and speed change operation in the audio pitch and speed change process to obtain audio pitch and speed change data (that is, audio stretching data or audio compression data), which effectively guarantees the audio effect and reduces the delay.
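The handling of silent, voiced and unprocessed data can be sketched as follows. The application does not specify the stretching or compression algorithm, so linear-interpolation resampling is used here purely as a stand-in (it changes pitch together with speed). The frame representation, the function names, the label strings and the factor 1.25 are all assumptions.

```python
def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 stretches, factor < 1 compresses."""
    if len(samples) < 2 or factor <= 0:
        return list(samples)
    out_len = max(2, round(len(samples) * factor))
    step = (len(samples) - 1) / (out_len - 1)
    out = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def apply_instruction(frames, instruction, factor=1.25):
    """frames: list of (kind, samples), kind in {'silent', 'voiced', 'unprocessed'}."""
    if instruction == "stretch":
        ratio = factor
    elif instruction == "compress":
        ratio = 1 / factor
    else:
        return [list(s) for _, s in frames]
    # Only silent and voiced frames are changed; unprocessed data passes through.
    return [
        change_speed(s, ratio) if kind in ("silent", "voiced") else list(s)
        for kind, s in frames
    ]
```

The output frames would then be handed to the audio buffer queue and consumed in order, with the unprocessed frames unchanged in length.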

Referring to Fig. 7, it can be understood that the audio decoding process includes stopping the audio decoding process and the audio decoding average speed process; Step S300 includes at least one of the following:

Step S350, when the audio decoding control instruction is a stop audio decoding control instruction, according to the stop audio decoding control instruction, control the network real-time transport protocol packet data to undergo the stop audio decoding operation in the stop audio decoding process; or

Step S360, when the audio decoding control instruction is an audio decoding average speed control instruction, according to the audio decoding average speed control instruction, control the network real-time transport protocol packet data to undergo the audio decoding average speed operation in the audio decoding average speed process.

According to the stop audio decoding control instruction, the embodiment of the present application stops the audio decoding operation on the network real-time transport protocol packet data in the audio decoding process, thereby ensuring that the audio buffering operation in the audio buffering process can proceed normally, avoiding data piling up in the audio buffering process, and reducing delay.

According to the audio decoding average speed control instruction, the embodiment of the present application controls the network real-time transport protocol packet data to undergo the audio decoding average speed operation in the audio decoding average speed process, thereby ensuring that audio decoding proceeds at an average speed and a better audio effect is obtained.
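Steps S310 to S360 amount to dispatching each audio control instruction to the operation it controls. A minimal sketch of such a dispatcher is shown below; the `AudioPipeline` class, its attributes and the label strings are hypothetical. Resetting the speed mode on average-speed decoding is an assumption based on the statement that no stretching, compression or stop instruction is present when that instruction is issued.

```python
class AudioPipeline:
    """Hypothetical stand-in for the decoding and pitch/speed change stages."""

    def __init__(self):
        self.decoding_enabled = True
        self.speed_mode = "normal"

    def stretch(self):                  # corresponds to step S310
        self.speed_mode = "stretch"

    def compress(self):                 # corresponds to step S330
        self.speed_mode = "compress"

    def stop_decoding(self):            # corresponds to step S350
        self.decoding_enabled = False

    def decode_at_average_speed(self):  # corresponds to step S360
        self.decoding_enabled = True
        self.speed_mode = "normal"


HANDLERS = {
    "stretch": AudioPipeline.stretch,
    "compress": AudioPipeline.compress,
    "stop_decoding": AudioPipeline.stop_decoding,
    "average_speed_decoding": AudioPipeline.decode_at_average_speed,
}


def execute(pipeline, instruction):
    """Apply one audio control instruction to the pipeline."""
    try:
        handler = HANDLERS[instruction]
    except KeyError:
        raise ValueError(f"unknown audio control instruction: {instruction!r}") from None
    handler(pipeline)
```

A table-driven dispatcher keeps the mapping from category label to operation in one place, which makes it easy to see that each of the four labels has exactly one handler.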

It can be understood that the embodiment of the present application can solve problems such as an unreasonable buffer queue length setting, untimely audio stretching or compression, and large audio delay. With the audio control method of the embodiment of the present application, the classifier model in the audio control process classifies the network feature data and audio feature data to obtain audio control instructions, thereby effectively reducing delay and balancing delay against audio quality.

The embodiment of the present application also provides an audio control device, configured to implement the audio control method described in the first aspect above.

Referring to Fig. 8, in some embodiments, the embodiment of the present application also provides an audio control system, including:

An audio control device, configured to execute the audio control method as described in the first aspect above;

It also includes a network buffer queue device, a network prediction device, an audio decoding device, an audio buffer queue device, an audio tone shifting device and an audio consumption device;

Wherein, the network buffer queue device, the audio decoding device, the audio buffer queue device and the audio consumption device are connected in sequence, and each of them is also connected to the audio control device; the network prediction device is connected to the network buffer queue device and to the audio control device; and the audio pitch shifting device is connected to the audio buffer queue device and to the audio control device.
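The connections described above can be written down as an undirected graph, which makes the topology easy to check. The device identifiers below are hypothetical names for the devices listed in this embodiment.

```python
# Devices connected in sequence along the audio path.
PIPELINE = (
    "network_buffer_queue",
    "audio_decoding",
    "audio_buffer_queue",
    "audio_consumption",
)
CONTROL = "audio_control"


def build_connections():
    """Return the set of undirected device-to-device connections."""
    edges = set()
    # Sequential connections along the audio path.
    for a, b in zip(PIPELINE, PIPELINE[1:]):
        edges.add(frozenset((a, b)))
    # Every device on the path is also connected to the audio control device.
    for device in PIPELINE:
        edges.add(frozenset((device, CONTROL)))
    # Network prediction device: network buffer queue and audio control.
    for peer in ("network_buffer_queue", CONTROL):
        edges.add(frozenset(("network_prediction", peer)))
    # Audio pitch shifting device: audio buffer queue and audio control.
    for peer in ("audio_buffer_queue", CONTROL):
        edges.add(frozenset(("audio_pitch_shifting", peer)))
    return edges


def connected(edges, a, b):
    """True if devices a and b are directly connected."""
    return frozenset((a, b)) in edges
```

The audio control device ends up connected to all six other devices, which matches its role of collecting feature data from them and issuing instructions to them.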

It can be understood that the network cache queue device is used to store and transmit network real-time transport protocol packet data to adapt to network jitter; the network cache queue device corresponds to the network cache queue process and can be used to perform network cache operations; the network cache queue data is obtained after the network cache queue device performs the network cache operation on the network real-time transport protocol packet data.

Network prediction device: used to model the network real-time transport protocol packet data and obtain network prediction data, such as network jitter data and network packet loss rate data, through network prediction operations, for example by collecting statistics on the network situation in the previous cycle; the network prediction device corresponds to the network prediction process and can be used to perform network prediction operations; after the network cache queue device performs the network cache operation on the network real-time transport protocol packet data, the network prediction device performs the network prediction operation to obtain the network jitter data and the network packet loss rate data.
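Statistics for the previous cycle could be computed roughly as in the sketch below. The application does not specify the estimators; the jitter here follows the interarrival jitter formula of RFC 3550 (a running estimate updated by 1/16 of the deviation), and the loss rate is derived from gaps in the sequence numbers. The packet tuple layout and the function name are assumptions, and sequence number wraparound is ignored for brevity.

```python
def network_prediction(packets):
    """Estimate (jitter, loss_rate) for one cycle.

    packets: list of (seq, send_ts, recv_ts) in arrival order, with both
    timestamps in the same time unit (assumed input format).
    """
    if not packets:
        return 0.0, 0.0
    jitter = 0.0
    prev = packets[0]
    for cur in packets[1:]:
        # Difference in relative transit time between consecutive packets.
        d = (cur[2] - prev[2]) - (cur[1] - prev[1])
        jitter += (abs(d) - jitter) / 16
        prev = cur
    seqs = {seq for seq, _, _ in packets}
    expected = max(seqs) - min(seqs) + 1
    loss_rate = (expected - len(seqs)) / expected
    return jitter, loss_rate
```

The two returned values would serve as the network jitter data and network packet loss rate data in the feature vector passed to the classifier model.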

Audio decoding device: used to perform audio decoding operations on network real-time transport protocol packet data to obtain audio decoding rate data and audio decoding data; the audio decoding device corresponds to the audio decoding process and can be used to perform audio decoding operations; after the network cache queue device performs the network cache operation on the network real-time transport protocol packet data, the audio decoding device performs the audio decoding operation to obtain the audio decoding rate data and the audio decoding data.

Audio buffer queue device: used to store and transmit the audio decoding data produced by the audio decoding operation of the audio decoding device, so as to adaptively expand or shrink the audio buffer queue; the audio buffer queue device corresponds to the audio buffering process and can be used to perform audio buffering operations; after the audio decoding device performs the audio decoding operation on the RTP packet data to obtain the audio decoding data, the audio buffer queue device performs the audio buffering operation on the audio decoding data.

Audio pitch shifting device: used to perform different audio pitch shifting and speed changing operations on audio decoding data such as silent data and voiced data, such as performing audio stretching or audio compression operations, so as to effectively ensure audio effects. The audio pitch shifting and speed changing device corresponds to the audio pitch shifting and speed changing process, and can be used to perform audio pitch shifting and speed changing operations. The audio pitch shifting operation includes an audio stretching operation and an audio compression operation, and correspondingly, the audio pitch shifting device further includes an audio stretching module and an audio compression module.

For example, the audio control device is used to control, according to the audio stretching control instruction, the audio decoding data to undergo an audio stretching operation through the audio stretching module to obtain audio stretching data; the audio control device is also used to control the audio stretching data to undergo the audio buffering operation through the audio buffer queue device and then the audio consumption operation through the audio consumption device. Alternatively, the audio control device is used to control, according to the audio compression control instruction, the audio decoding data to undergo an audio compression operation through the audio compression module to obtain audio compression data; the audio control device is also used to control the audio compression data to undergo the audio buffering operation through the audio buffer queue device and then the audio consumption operation through the audio consumption device.

Audio consumption device: used to obtain audio consumption/playback characteristics, such as audio consumption rate data, for different platforms and under playback congestion, and to feed them back to the audio control device; the audio consumption device corresponds to the audio consumption process and can be used to perform audio consumption operations; in some embodiments, the audio consumption device can also be an audio playback device; after the audio buffer queue device performs the audio buffering operation on the audio decoding data, the audio consumption device performs the audio consumption operation to obtain the audio consumption rate data.

Audio control device: used to issue audio control instructions, for example issuing a stop audio decoding control instruction or an audio decoding average speed control instruction to control whether the audio decoding device performs decoding, and issuing an audio stretching control instruction or an audio compression control instruction to control the audio pitch shifting device to perform an audio stretching operation or an audio compression operation.

The audio control device also includes a classifier module. In the embodiment of the present application, for the four audio control instructions, supervised learning can be performed on the feature data through the classifier module, so as to classify the feature data and output the corresponding category label, that is, the corresponding audio control instruction, achieving self-adaptation. The embodiment of the present application performs adaptive adjustment according to network processing conditions and audio processing conditions, with a small number of parameters and little manual involvement; fine-tuning can be performed on a classifier module such as a trained classifier model, and the iteration speed is fast.

In addition, the embodiment of the third aspect of the present application also provides an audio control device, the audio control device includes: a memory, a processor, and a computer program stored in the memory and operable on the processor.

The processor and memory can be connected by a bus or other means.

As a non-transitory computer-readable storage medium, memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory may include memory located remotely from the processor, which remote memory may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The non-transitory software programs and instructions required to realize the audio control method of the embodiment of the first aspect above are stored in the memory, and when executed by the processor, the audio control method in the above embodiment is executed, for example, method steps S100 to S300 in FIG. 1, method step S210 in FIG. 2, method steps S201 to S202 in FIG. 3, method steps S203 to S204 in FIG. 4, method steps S310 to S320 in FIG. 5, method steps S330 to S340 in FIG. 6, and method steps S350 to S360 in FIG. 7 described above.

The device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, an embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are executed by a processor or a controller, for example by a processor in the above device embodiment, the processor is caused to execute the audio control method in the above embodiment, for example, method steps S100 to S300 in FIG. 1, method step S210 in FIG. 2, method steps S201 to S202 in FIG. 3, method steps S203 to S204 in FIG. 4, method steps S310 to S320 in FIG. 5, method steps S330 to S340 in FIG. 6, and method steps S350 to S360 in FIG. 7 described above.

The embodiment of the present application includes: obtaining the network feature data in the network processing process and the audio feature data in the audio processing process; inputting the network feature data and audio feature data into the classifier model and obtaining several audio control instructions by classification; and controlling the corresponding audio processing operation in the corresponding audio processing process according to the audio control instructions. With this arrangement, the embodiment of the present application can extract the corresponding feature data from the network processing process and the audio processing process, so that the classifier model can process the feature data and obtain the classification result, that is, several audio control instructions. The audio control instructions obtained by classification are then issued adaptively to control the corresponding audio processing operations in the corresponding audio processing process, which effectively ensures the audio effect and offers good operability.

Those skilled in the art can understand that all or some of the steps and systems in the methods disclosed above can be implemented as software, firmware, hardware and an appropriate combination thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

The above is a specific description of several embodiments of the present application, but the present application is not limited to the above-mentioned embodiments. Those skilled in the art can also make various equivalent modifications or replacements without departing from the spirit of the present application, and these equivalent modifications or replacements all fall within the scope defined by the claims of the present application.

Claims

A method of audio control, comprising:

Obtain network feature data during network processing and audio feature data during audio processing;

Inputting the network characteristic data and the audio characteristic data into a classifier model to obtain several audio control instructions;

According to the audio control instruction, the corresponding audio processing operation in the corresponding audio processing process is controlled.
The method according to claim 1, wherein the network characteristic data includes network cache queue data, network jitter data and network packet loss rate data, and the network processing process includes a network cache queue process and a network prediction process;

Wherein, the network cache queue data is obtained by the network real-time transport protocol packet data through the network cache operation in the network cache queue process; the network jitter data and the network packet loss rate data are obtained by the network real-time transport protocol The packet data is obtained through the network prediction operation in the network prediction process after the network buffer operation in the network buffer queue process.
The method according to claim 2, wherein the audio feature data includes audio decoding rate data, audio buffer queue data and audio consumption rate data, and the audio processing process includes an audio decoding process, an audio buffering process and an audio consumption process;

Wherein, the audio decoding rate data is obtained by the network real-time transport protocol packet data through the audio decoding operation in the audio decoding process after the network caching operation in the network caching queue process; the audio buffer queue data is obtained by audio decoding data through the audio buffering operation in the audio buffering process, the audio decoding data being obtained by the network real-time transport protocol packet data through the audio decoding operation in the audio decoding process; the audio consumption rate data is obtained by the audio decoding data through the audio consuming operation in the audio consuming process after the audio buffering operation in the audio buffering process.
The method according to claim 3, wherein said inputting said network feature data and said audio feature data into a classifier model to obtain several audio control instructions, including:

inputting the network buffer queue data, the network jitter data, the network packet loss rate data, the audio decoding rate data, the audio buffer queue data and the audio consumption rate data into the classifier model, to obtain the audio decoding control instruction and the audio pitch and speed change control instruction.
The method according to claim 4, wherein the audio pitch and speed change control instruction is obtained by at least one of the following steps:

When the data length corresponding to the audio buffer queue data is less than the first audio consumption data length corresponding to the first audio consumption rate, the audio pitch and speed change control instruction is an audio stretching control instruction; or

When the data length corresponding to the audio buffer queue data is greater than the first audio data length corresponding to the first time interval threshold and smaller than the second audio data length corresponding to the second time interval threshold, or the data length corresponding to the audio buffer queue data is greater than the second audio consumption data length corresponding to the second audio consumption rate and less than the third audio consumption data length corresponding to the third audio consumption rate, the audio pitch and speed change control instruction is an audio compression control instruction, wherein the second audio consumption data length is greater than the first audio consumption data length.
The method according to claim 5, wherein the audio decoding control instruction is obtained by at least one of the following steps:

When the data length corresponding to the audio buffer queue data is greater than the second audio data length corresponding to the second time interval threshold or greater than the third audio consumption data length corresponding to the third audio consumption rate, the audio decoding control instruction is a stop audio decoding control instruction; or

Obtain the first total number of executions corresponding to the audio stretch control instruction and the second total number of executions corresponding to the audio compression control instruction within a preset time, when the stop audio decoding control instruction does not exist within the preset time , and the absolute value of the difference between the first total number of execution times and the second total number of execution times is smaller than a preset threshold, the audio decoding control instruction is an audio decoding average speed control instruction.
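The two conditions of this claim can be sketched in Python as follows. This is an illustrative rule-based stand-in under stated assumptions, not the claimed classifier model: the names `DecodeInstruction` and `decode_instruction` are hypothetical, the caller is assumed to have already counted the stretching and compression instructions within the preset time, and evaluating the stop condition first is an assumption.

```python
from enum import Enum
from typing import Optional


class DecodeInstruction(Enum):
    STOP = "stop"
    AVERAGE_SPEED = "average_speed"


def decode_instruction(
    buffer_len: int,
    time_len_2: int,            # second audio data length (second time interval threshold)
    consume_len_3: int,         # third audio consumption data length
    stretch_count: int,         # first total number of executions in the preset time
    compress_count: int,        # second total number of executions in the preset time
    stop_seen_in_window: bool,  # whether a stop instruction occurred in the preset time
    balance_threshold: int,     # preset threshold on the count difference
) -> Optional[DecodeInstruction]:
    """Return the decoding control instruction implied by the inputs, or None."""
    # Audio buffer queue is over-full: stop decoding.
    if buffer_len > time_len_2 or buffer_len > consume_len_3:
        return DecodeInstruction.STOP
    # Stretching and compression roughly cancel out and no stop occurred:
    # decode at average speed.
    if not stop_seen_in_window and abs(stretch_count - compress_count) < balance_threshold:
        return DecodeInstruction.AVERAGE_SPEED
    return None
```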
The method according to claim 5, wherein the audio processing process comprises an audio stretching process;

controlling the corresponding audio processing operation in the corresponding audio processing process according to the audio control instructions comprises:

when the audio pitch and speed change control instruction is the audio stretching control instruction, controlling, according to the audio stretching control instruction, the audio decoding data to undergo an audio stretching operation in the audio stretching process to obtain audio stretching data; and

controlling the audio stretching data to undergo the audio buffering operation in the audio buffering process and then the audio consumption operation in the audio consumption process.
The method according to any one of claims 5 to 7, wherein the audio processing process comprises an audio compression process;

controlling the corresponding audio processing operation in the corresponding audio processing process according to the audio control instructions comprises:

when the audio pitch and speed change control instruction is the audio compression control instruction, controlling, according to the audio compression control instruction, the audio decoding data to undergo an audio compression operation in the audio compression process to obtain audio compression data; and

controlling the audio compression data to undergo the audio buffering operation in the audio buffering process and then the audio consumption operation in the audio consumption process.
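The order of operations recited for the stretching and compression branches (change length, then buffer, then consume) can be sketched in Python. The claims do not specify how stretching or compression is performed; the linear-interpolation resampler below is a swapped-in assumption that changes pitch together with speed, and the names `resample`, `process`, `consume` and the factors 1.25 and 0.8 are hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional


def resample(samples: List[float], factor: float) -> List[float]:
    """Linear-interpolation resampling: factor > 1 stretches, factor < 1 compresses."""
    if not samples:
        return []
    out_len = max(1, round(len(samples) * factor))
    if out_len == 1 or len(samples) == 1:
        return [samples[0]] * out_len
    step = (len(samples) - 1) / (out_len - 1)
    out = []
    for i in range(out_len):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def process(decoded: List[float], instruction: Optional[str], buffer: Deque[float],
            stretch_factor: float = 1.25, compress_factor: float = 0.8) -> None:
    """Apply the stretching or compression operation, then the buffering operation."""
    if instruction == "stretch":
        data = resample(decoded, stretch_factor)
    elif instruction == "compress":
        data = resample(decoded, compress_factor)
    else:
        data = list(decoded)  # assumption: pass through when no instruction applies
    buffer.extend(data)


def consume(buffer: Deque[float], n: int) -> List[float]:
    """Consumption operation: take up to n samples from the front of the buffer."""
    return [buffer.popleft() for _ in range(min(n, len(buffer)))]
```

A caller would decode a frame, pass it to `process` together with the current instruction, and let the playback side call `consume` at its own rate.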
The method according to claim 6, wherein the audio decoding process comprises a stop audio decoding process and an audio decoding average speed process;

controlling the corresponding audio processing operation in the corresponding audio processing process according to the audio control instructions comprises at least one of the following:

when the audio decoding control instruction is the stop audio decoding control instruction, controlling, according to the stop audio decoding control instruction, the real-time transport protocol packet data to undergo a stop audio decoding operation in the stop audio decoding process; or

when the audio decoding control instruction is the audio decoding average speed control instruction, controlling, according to the audio decoding average speed control instruction, the real-time transport protocol packet data to undergo an audio decoding average speed operation in the audio decoding average speed process.
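One possible way to dispatch the two decoding control instructions is sketched below in Python. The claim does not define how the average speed operation is carried out; decoding a fixed number of packets per tick is an assumption, as are the name `decode_tick`, the string instruction values, the `packets_per_tick` parameter and the fallback of decoding every queued packet when no instruction applies.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


def decode_tick(
    rtp_queue: Deque[bytes],
    instruction: Optional[str],
    decode: Callable[[bytes], bytes],
    packets_per_tick: int = 1,
) -> List[bytes]:
    """Decode real-time transport protocol packets for one tick under an instruction."""
    if instruction == "stop":
        # Stop audio decoding: packets stay in the network buffer queue.
        return []
    if instruction == "average_speed":
        # Assumed meaning: decode a constant number of packets per tick.
        count = packets_per_tick
    else:
        # Assumed fallback: decode everything that is queued.
        count = len(rtp_queue)
    decoded = []
    for _ in range(min(count, len(rtp_queue))):
        decoded.append(decode(rtp_queue.popleft()))
    return decoded
```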
An audio control apparatus, configured to execute the audio control method according to any one of claims 1 to 9.
An audio control device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the audio control method according to any one of claims 1 to 9.
A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute the audio control method according to any one of claims 1 to 9.