CN114363631B - Deep learning-based audio and video processing method and device


Info

Publication number
CN114363631B
CN114363631B (application CN202111495106.XA)
Authority
CN
China
Prior art keywords
data
predicted
audio
accuracy
deep learning
Prior art date
Legal status
Active
Application number
CN202111495106.XA
Other languages
Chinese (zh)
Other versions
CN114363631A (en)
Inventor
余丹
兰雨晴
黄永琢
王丹星
唐霆岳
Current Assignee
China Standard Intelligent Security Technology Co Ltd
Original Assignee
China Standard Intelligent Security Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Standard Intelligent Security Technology Co Ltd
Priority: CN202111495106.XA, filed 2021-12-09
Publication of CN114363631A: 2022-04-15
Application granted; publication of CN114363631B: 2022-08-05
Legal status: Active

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a deep learning-based audio and video processing method and device, relating to the technical field of data processing. The method predicts the compressed audio/video stream with a deep-learning neural network to obtain predicted data for each frame; compares each frame's predicted data with the original data of the audio/video stream to obtain, respectively, the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data; judges the current prediction level of the deep-learning neural network from those two accuracies; and transmits the predicted level in binary form to a staff member's terminal, where it is displayed as lit bar cells. By replacing the traditional function-based prediction scheme with deep-learning neural-network prediction of the compressed audio/video frames, prediction efficiency can be improved.

Description

Deep learning-based audio and video processing method and device
Technical Field
The application relates to the technical field of data processing, in particular to an audio and video processing method and device based on deep learning.
Background
Audio/video compression aims to reduce the audio/video data rate while preserving auditory and visual quality as far as possible; the compression ratio generally refers to the ratio of the data volume after compression to the data volume before compression. In the related art, compression mainly retains only the I-frames and the motion vectors of the other frames, with P-frames and B-frames predicted from the I-frames. This prediction method is relatively fixed, requires a large amount of information to be stored, and consumes considerable computing resources. Although such encoding can compress the code stream to a very small size, it is difficult to predict and restore the uncompressed complete code stream from the compressed one, so the complete code stream must be retransmitted whenever it is needed. There is therefore a need to solve this technical problem.
Disclosure of Invention
In view of the above problems, the present application provides a deep learning-based audio and video processing method and device that overcome, or at least partially solve, the above problems: by replacing the traditional function-based prediction scheme with deep-learning neural-network prediction, prediction efficiency can be improved. The technical scheme is as follows:
in a first aspect, a deep learning-based audio and video processing method is provided, comprising the following steps:
predicting the compressed audio/video stream with a deep-learning neural network to obtain predicted data for each frame;
comparing each frame's predicted data with the original data of the audio/video stream to obtain, respectively, the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data;
judging the current prediction level of the deep-learning neural network according to the relevant-data accuracy and the non-relevant-data accuracy;
and transmitting the predicted level in binary form to a staff member's terminal, where it is displayed as lit bar cells.
In a possible implementation, the bar display is a vertical bar on the terminal consisting of multiple rows in a single column; each row is an independent cell, and each cell's on/off state can be controlled individually.
In a possible implementation, the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data of each frame are obtained by comparing each frame's predicted data with the original data of the audio/video stream according to the following formulas:
[Formula for L(i): present only as image BDA0003400471790000021 in the original]
[Formula for F(i): present only as image BDA0003400471790000022 in the original]
where L(i) denotes the relevant-data accuracy of the i-th frame predicted by the deep-learning neural network, and F(i) denotes the non-relevant-data accuracy of the i-th frame so predicted; if the condition shown in image BDA0003400471790000023 holds, then L(i) = 1, and if the condition shown in image BDA0003400471790000024 holds, then F(i) = 1; D_i(a) denotes the a-th bit of the i-th frame of predicted data in binary form; D_{i,0}(a) denotes the a-th bit of the i-th frame of the original audio/video stream data in binary form; G_i(a) denotes a feature-detection function, with G_i(a) = 1 if the a-th bit of the i-th frame of the original data in binary form is a feature bit reflecting a characteristic value of the audio/video stream, and G_i(a) = 0 otherwise; m_i denotes the number of bits in the i-th frame of the original data in binary form; |·| denotes the absolute value; and [·]_10 denotes conversion of the bracketed value to decimal.
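Since the L(i) and F(i) formulas survive here only as images, the following minimal Python sketch illustrates one plausible reading consistent with the symbol definitions above: L(i) as the fraction of feature bits (those with G_i(a) = 1) predicted correctly, F(i) as the fraction of non-feature bits predicted correctly, and each accuracy defaulting to 1 when its bit class is empty, matching the stated boundary conditions. The function name, the is_feature callback, and the normalization are illustrative assumptions, not the patent's verbatim formulas:

    from typing import Callable, Sequence

    def frame_accuracies(pred_bits: Sequence[int],
                         orig_bits: Sequence[int],
                         is_feature: Callable[[int], bool]) -> tuple[float, float]:
        """Compare one frame's predicted and original bitstreams.

        Returns (L, F): the fraction of feature bits and of non-feature
        bits predicted correctly; a class with no bits scores 1, matching
        the boundary conditions stated in the text (an assumed reading).
        """
        feat_total = feat_wrong = other_total = other_wrong = 0
        for a, (d, d0) in enumerate(zip(pred_bits, orig_bits), start=1):
            if is_feature(a):                # G_i(a) = 1: a feature bit
                feat_total += 1
                feat_wrong += abs(d - d0)    # |D_i(a) - D_{i,0}(a)|
            else:                            # G_i(a) = 0: a non-feature bit
                other_total += 1
                other_wrong += abs(d - d0)
        L = 1.0 if feat_total == 0 else 1.0 - feat_wrong / feat_total
        F = 1.0 if other_total == 0 else 1.0 - other_wrong / other_total
        return L, F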
In a possible implementation, the binary-form data transmitted to the staff terminal is obtained from the relevant-data accuracy and the non-relevant-data accuracy using the following formula:
[Formula for (C)_2: present only as image BDA0003400471790000031 in the original]
where (C)_2 denotes the binary-form data transmitted to the staff terminal; N denotes the total number of frames in the audio/video stream; ∧ denotes logical AND; and (·)_2 denotes that the quantity in parentheses is data in binary form.
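The combining formula for (C)_2 is likewise only an image; as a hedged illustration, the sketch below has each frame contribute one bit that is set when both of its accuracies clear a threshold (reading the ∧ as a per-frame logical AND of the two accuracy judgments) and concatenates the per-frame bits into the binary payload. The threshold parameter and the concatenation scheme are assumptions:

    def level_payload(accuracies: list[tuple[float, float]],
                      threshold: float = 0.9) -> str:
        """Assumed encoding: one bit per frame, set when both the
        relevant-data accuracy L and the non-relevant-data accuracy F
        clear `threshold`; bits are concatenated frame by frame."""
        return "".join("1" if (L >= threshold and F >= threshold) else "0"
                       for (L, F) in accuracies)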
In a possible implementation, the number of independent cells lit on the vertical bar is controlled according to the binary data received by the terminal using the following formula:
[Formula for k: present only as image BDA0003400471790000032 in the original]
where k denotes the number of independent cells to be lit on the vertical bar, and K denotes the total number of independent cells on the vertical bar.
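With the cell-count formula also present only as an image, one natural reading, sketched below, scales the share of 1-bits in the received payload to the bar's K cells; this proportional mapping is an assumption:

    def lit_cells(payload: str, K: int) -> int:
        """Assumed mapping: light k of the K cells in proportion to the
        share of 1-bits in the received binary payload."""
        ones = payload.count("1")
        return round(K * ones / len(payload)) if payload else 0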
In a second aspect, a deep learning-based audio and video processing device is provided, comprising:
a prediction module, configured to predict the compressed audio/video stream with a deep-learning neural network to obtain predicted data for each frame;
a comparison module, configured to compare each frame's predicted data with the original data of the audio/video stream to obtain, respectively, the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data;
a judging module, configured to judge the current prediction level of the deep-learning neural network according to the relevant-data accuracy and the non-relevant-data accuracy;
and a transmission module, configured to transmit the predicted level in binary form to a staff member's terminal, where it is displayed as lit bar cells.
In a possible implementation, the bar display is a vertical bar on the terminal consisting of multiple rows in a single column; each row is an independent cell, and each cell's on/off state can be controlled individually.
In a possible implementation, the comparison module is further configured to:
obtain the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data of each frame by comparing each frame's predicted data with the original data of the audio/video stream using the following formulas:
[Formula for L(i): present only as image BDA0003400471790000041 in the original]
[Formula for F(i): present only as image BDA0003400471790000042 in the original]
where L(i) denotes the relevant-data accuracy of the i-th frame predicted by the deep-learning neural network, and F(i) denotes the non-relevant-data accuracy of the i-th frame so predicted; if the condition shown in image BDA0003400471790000043 holds, then L(i) = 1, and if the condition shown in image BDA0003400471790000044 holds, then F(i) = 1; D_i(a) denotes the a-th bit of the i-th frame of predicted data in binary form; D_{i,0}(a) denotes the a-th bit of the i-th frame of the original audio/video stream data in binary form; G_i(a) denotes a feature-detection function, with G_i(a) = 1 if the a-th bit of the i-th frame of the original data in binary form is a feature bit reflecting a characteristic value of the audio/video stream, and G_i(a) = 0 otherwise; m_i denotes the number of bits in the i-th frame of the original data in binary form; |·| denotes the absolute value; and [·]_10 denotes conversion of the bracketed value to decimal.
In a possible implementation, the transmission module is further configured to:
obtain the binary-form data transmitted to the staff terminal from the relevant-data accuracy and the non-relevant-data accuracy using the following formula:
[Formula for (C)_2: present only as image BDA0003400471790000045 in the original]
where (C)_2 denotes the binary-form data transmitted to the staff terminal; N denotes the total number of frames in the audio/video stream; ∧ denotes logical AND; and (·)_2 denotes that the quantity in parentheses is data in binary form.
In a possible implementation, the device further includes:
a control module, configured to control the lighting of the independent cells on the vertical bar according to the binary data received by the terminal using the following formula:
[Formula for k: present only as image BDA0003400471790000051 in the original]
where k denotes the number of independent cells to be lit on the vertical bar, and K denotes the total number of independent cells on the vertical bar.
By means of the above technical scheme, the deep learning-based audio and video processing method and device provided by the embodiments of the application first predict the compressed audio/video stream with a deep-learning neural network to obtain predicted data for each frame; compare each frame's predicted data with the original data of the audio/video stream to obtain, respectively, the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data; judge the current prediction level of the deep-learning neural network from those two accuracies; and transmit the predicted level in binary form to a staff member's terminal, where it is displayed as lit bar cells. By replacing the traditional function-based prediction scheme with deep-learning neural-network prediction of the compressed audio/video frames, prediction efficiency can be improved.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in describing the embodiments are briefly introduced below.
Fig. 1 shows a flowchart of a deep learning-based audio and video processing method according to an embodiment of the present application;
fig. 2 shows a block diagram of a deep learning-based audio and video processing device according to an embodiment of the present application;
fig. 3 shows a block diagram of a deep learning-based audio and video processing device according to another embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that such uses are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the term "include" and its variants are to be read as open-ended terms meaning "including, but not limited to".
An embodiment of the application provides a deep learning-based audio and video processing method, which can be applied to electronic devices such as mobile terminals, personal computers, and tablet computers. As shown in fig. 1, the method may include the following steps S101 to S104:
step S101, predicting the compressed audio and video stream through a deep learning and neural network to obtain predicted data of each frame;
step S102, according to the comparison between the predicted data of each frame and the original data of the audio/video stream, respectively obtaining the accuracy of the predicted related data and the accuracy of the predicted non-related data of each frame;
step S103, judging the level of the current deep learning and neural network prediction according to the accuracy of the relevant data and the accuracy of the non-relevant data;
and step S104, transmitting the predicted level to the terminal of the staff in a binary mode, and displaying the predicted level in a lighting bar form at the terminal.
In other words, the method first predicts the compressed audio/video stream with a deep-learning neural network to obtain each frame's predicted data; obtains the relevant-data accuracy and the non-relevant-data accuracy by comparison with the original data of the audio/video stream; judges the current prediction level of the deep-learning neural network from those two accuracies; and transmits that level in binary form to the staff terminal, where it is displayed as lit bar cells. By replacing the traditional function-based prediction scheme with deep-learning neural-network prediction of the compressed frames, prediction efficiency can be improved.
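Tying steps S101 to S104 together, a minimal end-to-end sketch might look as follows; it reuses the hypothetical helpers frame_accuracies, level_payload, and lit_cells from the sketches above, and predictor stands in for the deep-learning neural network, which the application does not further specify:

    def process_stream(compressed_frames, original_frames,
                       predictor, is_feature, terminal_cells=10):
        # S101: predict each frame's data from the compressed stream
        predicted = [predictor(frame) for frame in compressed_frames]
        # S102: per-frame relevant / non-relevant accuracies vs. the original
        accs = [frame_accuracies(p, o, is_feature)
                for p, o in zip(predicted, original_frames)]
        # S103: judge the prediction level and encode it as a binary payload
        payload = level_payload(accs)
        # S104: number of cells to light on the terminal's vertical bar
        return lit_cells(payload, K=terminal_cells)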
An embodiment of the present application provides a possible implementation in which the bar display is a vertical bar on the terminal consisting of multiple rows in a single column; each row is an independent cell, and each cell's on/off state can be controlled individually.
An embodiment of the present application provides a possible implementation of step S102: the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data of each frame are obtained by comparing each frame's predicted data with the original data of the audio/video stream, specifically using the following formulas:
[Formula for L(i): present only as image BDA0003400471790000071 in the original]
[Formula for F(i): present only as image BDA0003400471790000072 in the original]
where L(i) denotes the relevant-data accuracy of the i-th frame predicted by the deep-learning neural network, and F(i) denotes the non-relevant-data accuracy of the i-th frame so predicted; if the condition shown in image BDA0003400471790000073 holds, then L(i) = 1, and if the condition shown in image BDA0003400471790000074 holds, then F(i) = 1; D_i(a) denotes the a-th bit of the i-th frame of predicted data in binary form; D_{i,0}(a) denotes the a-th bit of the i-th frame of the original audio/video stream data in binary form; G_i(a) denotes a feature-detection function, with G_i(a) = 1 if the a-th bit of the i-th frame of the original data in binary form is a feature bit reflecting a characteristic value of the audio/video stream, and G_i(a) = 0 otherwise; m_i denotes the number of bits in the i-th frame of the original data in binary form; |·| denotes the absolute value; and [·]_10 denotes conversion of the bracketed value to decimal.
With this embodiment, each frame's predicted data is compared with the original data to obtain the relevant-data accuracy and the non-relevant-data accuracy separately; splitting the accuracy into these two parts allows the deep-learning neural-network algorithm to be analyzed from both sides.
An embodiment of the present application provides a possible implementation of steps S103 and S104: the current prediction level of the deep-learning neural network is judged from the relevant-data accuracy and the non-relevant-data accuracy, and the predicted level is transmitted in binary form to the staff member's terminal; specifically, the binary-form data transmitted to the terminal can be obtained from the two accuracies using the following formula:
[Formula for (C)_2: present only as image BDA0003400471790000081 in the original]
where (C)_2 denotes the binary-form data transmitted to the staff terminal; N denotes the total number of frames in the audio/video stream; ∧ denotes logical AND; and (·)_2 denotes that the quantity in parentheses is data in binary form.
With this embodiment, the data transmitted to the staff member's terminal is derived from the relevant-data accuracy and the non-relevant-data accuracy and sent in binary form; binary transmission is fast and convenient, so encoding the two accuracy levels in binary lets them be transmitted efficiently.
An embodiment of the present application provides a possible implementation of step S104: the predicted level is transmitted in binary form to the staff member's terminal and displayed there as lit bar cells; specifically, the number of independent cells lit on the vertical bar can be controlled according to the binary data received by the terminal using the following formula:
[Formula for k: present only as image BDA0003400471790000082 in the original]
where k denotes the number of independent cells to be lit on the vertical bar, and K denotes the total number of independent cells on the vertical bar.
In this embodiment, the number of cells computed above is lit on the vertical bar from bottom to top, i.e. each lit cell changes from unfilled to filled white. A staff member can then read the current prediction level of the deep-learning neural network from the number of lit cells on the terminal, and optimize or improve the deep-learning neural-network algorithm accordingly, so that the level rises and the algorithm becomes more complete.
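As a toy illustration of this bottom-up fill, with ASCII cells standing in for the terminal's white-filled bars:

    def render_bar(k: int, K: int) -> str:
        """Render a K-cell vertical bar, bottom k cells lit; the first
        line of the returned string is the top of the bar."""
        return "\n".join("[#]" if row < k else "[ ]"
                         for row in range(K - 1, -1, -1))

    print(render_bar(3, 10))  # lights the bottom 3 of 10 cells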
It should be noted that, in practical applications, all of the possible implementations described above may be freely combined to form embodiments of the present application; details are not repeated here.
Based on the same inventive concept, the embodiment of the application further provides an audio and video processing device based on deep learning.
Fig. 2 shows a block diagram of an audio-video processing device based on deep learning according to an embodiment of the present application. As shown in fig. 2, the deep learning based audio-video processing device may include a prediction module 210, a comparison module 220, a judgment module 230, and a transmission module 240.
The prediction module 210 is configured to predict the compressed audio/video stream with a deep-learning neural network to obtain predicted data for each frame;
the comparison module 220 is configured to compare each frame's predicted data with the original data of the audio/video stream to obtain, respectively, the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data;
the judging module 230 is configured to judge the current prediction level of the deep-learning neural network according to the relevant-data accuracy and the non-relevant-data accuracy;
and the transmission module 240 is configured to transmit the predicted level in binary form to a staff member's terminal, where it is displayed as lit bar cells.
An embodiment of the present application provides a possible implementation in which the bar display is a vertical bar on the terminal consisting of multiple rows in a single column; each row is an independent cell, and each cell's on/off state can be controlled individually.
An embodiment of the present application provides a possible implementation in which the comparison module 220 shown in fig. 2 is further configured to:
obtain the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data of each frame by comparing each frame's predicted data with the original data of the audio/video stream using the following formulas:
[Formula for L(i): present only as image BDA0003400471790000091 in the original]
[Formula for F(i): present only as image BDA0003400471790000092 in the original]
where L(i) denotes the relevant-data accuracy of the i-th frame predicted by the deep-learning neural network, and F(i) denotes the non-relevant-data accuracy of the i-th frame so predicted; if the condition shown in image BDA0003400471790000093 holds, then L(i) = 1, and if the condition shown in image BDA0003400471790000094 holds, then F(i) = 1; D_i(a) denotes the a-th bit of the i-th frame of predicted data in binary form; D_{i,0}(a) denotes the a-th bit of the i-th frame of the original audio/video stream data in binary form; G_i(a) denotes a feature-detection function, with G_i(a) = 1 if the a-th bit of the i-th frame of the original data in binary form is a feature bit reflecting a characteristic value of the audio/video stream, and G_i(a) = 0 otherwise; m_i denotes the number of bits in the i-th frame of the original data in binary form; |·| denotes the absolute value; and [·]_10 denotes conversion of the bracketed value to decimal.
An embodiment of the present application provides a possible implementation in which the transmission module 240 shown in fig. 2 is further configured to:
obtain the binary-form data transmitted to the staff terminal from the relevant-data accuracy and the non-relevant-data accuracy using the following formula:
[Formula for (C)_2: present only as image BDA0003400471790000101 in the original]
where (C)_2 denotes the binary-form data transmitted to the staff terminal; N denotes the total number of frames in the audio/video stream; ∧ denotes logical AND; and (·)_2 denotes that the quantity in parentheses is data in binary form.
An embodiment of the present application provides a possible implementation in which, as shown in fig. 3, the device of fig. 2 may further include:
a control module 310, configured to control the lighting of the independent cells on the vertical bar according to the binary data received by the terminal using the following formula:
[Formula for k: present only as image BDA0003400471790000102 in the original]
where k denotes the number of independent cells to be lit on the vertical bar, and K denotes the total number of independent cells on the vertical bar.
With this embodiment, the independent cells on the vertical bar are lit under control of the binary data received by the terminal, so that a staff member can read the current prediction level of the deep-learning neural network from how many cells are lit, and then optimize or improve the deep-learning neural-network algorithm so that the level rises and the algorithm becomes more complete.
With the deep learning-based audio and video processing device of the embodiments of the application, the compressed audio/video stream is first predicted with a deep-learning neural network to obtain each frame's predicted data; each frame's predicted data is compared with the original data of the audio/video stream to obtain, respectively, the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data; the current prediction level of the deep-learning neural network is then judged from those two accuracies; and the predicted level is transmitted in binary form to a staff member's terminal, where it is displayed as lit bar cells. By replacing the traditional function-based prediction scheme with deep-learning neural-network prediction of the compressed audio/video frames, prediction efficiency can be improved.
It can be clearly understood by those skilled in the art that the specific working processes of the system, the apparatus, and the module described above may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, the detailed description is omitted here.
Those of ordinary skill in the art will understand that: the technical solution of the present application may be essentially or wholly or partially embodied in the form of a software product, where the computer software product is stored in a storage medium and includes program instructions for enabling an electronic device (e.g., a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application when the program instructions are executed. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Alternatively, all or part of the steps of implementing the foregoing method embodiments may be implemented by hardware (an electronic device such as a personal computer, a server, or a network device) associated with program instructions, which may be stored in a computer-readable storage medium, and when the program instructions are executed by a processor of the electronic device, the electronic device executes all or part of the steps of the method described in the embodiments of the present application.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be equivalently replaced within the spirit and principle of the present application; such modifications or substitutions do not depart from the scope of the present application.

Claims (8)

1. A deep learning-based audio and video processing method, comprising the following steps:
predicting the compressed audio/video stream with a deep-learning neural network to obtain predicted data for each frame;
comparing each frame's predicted data with the original data of the audio/video stream to obtain, respectively, the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data;
judging the current prediction level of the deep-learning neural network according to the relevant-data accuracy and the non-relevant-data accuracy;
transmitting the predicted level in binary form to a staff member's terminal, where it is displayed as lit bar cells;
the method comprises the following steps of comparing predicted data of each frame with original data of audio and video streams by using the following formula to respectively obtain the accuracy of predicted related data and the accuracy of predicted non-related data of each frame:
[Formula for L(i): present only as image FDA0003669336130000011 in the original]
[Formula for F(i): present only as image FDA0003669336130000012 in the original]
where L(i) denotes the relevant-data accuracy of the i-th frame predicted by the deep-learning neural network, and F(i) denotes the non-relevant-data accuracy of the i-th frame so predicted; if the condition shown in image FDA0003669336130000013 holds, then L(i) = 1, and if the condition shown in image FDA0003669336130000014 holds, then F(i) = 1; D_i(a) denotes the a-th bit of the i-th frame of predicted data in binary form; D_{i,0}(a) denotes the a-th bit of the i-th frame of the original audio/video stream data in binary form; G_i(a) denotes a feature-detection function, with G_i(a) = 1 if the a-th bit of the i-th frame of the original data in binary form is a feature bit reflecting a characteristic value of the audio/video stream, and G_i(a) = 0 otherwise; m_i denotes the number of bits in the i-th frame of the original data in binary form; |·| denotes the absolute value; and [·]_10 denotes conversion of the bracketed value to decimal.
2. The deep learning-based audio and video processing method according to claim 1, wherein the bar display is a vertical bar on the terminal consisting of multiple rows in a single column, each row being an independent cell whose on/off state can be controlled individually.
3. The deep learning-based audio and video processing method according to claim 2, wherein the binary-form data transmitted to the staff terminal is obtained from the relevant-data accuracy and the non-relevant-data accuracy using the following formula:
[Formula for (C)_2: present only as image FDA0003669336130000021 in the original]
where (C)_2 denotes the binary-form data transmitted to the staff terminal; N denotes the total number of frames in the audio/video stream; ∧ denotes logical AND; and (·)_2 denotes that the quantity in parentheses is data in binary form.
4. The deep learning-based audio and video processing method according to claim 3, wherein the independent cells lit on the vertical bar are controlled according to the binary data received by the terminal using the following formula:
[Formula for k: present only as image FDA0003669336130000022 in the original]
where k denotes the number of independent cells to be lit on the vertical bar, and K denotes the total number of independent cells on the vertical bar.
5. A deep learning-based audio and video processing device, comprising:
a prediction module, configured to predict the compressed audio/video stream with a deep-learning neural network to obtain predicted data for each frame;
a comparison module, configured to compare each frame's predicted data with the original data of the audio/video stream to obtain, respectively, the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data;
a judging module, configured to judge the current prediction level of the deep-learning neural network according to the relevant-data accuracy and the non-relevant-data accuracy;
and a transmission module, configured to transmit the predicted level in binary form to a staff member's terminal, where it is displayed as lit bar cells;
wherein the comparison module is further configured to:
obtain the accuracy of the predicted relevant data and the accuracy of the predicted non-relevant data of each frame by comparing each frame's predicted data with the original data of the audio/video stream using the following formulas:
[Formula for L(i): present only as image FDA0003669336130000031 in the original]
[Formula for F(i): present only as image FDA0003669336130000032 in the original]
where L(i) denotes the relevant-data accuracy of the i-th frame predicted by the deep-learning neural network, and F(i) denotes the non-relevant-data accuracy of the i-th frame so predicted; if the condition shown in image FDA0003669336130000033 holds, then L(i) = 1, and if the condition shown in image FDA0003669336130000034 holds, then F(i) = 1; D_i(a) denotes the a-th bit of the i-th frame of predicted data in binary form; D_{i,0}(a) denotes the a-th bit of the i-th frame of the original audio/video stream data in binary form; G_i(a) denotes a feature-detection function, with G_i(a) = 1 if the a-th bit of the i-th frame of the original data in binary form is a feature bit reflecting a characteristic value of the audio/video stream, and G_i(a) = 0 otherwise; m_i denotes the number of bits in the i-th frame of the original data in binary form; |·| denotes the absolute value; and [·]_10 denotes conversion of the bracketed value to decimal.
6. The deep learning-based audio and video processing device according to claim 5, wherein the bar display is a vertical bar on the terminal consisting of multiple rows in a single column, each row being an independent cell whose on/off state can be controlled individually.
7. The deep learning-based audio and video processing device according to claim 6, wherein the transmission module is further configured to:
obtain the binary-form data transmitted to the staff terminal from the relevant-data accuracy and the non-relevant-data accuracy using the following formula:
[Formula for (C)_2: present only as image FDA0003669336130000041 in the original]
where (C)_2 denotes the binary-form data transmitted to the staff terminal; N denotes the total number of frames in the audio/video stream; ∧ denotes logical AND; and (·)_2 denotes that the quantity in parentheses is data in binary form.
8. The deep learning-based audio and video processing device according to claim 7, further comprising:
a control module, configured to control the lighting of the independent cells on the vertical bar according to the binary data received by the terminal using the following formula:
[Formula for k: present only as image FDA0003669336130000042 in the original]
where k denotes the number of independent cells to be lit on the vertical bar, and K denotes the total number of independent cells on the vertical bar.

Priority Applications (1)

CN202111495106.XA (priority date 2021-12-09, filing date 2021-12-09): Deep learning-based audio and video processing method and device

Publications (2)

CN114363631A, published 2022-04-15
CN114363631B, published 2022-08-05

Family

ID: 81098112

Citations (2)

* Cited by examiner, † Cited by third party

CN108805257A * (北京大学 (Peking University); priority 2018-04-26, published 2018-11-13): A neural network quantization method based on parameter norm
WO2019009452A1 * (삼성전자 (Samsung Electronics); priority 2017-07-06, published 2019-01-10): Method and device for encoding or decoding image

Family Cites Families (7)

* Cited by examiner, † Cited by third party

KR102262554B1 * (한국전자통신연구원 (Electronics and Telecommunications Research Institute); priority 2017-12-14, published 2021-06-09): Method and apparatus for encoding and decoding image using prediction network
JP7277699B2 * (日本電信電話株式会社 (Nippon Telegraph and Telephone); priority 2018-12-05, published 2023-05-19): Image processing device, learning device, image processing method, learning method, and program
KR20200084516A * (삼성전자주식회사 (Samsung Electronics); priority 2019-01-03, published 2020-07-13): Display apparatus, apparatus for providing image and method of controlling the same
CN112188202A * (西安电子科技大学 (Xidian University); priority 2019-07-01, published 2021-01-05): Self-learning video coding and decoding technology based on neural network
CN110557633B * (深圳大学 (Shenzhen University); priority 2019-08-28, published 2021-06-29): Compression transmission method, system and computer readable storage medium for image data
EP3846477B1 * (Isize Limited; priority 2020-01-05, published 2023-05-03): Preprocessing image data
EP4107947A4 * (Nokia Technologies Oy; priority 2020-02-21, published 2024-03-06): A method, an apparatus and a computer program product for video encoding and video decoding


Also Published As

CN114363631A, published 2022-04-15


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
GR01 Patent grant