CN108924631B - Video generation method based on audio and video shunt storage - Google Patents

Info

Publication number
CN108924631B
CN108924631B (application CN201810675756.4A)
Authority
CN
China
Prior art keywords
video
audio
stream
frame
audio stream
Prior art date
Legal status
Active
Application number
CN201810675756.4A
Other languages
Chinese (zh)
Other versions
CN108924631A (en)
Inventor
吴宣辉
胡松涛
卢锡芹
Current Assignee
Hangzhou Xujian Science And Technology Co ltd
Original Assignee
Hangzhou Xujian Science And Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Xujian Science And Technology Co ltd filed Critical Hangzhou Xujian Science And Technology Co ltd
Priority to CN201810675756.4A priority Critical patent/CN108924631B/en
Publication of CN108924631A publication Critical patent/CN108924631A/en
Application granted granted Critical
Publication of CN108924631B publication Critical patent/CN108924631B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H04N21/4334 Recording operations (content storage, e.g. storage operation in response to a pause request)
    • H04N21/4307 Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/4331 Caching operations, e.g. of an advertisement for later insertion during playback
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/443 OS processes, e.g. booting an STB, implementing a Java virtual machine in an STB or power management in an STB
    • H04N21/8547 Content authoring involving timestamps for synchronizing content
    • H04N5/92 Transformation of the television signal for recording, e.g. modulation, frequency changing; inverse transformation for playback
    • H04N5/9202 Transformation for recording involving the multiplexing of an additional signal and the video signal, the additional signal being a sound signal
    • H04N7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals

Abstract

The invention discloses a video generation method based on audio and video shunt storage, comprising the following steps. Step 1: after receiving audio stream A, video stream A, audio stream B and video stream B, cache them. Step 2: select a suitable time to synthesize the video. Step 3: select audio/video stream A as the reference basis for synthesis, mix audio stream A with audio stream B, and extract the resolution of video stream A so that video stream B is scaled to the parameters of video stream A. When audio stream A has sound, the picture of video A is used; when audio stream B has sound, the picture of video B is used; when silence is detected, or both audio streams have sound, the picture of the reference video A is used. Synchronization processing is then performed: timestamps are calculated and adjusted to keep audio and video in sync, and the result is finally merged into a video file. The video presentation effect of the technical scheme of the invention is novel: the picture of the party who has sound is displayed, similar to voice-following.

Description

Video generation method based on audio and video shunt storage
Technical Field
The invention relates to the technical field of computer information data processing, in particular to a video generation method based on audio and video shunt storage.
Background
The existing method of generating video from a two-party call works as follows: as shown in fig. 1, two video files are generated; the sound is mixed, but the video is not, i.e. each file is a single-picture video. The current recording effect is therefore that both voices can be heard but only one party's image can be seen (synchronized with the local side's voice), so one of the voices has no corresponding video picture. To solve this problem, an upgraded scheme is needed that supports the mixing of video pictures and achieves the best viewing experience, i.e. every voice is matched by a synchronized picture.
Disclosure of Invention
The invention aims to provide a video generation method based on audio and video shunt storage: data are held in temporary storage blocks, and an idle CPU is computed and selected so that the overall performance of the server is not affected; with one path used as the reference basis, only the other path needs video transcoding, which reduces performance overhead and efficiently synthesizes the video file, thereby solving the problems set forth in the background art described above.
In order to achieve the purpose, the invention provides the following technical scheme:
a video generation method based on audio and video shunt storage comprises the following steps:
step (1), buffering and storing the audio and video
After receiving audio stream A and video stream A as well as audio stream B and video stream B, cache them and store them into storage blocks. This avoids continuously requesting CPU resource scheduling when CPU resources are scarce, which would overload the CPU and hinder the subsequent video synthesis.
Step (2) selecting the time needing to be synthesized to synthesize the video
Calculating and selecting an idle CPU, and starting the synthesis of the video file;
step (3) audio and video coding and decoding processing
Select audio/video stream A as the reference basis for synthesis, mix audio stream A with audio stream B, and extract the resolution of video stream A so that video stream B is scaled to the parameters of video stream A, i.e. the same resolution, bit rate and frame rate. When audio stream A has sound, the picture of video A is used; when audio stream B has sound, the picture of video B is used; when silence is detected, or both audio streams have sound, the picture of the reference video A is used. Synchronization processing is then performed: timestamps are calculated and adjusted to keep audio and video in sync, and the result is finally merged into the video file.
Preferably, in the step (1), the audio and video are buffered according to the following specific flow:
(1.1) pre-establish the storage blocks, each 256 MB in size; the number created is determined by actual requirements, and each storage block has its own number;
(1.2) divide the storage blocks for use: for example, the storage blocks for audio/video A are numbered 1A to 100A and those for audio/video B are numbered 1B to 100B; the storage blocks are reused cyclically;
(1.3) put the packetized audio and video data into the storage blocks; audio and video data are distinguished by different identifiers, e.g. the audio identifier $a and the video identifier $v;
(1.3.1) store video frame data into a storage block first: write the SPS, PPS and I frame of the video before storing audio data, to prevent the synthesized video from starting with sound but no picture;
(1.3.2) for the audio stream, extract the timestamp of each RTP packet and put each audio RTP packet into a storage block in the record format "4-byte packet length + 4-byte timestamp + packet body", where the packet body is a complete RTP packet; this "4-byte packet length + 4-byte timestamp + packet body" layout is a private, custom packet format;
(1.3.3) for the video stream, the RTP packets must be de-packetized and assembled into complete frame data; the timestamp of each frame is recorded, and each frame's data is put into a storage block in the same "4-byte packet length + 4-byte timestamp + packet body" format, where the packet body is the pure H264 data extracted from the RTP packets;
that is, the video packet body is the extracted pure H264 data, while the audio packet body is a complete RTP packet.
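The private record layout described in (1.3.2)-(1.3.3) can be sketched as follows. This is a minimal illustration in Python, assuming big-endian 32-bit fields (the patent does not specify a byte order); the helper names are mine, not the patent's:

```python
import struct

RECORD_HEADER = struct.Struct(">II")  # 4-byte packet length + 4-byte timestamp

def pack_record(timestamp: int, body: bytes) -> bytes:
    """Serialize one record: 4-byte length + 4-byte timestamp + packet body."""
    return RECORD_HEADER.pack(len(body), timestamp) + body

def unpack_record(buf: bytes, offset: int = 0):
    """Read one record at `offset`; return (timestamp, body, next_offset)."""
    length, timestamp = RECORD_HEADER.unpack_from(buf, offset)
    start = offset + RECORD_HEADER.size
    return timestamp, buf[start:start + length], start + length

# Two consecutive records in one storage block, e.g. audio RTP packets.
block = pack_record(0, b"pkt-0") + pack_record(160, b"pkt-1")
ts0, body0, nxt = unpack_record(block)
ts1, body1, _ = unpack_record(block, nxt)
```

For the audio stream the body would be a whole RTP packet; for the video stream it would be the assembled H264 frame data, as stated above.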
Preferably, the CPU performance calculation in step (2) is performed according to the following process:
(2.1) CPU performance is typically assessed on 3 points: the run queue, CPU utilization and context switching;
(2.2) the run queue should preferably not exceed 3 per CPU (e.g. 6 on a dual-core CPU); if the queue stays above 3 for a long time, processes cannot get CPU time promptly when they run, and a CPU upgrade may need to be considered. In addition, under full load the preferred CPU utilization is: user space kept at 65%-70%, system space at 30%, and idle at 0%-5%;
(2.3) check the overall system running state and CPU utilization with the top command, and check the run-queue length, average load, process-creation average and number of context switches with sar;
(2.4) when the CPU occupancy is no more than 50% (i.e. the CPU is sufficiently idle), start the synthesis of the video file.
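The occupancy check of step (2.4) can be approximated as below. This is a Linux-only sketch that samples /proc/stat directly rather than invoking the top and sar commands named above; the 50% threshold comes from step (2.4), and the function names are mine:

```python
import time

def cpu_busy_fraction(interval: float = 0.1) -> float:
    """Sample /proc/stat twice (Linux only) and return the fraction of
    CPU time spent busy over the interval."""
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]  # idle + iowait ticks
        return idle, sum(fields)
    idle0, total0 = snapshot()
    time.sleep(interval)
    idle1, total1 = snapshot()
    delta = (total1 - total0) or 1
    return 1.0 - (idle1 - idle0) / delta

def ok_to_synthesize() -> bool:
    """Start video file synthesis only when CPU occupancy <= 50%."""
    return cpu_busy_fraction() <= 0.5
```

In production one would more likely parse the output of sar or read per-core statistics, but the threshold logic is the same.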
Preferably, the audio/video reference basis selection policy in step (3) is specifically as follows:
(3.1) calculate the energy values of audio stream A and audio stream B, and select the one with the higher total energy as the reference basis; suppose audio stream A is selected;
(3.2) record the start time and duration of each voiced segment of audio stream A and audio stream B;
(3.3) decode audio stream A and audio stream B and mix them additively; if the audio is encoded as G.711 A-law or μ-law, decode it to PCM first and then perform additive mixing;
(3.4) with video stream A as the reference, analyse its resolution, re-decode video stream B, scale it by processing the YUV data, and re-encode it so that its resolution finally matches that of video stream A;
(3.5) audio/video synthesis: along the time axis, when audio stream A is computed to have sound, take the I frame and subsequent frames of video stream A at the corresponding time point and merge them into the video file; when audio stream B is computed to have sound, take the I frame and subsequent frames of video stream B at the corresponding time point; when silence is computed, or both audio streams have sound, take the I frame and subsequent frames of the reference video stream A at the corresponding time point.
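Steps (3.1) and (3.3) can be sketched as follows for 16-bit signed PCM, assuming both streams have already been decoded to PCM at the same sample rate; the names are illustrative, not from the patent:

```python
def energy(pcm):
    """Total energy of a PCM segment: the sum of squared samples."""
    return sum(s * s for s in pcm)

def pick_reference(pcm_a, pcm_b):
    """Step (3.1): the stream with the higher total energy is the reference."""
    return "A" if energy(pcm_a) >= energy(pcm_b) else "B"

def mix_additive(pcm_a, pcm_b):
    """Step (3.3): additive mixing, clipped to the 16-bit signed range."""
    return [max(-32768, min(32767, a + b)) for a, b in zip(pcm_a, pcm_b)]
```

Clipping (or attenuation) is needed because the sum of two full-scale samples overflows 16 bits.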
Preferably, the specific method of the synchronization process in the step (3) is as follows:
(4.1) let the base timestamp be the timestamp of the first frame;
(4.2) audio timestamp calculation: the relative audio timestamp Ta = (frame timestamp − base timestamp) ÷ (8000 ÷ 1000), where 8000 is the audio RTP clock rate, giving Ta in milliseconds;
(4.3) video timestamp calculation: the relative video timestamp Tv = (frame timestamp − base timestamp) ÷ (90000 ÷ 1000), where 90000 is the video RTP clock rate;
(4.4) to handle unstable packet sending (i.e. the audio/video device sending packets too slowly), compute the timestamp deviation between consecutive frames: deviation = (timestamp of the next frame − timestamp of the previous frame) ÷ (8000 ÷ 1000); if the deviation exceeds 1000 (1 second), the packets are being sent too slowly and the base timestamp must be adjusted: base timestamp = base timestamp + (timestamp of the next frame − timestamp of the previous frame); the relative timestamps Ta and Tv are then recomputed, thereby controlling audio/video synchronization;
(4.5) keep the difference between the relative timestamps Ta and Tv of adjacent received RTP audio and video within 1000 (1 second), so that audio and video stay synchronized.
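The timestamp arithmetic in (4.1)-(4.4) can be written out as below, assuming the standard RTP clock rates the formulas imply (8000 Hz for G.711 audio, 90000 Hz for video); this is a sketch of the stated rules, not the patent's implementation:

```python
AUDIO_CLOCK = 8000   # G.711 RTP clock, ticks per second
VIDEO_CLOCK = 90000  # video RTP clock, ticks per second

def relative_ms(ts: int, base_ts: int, clock: int) -> float:
    """Relative timestamp in ms: (ts - base) / (clock / 1000)."""
    return (ts - base_ts) / (clock / 1000)

def adjust_base(base_ts: int, prev_ts: int, next_ts: int,
                clock: int = AUDIO_CLOCK) -> int:
    """Step (4.4): if two adjacent frames are more than 1000 ms apart
    (packets sent too slowly), shift the base timestamp by the gap."""
    gap_ms = (next_ts - prev_ts) / (clock / 1000)
    if gap_ms > 1000:
        base_ts += next_ts - prev_ts
    return base_ts
```

After any base adjustment, Ta and Tv are recomputed with the new base, which keeps the relative audio and video timestamps within the 1-second bound of step (4.5).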
Name interpretation
Mixing: mixing the multiple audio streams into one audio;
resolution: the number of pixel points contained per unit inch;
CPU: the central processing unit, a very-large-scale integrated circuit that is the computation core (Core) and control unit (Control Unit) of a computer;
YUV: a colour encoding method used by European television systems; it is the colour space adopted by the PAL and SECAM analogue colour television standards;
SPS: Sequence Parameter Set, which holds a set of global parameters for a coded video sequence (Coded Video Sequence);
PPS: Picture Parameter Set; the PPS is usually stored, together with the SPS, in the header of a video file in a container-dependent format;
I frame (intra-coded frame): also known as an intra picture; the I frame is usually the first frame of each GOP (group of pictures, a video compression technique used by MPEG); it is moderately compressed and serves as a reference point for random access, and can be treated as a complete picture on its own;
RTP (Real-time Transport Protocol): the real-time transport protocol is a network transport protocol, which was published by the multimedia transport working group of IETF in RFC 1889 in 1996 and later updated in RFC 3550;
RTP packet: defining a packet format of RTP transmission;
time stamping: a complete, verifiable piece of data, usually a sequence of characters, that indicates that a piece of data existed before a particular time, uniquely identifies the time of the moment.
Compared with the prior art, the invention has the beneficial effects that:
1. compared with a mode of directly generating a video file, the technical scheme of the invention increases the use of the storage block, temporarily stores audio and video data and is beneficial to generating a video at a required moment.
2. The use of the storage block can select the work of starting the generation of the video file when the CPU is idle, so that the use of the CPU is reasonably distributed, and other tasks needing CPU resources are not influenced.
3. Compared with a screen-mixing approach, the technical scheme of the invention takes the audio path with the higher energy value as the base reference: a higher energy value means its sound occupies more of the time, so less of the other path's video has to be decoded and re-encoded, using fewer CPU resources and synthesizing the video faster.
4. The video presentation effect of the technical scheme of the invention is novel, and the picture with sound is displayed, which is similar to voice following.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram illustrating a conventional method for generating a double-talk video;
fig. 2 is a schematic structural diagram of a video generation method based on audio and video shunt storage according to the present invention;
the figures in the drawings are marked with numbers: the device comprises a storage block (1), an idle CPU (2), audio and video coding and decoding processing (3), synchronous processing (4) and a video file (5).
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 2: the invention provides a specific embodiment of a video generation method based on audio and video shunt storage, which comprises the following steps:
step (1), buffering and storing the audio and video
After receiving audio stream A and video stream A as well as audio stream B and video stream B, cache them and store them into the storage block (1). This avoids continuously requesting CPU resource scheduling when CPU resources are scarce, which would overload the CPU and hinder the subsequent video synthesis.
Step (2) selecting the time needing to be synthesized to synthesize the video
Calculating and selecting an idle CPU (2) and starting the synthesis of the video file;
step (3) audio and video coding and decoding processing (3)
Select audio/video stream A as the reference basis for synthesis, mix audio stream A with audio stream B, and extract the resolution of video stream A so that video stream B is scaled to the parameters of video stream A, i.e. the same resolution, bit rate and frame rate. When audio stream A has sound, the picture of video A is used; when audio stream B has sound, the picture of video B is used; when silence is detected, or both audio streams have sound, the picture of the reference video A is used. Synchronization processing (4) is then performed: timestamps are calculated and adjusted to keep audio and video in sync, and the result is finally merged into the video file (5).
In the step (1), the audio and video are buffered and stored according to the following specific flow:
(1.1) pre-establish the storage blocks (1), each 256 MB in size; the number created is determined by actual requirements, and each storage block has its own number;
(1.2) divide the storage blocks (1) for use: for example, the storage blocks for audio/video A are numbered 1A to 100A and those for audio/video B are numbered 1B to 100B; the storage blocks (1) are reused cyclically;
(1.3) put the packetized audio and video data into the storage blocks (1); audio and video data are distinguished by different identifiers, e.g. the audio identifier $a and the video identifier $v;
(1.3.1) store video frame data into a storage block (1) first: write the SPS, PPS and I frame of the video before storing audio data, to prevent the synthesized video from starting with sound but no picture;
(1.3.2) for the audio stream, extract the timestamp of each RTP packet and put each audio RTP packet into the storage block (1) in the record format "4-byte packet length + 4-byte timestamp + packet body", where the packet body is a complete RTP packet; this "4-byte packet length + 4-byte timestamp + packet body" layout is a private, custom packet format;
(1.3.3) for the video stream, the RTP packets must be de-packetized and assembled into complete frame data; the timestamp of each frame is recorded, and each frame's data is put into a storage block (1) in the same "4-byte packet length + 4-byte timestamp + packet body" format, where the packet body is the pure H264 data extracted from the RTP packets;
that is, the video packet body is the extracted pure H264 data, while the audio packet body is a complete RTP packet.
The CPU performance calculation in the step (2) is specifically performed according to the following process:
(2.1) CPU performance is typically assessed on 3 points: the run queue, CPU utilization and context switching;
(2.2) the run queue should preferably not exceed 3 per CPU (e.g. 6 on a dual-core CPU); if the queue stays above 3 for a long time, processes cannot get CPU time promptly when they run, and a CPU upgrade may need to be considered. In addition, under full load the preferred CPU utilization is: user space kept at 65%-70%, system space at 30%, and idle at 0%-5%;
(2.3) check the overall system running state and CPU utilization with the top command, and check the run-queue length, average load, process-creation average and number of context switches with sar;
(2.4) when the CPU occupancy is no more than 50% (i.e. the CPU is sufficiently idle), start the synthesis of the video file.
The audio and video reference basis selection strategy in the step (3) is specifically as follows:
(3.1) calculate the energy values of audio stream A and audio stream B, and select the one with the higher total energy as the reference basis; suppose audio stream A is selected;
(3.2) record the start time and duration of each voiced segment of audio stream A and audio stream B;
(3.3) decode audio stream A and audio stream B and mix them additively; if the audio is encoded as G.711 A-law or μ-law, decode it to PCM first and then perform additive mixing;
(3.4) with video stream A as the reference, analyse its resolution, re-decode video stream B, scale it by processing the YUV data, and re-encode it so that its resolution finally matches that of video stream A;
(3.5) audio/video synthesis: along the time axis, when audio stream A is computed to have sound, take the I frame and subsequent frames of video stream A at the corresponding time point and merge them into the video file; when audio stream B is computed to have sound, take the I frame and subsequent frames of video stream B at the corresponding time point; when silence is computed, or both audio streams have sound, take the I frame and subsequent frames of the reference video stream A at the corresponding time point.
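The picture-selection rule of step (3.5) reduces to a small decision function, evaluated once per segment of the timeline and applied at I-frame boundaries; this is an illustrative sketch (the names are mine, not the patent's):

```python
def pick_picture(a_has_sound: bool, b_has_sound: bool) -> str:
    """Step (3.5): the speaking side supplies the picture; on silence or
    double-talk, fall back to the reference stream A."""
    if b_has_sound and not a_has_sound:
        return "B"
    return "A"  # A speaking, silence, or both speaking

# One decision per timeline segment: (A has sound, B has sound).
timeline = [(True, False), (False, True), (False, False), (True, True)]
pictures = [pick_picture(a, b) for a, b in timeline]
```

Switching only at an I frame ensures each merged segment starts with a decodable frame, which is why step (3.5) always takes "the I frame and subsequent frames".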
The specific method of the synchronization processing (4) in the step (3) is as follows:
(4.1) let the base timestamp be the timestamp of the first frame;
(4.2) audio timestamp calculation: the relative audio timestamp Ta = (frame timestamp − base timestamp) ÷ (8000 ÷ 1000), where 8000 is the audio RTP clock rate, giving Ta in milliseconds;
(4.3) video timestamp calculation: the relative video timestamp Tv = (frame timestamp − base timestamp) ÷ (90000 ÷ 1000), where 90000 is the video RTP clock rate;
(4.4) to handle unstable packet sending (i.e. the audio/video device sending packets too slowly), compute the timestamp deviation between consecutive frames: deviation = (timestamp of the next frame − timestamp of the previous frame) ÷ (8000 ÷ 1000); if the deviation exceeds 1000 (1 second), the packets are being sent too slowly and the base timestamp must be adjusted: base timestamp = base timestamp + (timestamp of the next frame − timestamp of the previous frame); the relative timestamps Ta and Tv are then recomputed, thereby controlling audio/video synchronization;
(4.5) keep the difference between the relative timestamps Ta and Tv of adjacent received RTP audio and video within 1000 (1 second), so that audio and video stay synchronized.
The invention has the beneficial effects that:
1. compared with a mode of directly generating a video file, the technical scheme of the invention increases the use of the storage block, temporarily stores audio and video data and is beneficial to generating a video at a required moment.
2. The use of the storage block can select the work of starting the generation of the video file when the CPU is idle, so that the use of the CPU is reasonably distributed, and other tasks needing CPU resources are not influenced.
3. Compared with a screen-mixing approach, the technical scheme of the invention takes the audio path with the higher energy value as the base reference: a higher energy value means its sound occupies more of the time, so less of the other path's video has to be decoded and re-encoded, using fewer CPU resources and synthesizing the video faster.
4. The video presentation effect of the technical scheme of the invention is novel, and the picture with sound is displayed, which is similar to voice following.
It is further noted that relational terms such as first and second may be used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Likewise, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising an … …" does not exclude the presence of further identical elements in the process, method, article, or apparatus that comprises it.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the others, and for the parts the embodiments have in common, reference may be made between them.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (1)

1. A video generation method based on audio and video shunt storage, characterized by comprising the following steps:
Step (1): buffering and storing the audio and video
Receiving an audio stream A, a video stream A, an audio stream B and a video stream B, buffering them, and storing them in storage blocks;
Step (2): selecting the moment at which composition is needed and composing the video
Determining that an idle CPU is available and starting the composition of the video file;
Step (3): audio and video encoding/decoding processing
Selecting audio stream A as the audio reference base and video stream A as the video reference base; mixing audio stream A with audio stream B; extracting the resolution of video stream A and scaling video stream B to that resolution; when audio stream A is sounding, using the picture of video A; when audio stream B is sounding, using the picture of video B; when both audio streams are detected to be sounding, or both are silent, using the picture of the reference base video A; then performing synchronization processing, calculating and adjusting the timestamps to keep the audio and video synchronized, and finally merging everything into one video file;
In step (1), the specific flow for buffering and storing the audio and video is:
(1.1) Creating the storage blocks in advance; each block is 256 MB in size, the number of blocks created is determined by actual demand, and each block has its own number;
(1.2) Partitioning the storage blocks for use and reusing them;
(1.3) Putting the packed audio and video data into the storage blocks, distinguishing audio data from video data by different identifiers;
(1.3.1) Storing video frame data first: writing the video's SPS, PPS and I-frame before storing any audio data, to avoid a composed recording that has sound but no picture;
(1.3.2) For the audio stream, extracting the timestamp of each RTP packet and writing each audio RTP packet into a storage block in the record format "4-byte packet length + 4-byte timestamp + packet body", where the packet body is a complete RTP packet;
(1.3.3) For the video stream, de-packetizing the RTP packets, assembling complete frame data, recording the timestamp of each frame, and writing each frame into a storage block in the record format "4-byte packet length + 4-byte timestamp + packet body", where the packet body is the pure H.264 data extracted from the RTP packets;
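The record layout of (1.3.2) and (1.3.3) can be sketched as below. This is illustrative Python: the function names are invented, and big-endian byte order is an assumption, since the patent does not specify endianness.

```python
import struct

def pack_record(timestamp, body):
    """Serialize one storage-block record in the patent's layout:
    4-byte packet length + 4-byte timestamp + packet body.
    Big-endian byte order is an assumption."""
    return struct.pack(">II", len(body), timestamp) + body

def unpack_records(buf):
    """Walk a storage block and yield (timestamp, body) pairs."""
    records, off = [], 0
    while off + 8 <= len(buf):
        length, ts = struct.unpack_from(">II", buf, off)
        off += 8
        records.append((ts, buf[off:off + length]))
        off += length
    return records
```

Because each record carries its own length prefix, audio RTP packets and assembled H.264 frames of arbitrary size can be appended to the same block and recovered sequentially.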
Step (2) specifically comprises:
(2.1) Checking the overall system running state and CPU utilization with the top command, and using the sar command to check the run-queue length, the load averages, the process-creation rate and the number of context switches;
(2.2) When the monitored CPU occupancy does not exceed 50%, starting the composition of the video files;
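As a sketch of the idle-CPU gate in (2.1)–(2.2): instead of parsing `top`/`sar` output as the patent describes, the same overall utilization figure can be sampled from `/proc/stat` on Linux. This is an assumption and a stand-in technique; the function names and the sampling interval are illustrative.

```python
import time

def cpu_usage_percent(interval=0.5):
    """Approximate overall CPU usage (%) by sampling /proc/stat twice
    (a Linux-only stand-in for parsing `top` output)."""
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]  # idle + iowait jiffies
        return idle, sum(fields)

    idle1, total1 = snapshot()
    time.sleep(interval)
    idle2, total2 = snapshot()
    dt = total2 - total1
    return 100.0 * (1 - (idle2 - idle1) / dt) if dt else 0.0

def should_start_composition(threshold=50.0):
    """Step (2.2): start composing only when CPU occupancy <= threshold."""
    return cpu_usage_percent() <= threshold
```

The 50% default threshold mirrors step (2.2); in practice the check would run periodically and trigger composition the first time it passes.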
Step (3) specifically comprises:
(3.1) Counting, for audio stream A and audio stream B, the start time and duration of every sounding segment;
(3.2) Decoding audio stream A and audio stream B, then summing the samples to mix them;
(3.3) Taking video stream A as the video reference base and analyzing its resolution; re-decoding video stream B, scaling it by processing the YUV data, and re-encoding it, so that its resolution finally matches that of video stream A;
(3.4) Composing the audio and video along the time axis: where audio stream A is computed to be sounding, selecting the I-frame of video stream A at that time point and the subsequent frames within the segment's duration and merging them into the video file; where audio stream B is computed to be sounding, doing the same with video stream B; where both streams are silent, or both audio stream A and audio stream B are sounding, selecting the I-frame of the reference base video stream A at that time point and the subsequent frames within the duration and merging them into the video file;
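The per-interval picture-selection rule of step (3.4) amounts to the following. This is an illustrative Python sketch; the interval representation and the one-second step are assumptions, and the function names are invented.

```python
def pick_source(a_sounding, b_sounding):
    """Step (3.4) rule: only A sounding -> picture A; only B sounding ->
    picture B; both sounding or both silent -> reference picture A."""
    if a_sounding and not b_sounding:
        return "A"
    if b_sounding and not a_sounding:
        return "B"
    return "A"  # both sounding, or both silent: fall back to the reference stream

def build_timeline(a_intervals, b_intervals, duration, step=1):
    """Walk the time axis and record which stream supplies the picture.
    Intervals are (start, length) pairs in seconds, as counted in step (3.1)."""
    def sounding(t, intervals):
        return any(s <= t < s + l for s, l in intervals)

    return [pick_source(sounding(t, a_intervals), sounding(t, b_intervals))
            for t in range(0, duration, step)]
```

At each switch point the composer would seek to the selected stream's I-frame at that time and copy frames until the segment ends, as the step describes.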
The specific method of the synchronization processing is:
(4.1) Letting the base timestamp be the timestamp of the first frame;
(4.2) Audio timestamp calculation: the relative timestamp of the audio Ta = (per-frame timestamp − base timestamp) ÷ (8000 ÷ 1000);
(4.3) Video timestamp calculation: the relative timestamp of the video Tv = (per-frame timestamp − base timestamp) ÷ (90000 ÷ 1000);
(4.4) Computing the timestamp deviation between consecutive audio/video frames as (next-frame timestamp − previous-frame timestamp) ÷ (8000 ÷ 1000); if the deviation exceeds 1000, the audio/video device is sending packets too slowly and the base timestamp is adjusted: base timestamp = base timestamp + (next-frame timestamp − previous-frame timestamp); the relative timestamps Ta and Tv are then recomputed to keep the audio and video synchronized;
(4.5) Keeping the difference between the relative timestamps Ta and Tv of adjacent audio and video RTP packets within 1000, so that audio and video are synchronized.
CN201810675756.4A 2018-06-27 2018-06-27 Video generation method based on audio and video shunt storage Active CN108924631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810675756.4A CN108924631B (en) 2018-06-27 2018-06-27 Video generation method based on audio and video shunt storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810675756.4A CN108924631B (en) 2018-06-27 2018-06-27 Video generation method based on audio and video shunt storage

Publications (2)

Publication Number Publication Date
CN108924631A CN108924631A (en) 2018-11-30
CN108924631B true CN108924631B (en) 2021-07-06

Family

ID=64421569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810675756.4A Active CN108924631B (en) 2018-06-27 2018-06-27 Video generation method based on audio and video shunt storage

Country Status (1)

Country Link
CN (1) CN108924631B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110381350B (en) * 2019-06-25 2021-07-30 杭州叙简科技股份有限公司 Multi-channel video playback synchronization system based on webrtc and processing method thereof
CN113784073A (en) * 2021-09-28 2021-12-10 深圳万兴软件有限公司 Method, device and related medium for synchronizing sound and picture of sound recording and video recording
CN114143491A (en) * 2021-11-17 2022-03-04 深蓝感知(杭州)物联科技有限公司 Video fragment generation method for single-editing 5G recorder
CN114512139B (en) * 2022-04-18 2022-09-20 杭州星犀科技有限公司 Processing method and system for multi-channel audio mixing, mixing processor and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN103024517A (en) * 2012-12-17 2013-04-03 四川九洲电器集团有限责任公司 Method for synchronously playing streaming media audios and videos based on parallel processing
CN103338386A (en) * 2013-07-10 2013-10-02 航天恒星科技有限公司 Audio and video synchronization method based on simplified timestamps
CN106254805A (en) * 2016-07-28 2016-12-21 浙江大华技术股份有限公司 Storage method, device and the videocorder of a kind of Video data

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
KR100810269B1 (en) * 2006-04-18 2008-03-06 삼성전자주식회사 Wireless terminal and a method for performing video communication service using that
CN101951492A (en) * 2010-09-15 2011-01-19 中兴通讯股份有限公司 Method and device for recording videos in video call
CN102447875A (en) * 2010-09-30 2012-05-09 中兴通讯股份有限公司 Method and system for centralized monitoring of video session terminals and relevant devices
US9094569B1 (en) * 2012-02-01 2015-07-28 Gary James Humphries Remote web-based visitation system for prisons
CN108024085A (en) * 2016-10-31 2018-05-11 联芯科技有限公司 The method for recording and equipment of audio and video
CN106507027A (en) * 2016-11-28 2017-03-15 北京小米移动软件有限公司 Video calling recording method and device



Similar Documents

Publication Publication Date Title
CN108924631B (en) Video generation method based on audio and video shunt storage
KR101234146B1 (en) Methods, apparatuses, and computer program products for adaptive synchronized decoding of digital video
KR101008764B1 (en) Method and system for improving interactive media response systems using visual cues
CN105791939B (en) The synchronous method and device of audio & video
JP2002141945A (en) Data transmission system and data transmission method, and program storage medium
JPH11225168A (en) Video/audio transmitter, video/audio receiver, data processing unit, data processing method, waveform data transmission method, system, waveform data reception method, system, and moving image transmission method and system
JP2003114845A (en) Media conversion method and media conversion device
EP1938498A2 (en) Method for signaling a device to perform no synchronization or include a syncronization delay on multimedia streams
JP2006140984A (en) Transmitting device with discard control of specific media data, and transmission program
JP2004509491A (en) Synchronization of audio and video signals
CN101370220B (en) Video media monitoring method and system
CN108540745B (en) High-definition double-stream video transmission method, transmitting end, receiving end and transmission system
US8842740B2 (en) Method and system for fast channel change
KR20180031673A (en) Switching display devices in video telephony
US20040184540A1 (en) Data processing system, data processing apparatus and data processing method
JPH06125363A (en) Packet communication system
CN114339316A (en) Video stream coding processing method based on live video
CN111385081A (en) End-to-end communication method, device, electronic equipment and medium
JP5488694B2 (en) Remote mobile communication system, server device, and remote mobile communication system control method
CN114554277A (en) Multimedia processing method, device, server and computer readable storage medium
CN108353035B (en) Method and apparatus for multiplexing data
US8565318B2 (en) Restamping transport streams to avoid vertical rolls
CN115102927B (en) SIP intercom method, system and storage device for keeping video clear
CN103079048B (en) Video and audio recording and program request implementation method when the call of multimedia command dispatching system keeps
KR100701032B1 (en) Video data transmission control system for network and method therefore

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Video Recording Generation Method Based on Audio Video Streaming Storage

Effective date of registration: 20231007

Granted publication date: 20210706

Pledgee: Guotou Taikang Trust Co.,Ltd.

Pledgor: HANGZHOU XUJIAN SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2023980059619