CN102364952B - Method for processing audio and video synchronization in simultaneous playing of plurality of paths of audio and video - Google Patents


Info

Publication number
CN102364952B
CN102364952B (application CN201110327166A)
Authority
CN
China
Prior art keywords
audio
video
compressed packet
channel
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201110327166
Other languages
Chinese (zh)
Other versions
CN102364952A (en)
Inventor
胡开荆
李群巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Wanpeng Digital Intelligence Technology Co ltd
Original Assignee
ZHEJIANG WANPENG NETWORK TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZHEJIANG WANPENG NETWORK TECHNOLOGY Co Ltd filed Critical ZHEJIANG WANPENG NETWORK TECHNOLOGY Co Ltd
Priority to CN 201110327166 priority Critical patent/CN102364952B/en
Publication of CN102364952A publication Critical patent/CN102364952A/en
Application granted granted Critical
Publication of CN102364952B publication Critical patent/CN102364952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to a method for processing audio and video synchronization when multiple channels of audio and video are played simultaneously. Conventional audio-video synchronization techniques cannot meet the requirement of multi-user communication applications to synchronize multiple channels of audio and video at the same time. In the method provided by the invention, each user captures its own audio and video data, compresses the captured data into audio and video compressed packets, marks the packets with timestamps, and sends them to a server. The server decompresses and mixes the audio compressed packets received from the users, records in each mixing result the timestamps of all audio compressed packets that participated in the mix, compresses the result into mixed compressed packets, and sends these to the clients; the video compressed packets are forwarded to the clients directly. After receiving the mixed compressed packets and the video compressed packets, each client decompresses the mixed packets, plays the decompressed audio data in order, and displays the video frames in the corresponding video compressed packets according to the principle of audio-driven video. The method preserves the synchronization relationships between all audio and video channels in full.

Description

Method for processing audio-video synchronization when multiple channels of audio and video are played simultaneously
Technical field
The invention belongs to the technical field of computer multimedia and relates to a method for processing multiple channels of audio and video after network transmission, specifically a method for handling audio-video synchronization when multiple channels of audio and video are played simultaneously.
Background technology
With the rapid development of Internet broadband technology and multimedia information technology, networked multimedia applications have become an important part of Internet use. In particular, a network teleconference involves interaction among many people, so multiple channels of audio and video must be played simultaneously. Every channel of audio and video must then be synchronized; otherwise "lip synchronization" cannot be achieved and the fluency of communication suffers. The traditional audio-video synchronization technique marks each audio and video packet with a timestamp and synchronizes playback according to that timestamp. This approach works only for one channel of audio and one channel of video; it fails with multiple audio and multiple video channels and cannot meet the requirement of multi-user communication applications such as video conferencing to synchronize multiple channels of audio and video simultaneously.
Summary of the invention
The objective of the invention is to address the deficiencies of the prior art by providing an audio-driven method for synchronizing multiple channels of video during playback.
The specific steps of the method are:
Step (1). Each user acquires its own audio and video data and compresses the audio and video separately. The captured audio data is divided into audio data units of 10 to 120 milliseconds each; each audio data unit is compressed into an audio compressed packet, and each audio compressed packet is marked with the client machine's timestamp at the moment of capture. Each frame of video data is compressed into a video compressed packet, and each video compressed packet is likewise marked with the client machine's timestamp at the moment of capture. All audio and video compressed packets are sent to the server;
Each user may acquire its audio and video data either by capturing from a device or by reading from a media file. When capturing from a device, the timestamp is the moment of capture. When reading from a media file, the media player or decompression component can set a timestamp for the data; this timestamp is relative to the start of the media file and is converted into a timestamp based on the current computer time.
Step (2). The server decompresses the audio compressed packets received from each user and mixes them, records in the mixing result the timestamps of all audio compressed packets that participated in the mix, compresses the result into a mixed compressed packet, and sends it to the clients. Video compressed packets are forwarded to the clients directly.
For N users U1 to UN, each user has one audio channel, giving N audio channels A1 to AN in total. The server must mix N+1 output channels:
Channel 0: contains all audio, denoted M0,
Channel 1: all audio except A1, denoted M1,
Channel 2: all audio except A2, denoted M2,
and so on,
Channel N: all audio except AN, denoted MN.
When generating each output channel, the timestamps of its N or N-1 source audio channels are written into that channel, so the channel carries N or N-1 timestamps, each corresponding to a source audio channel.
After these N+1 channels are generated, M0 is sent to all users who are not sending audio, M1 is sent to U1, M2 is sent to U2, and so on; the audio content sent to each user does not include that user's own audio.
Step (3). After receiving the mixed compressed packets and the video compressed packets, each client decompresses the mixed compressed packets, plays the decompressed audio data in order, and then displays the video frames in the corresponding video compressed packets according to the principle of audio-driven video.
Each client receives one channel of mixed compressed packets and the N channels of video compressed packets forwarded by the server. Playback is audio-driven: each time an audio compressed packet is played, all timestamps (U, A) contained in that packet are recorded. When playing user X's video, the client takes the timestamp (U_x, V_x) of the next video frame to be played on that channel and the timestamp (U_x, A_x) of the most recently played audio frame of the same user, and compares V_x with A_x. If V_x is greater than or equal to A_x, the video content follows the audio content and the frame can be played; if V_x is less than A_x, then by the audio-driven-video principle this video frame's playing moment has not yet arrived, so the client waits for the next playback decision to determine whether to play it.
The method uses audio timestamps as the tie that synchronizes the multiple video channels with the audio, so that every video channel achieves "lip synchronization" with the audio. When the server mixes audio, it does not mark each mixed compressed packet with a single timestamp; instead, the timestamps of all audio channels that participated in the mix are preserved together as the timestamp of the mixed compressed packet. In this way the synchronization relationships between all audio and video channels are preserved in full.
Embodiment
A method for processing audio-video synchronization when multiple channels of audio and video are played simultaneously; the specific steps are:
Step (1). Each user acquires its own audio and video data and compresses the audio and video separately. The captured audio data is divided into audio data units of 10 to 120 milliseconds each; each audio data unit is compressed into an audio compressed packet, and each audio compressed packet is marked with the client machine's timestamp at the moment of capture. Each frame of video data is compressed into a video compressed packet, and each video compressed packet is likewise marked with the client machine's timestamp at the moment of capture. All audio and video compressed packets are sent to the server.
Each user may acquire its audio and video data either by capturing from a device or by reading from a media file. When capturing from a device, the timestamp is the moment of capture. When reading from a media file, the media player or decompression component can set a timestamp for the data; this timestamp is relative to the start of the media file and is converted into a timestamp based on the current computer time.
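The conversion just described can be sketched as follows. This is a minimal illustration; the function name and the choice of millisecond units are assumptions, since the patent does not fix an implementation:

```python
def rebase_timestamp(media_relative_ms: int, playback_start_wallclock_ms: int) -> int:
    """Convert a timestamp relative to the start of a media file into a
    timestamp based on the current computer clock, as step (1) requires
    for data read from a file rather than captured from a device."""
    return playback_start_wallclock_ms + media_relative_ms

# Example: playback of the file began at wall-clock time 1_000_000 ms,
# so a frame stamped 2500 ms into the file maps to 1_002_500 ms.
print(rebase_timestamp(2500, 1_000_000))  # 1002500
```

Data captured from a device needs no conversion: its timestamp is already the wall-clock moment of capture.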
Video processing takes the input video frame by frame: after compression with a video encoder, each frame is cut, according to network conditions, into pieces of a size suitable for transmission (typically 400 to 1400 bytes) and sent to the server together with the timestamp of that video frame. To let the receiver sort packets and detect packet loss during transmission, every audio and video packet carries a sequence number. The sequence number is a 2-byte incrementing value that wraps back to 0 after reaching its maximum. To improve the user experience when bandwidth is poor, audio and video data are sent over different connections; when bandwidth is insufficient, the audio connection, which carries less data than the video connection, is more easily protected. Since audio is the main means of interaction and video is in general supplementary, this keeps the audio smoother and reduces the impact on the user.
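The fragmentation and wrapping 2-byte sequence numbering described above can be sketched like this (names are hypothetical; the patent specifies only the 400-1400 byte piece size and the 2-byte wrap-around behaviour):

```python
def fragment_frame(frame: bytes, mtu: int = 1400) -> list:
    """Cut one compressed video frame into pieces of at most `mtu` bytes
    (the patent suggests 400-1400 bytes depending on network conditions)."""
    return [frame[i:i + mtu] for i in range(0, len(frame), mtu)]

class SequenceNumber:
    """2-byte incrementing sequence number that wraps to 0 after 65535."""
    def __init__(self) -> None:
        self.value = 0

    def next(self) -> int:
        current = self.value
        self.value = (self.value + 1) % 0x10000  # wrap after the 2-byte maximum
        return current

seq = SequenceNumber()
pieces = fragment_frame(b"x" * 3000, mtu=1400)
packets = [(seq.next(), piece) for piece in pieces]
print(len(pieces))    # 3 pieces: 1400 + 1400 + 200 bytes
print(packets[0][0])  # first sequence number is 0
```

In a real sender, audio packets and video packets would each carry such a sequence number and travel over their own connection, as the paragraph above describes.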
Step (2). The server decompresses the audio compressed packets received from each user and mixes them, records in the mixing result the timestamps of all audio compressed packets that participated in the mix, compresses the result into a mixed compressed packet, and sends it to the clients. Video compressed packets are forwarded to the clients directly.
For N users U1, U2, ..., UN, each user has one audio channel, giving N audio channels A1, A2, ..., AN in total. The server must mix N+1 output channels:
Channel 0: contains all audio, denoted M0,
Channel 1: all audio except A1, denoted M1,
Channel 2: all audio except A2, denoted M2,
and so on,
Channel N: all audio except AN, denoted MN.
When generating each output channel, the timestamps of its N or N-1 source audio channels are written into that channel, so the channel carries N or N-1 timestamps, each corresponding to a source audio channel. For example, M0 contains (U1, A1), (U2, A2), ..., (UN, AN), while M1 contains (U2, A2), (U3, A3), ..., (UN, AN).
After these N+1 channels are generated, M0 is sent to all users who are not sending audio, M1 is sent to U1, M2 is sent to U2, and so on; the audio content sent to each user does not include that user's own audio, which avoids producing an echo in that user's loudspeaker.
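The mixing step with preserved per-source timestamps can be sketched as below. The additive PCM mixing and the data layout are assumptions made for illustration; the patent only requires that each mixed channel carry the (U, A) pairs of all its sources instead of a single timestamp:

```python
def mix_channels(units, exclude_user=None):
    """Mix one audio unit per user into a single output unit, excluding
    `exclude_user` (None produces channel M0, which contains everyone).
    The result carries the (user, timestamp) pairs of every source unit,
    as step (2) requires, rather than one timestamp.
    `units` maps user id -> (timestamp, list of PCM samples)."""
    sources = {u: ta for u, ta in units.items() if u != exclude_user}
    length = max(len(samples) for _, samples in sources.values())
    mixed = [0] * length
    for _, samples in sources.values():
        for i, s in enumerate(samples):
            mixed[i] += s  # simple additive mix; real mixers also scale/clip
    timestamps = [(u, ts) for u, (ts, _) in sorted(sources.items())]
    return timestamps, mixed

units = {1: (100, [1, 1]), 2: (105, [2, 2]), 3: (99, [3, 3])}
m0 = mix_channels(units)                   # M0: all users
m1 = mix_channels(units, exclude_user=1)   # M1: everyone except user 1
print(m0)  # ([(1, 100), (2, 105), (3, 99)], [6, 6])
print(m1)  # ([(2, 105), (3, 99)], [5, 5])
```

Calling this once per user (plus once with no exclusion) yields the N+1 channels M0 through MN described above.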
Step (3). After receiving the mixed compressed packets and the video compressed packets, each client decompresses the mixed compressed packets, plays the decompressed audio data in order, and then displays the video frames in the corresponding video compressed packets according to the principle of audio-driven video.
Each client receives one channel of mixed compressed packets and the N channels of video compressed packets forwarded by the server. Playback is audio-driven: each time an audio compressed packet is played, all timestamps (U, A) contained in that packet are recorded. When playing user X's video, the client takes the timestamp (U_x, V_x) of the next video frame to be played on that channel and the timestamp (U_x, A_x) of the most recently played audio frame of the same user, and compares V_x with A_x. If V_x is greater than or equal to A_x, the video content follows the audio content and the frame can be played; if V_x is less than A_x, then by the audio-driven-video principle this video frame's playing moment has not yet arrived, so the client waits for the next playback decision to determine whether to play it.
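The playback decision can be sketched as follows, implementing the comparison rule exactly as the patent states it (the frame may be displayed when V_x is greater than or equal to A_x). All names here are hypothetical:

```python
def should_display(video_ts: int, last_audio_ts: int) -> bool:
    """Audio-driven video rule as stated in the patent: the pending video
    frame (timestamp V_x) of user x is displayed once V_x >= A_x, where
    A_x is the timestamp of the most recently played audio frame of the
    same user; otherwise it waits for the next playback decision."""
    return video_ts >= last_audio_ts

last_audio = {}  # user id -> A_x, updated each time an audio packet is played

def on_audio_packet_played(timestamps):
    """Record all (U, A) pairs carried by the mixed packet just played."""
    for user, ts in timestamps:
        last_audio[user] = ts

on_audio_packet_played([(1, 100), (2, 105)])
print(should_display(video_ts=102, last_audio_ts=last_audio[1]))  # True
print(should_display(video_ts=98, last_audio_ts=last_audio[1]))   # False
```

Because the mixed packet carries one (U, A) pair per source, each user's video channel is compared against that user's own audio timestamp, which is what makes per-channel lip synchronization possible.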
Network transmission is highly nondeterministic, mainly in two respects: packet reordering and reception delay. When data is sent over TCP, data sent on different connections may be received in a different order than it was sent; when data is sent over UDP, the arrival order of individual packets is not guaranteed either. This is the out-of-order characteristic of packets. Whether TCP or UDP is used, the time a packet takes to reach the other computer is uncertain and varies with network transmission quality, typically fluctuating between 1 millisecond and 500 milliseconds, and possibly reaching several seconds when the network is poor. Because of these two characteristics, the received audio and video data must be sorted and buffered separately. Sorting is based on the sequence numbers in the packets, and the buffering time is determined by the network delay. A smaller network delay means better network conditions, so the amount of buffered audio data can be reduced for better real-time performance. A larger network delay means worse network conditions, so playback is paused until the duration of buffered audio equals the network delay; this sacrifices some real-time performance but improves playback fluency and avoids the stutter that occurs when a too-short buffer runs out of data during playback.
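A minimal jitter-buffer sketch of the sorting-plus-buffering behaviour described above (the 20 ms unit duration, class name, and delay handling are assumptions; wrap-around of sequence numbers is omitted for brevity):

```python
import heapq

class JitterBuffer:
    """Reorder incoming packets by sequence number and hold them until
    the buffered duration covers the measured network delay, as the
    embodiment describes."""
    def __init__(self, unit_ms: int = 20):
        self.heap = []           # (sequence number, payload), min-ordered
        self.unit_ms = unit_ms   # duration of one audio unit
        self.next_seq = 0        # next sequence number due for playback

    def push(self, seq: int, payload: bytes) -> None:
        heapq.heappush(self.heap, (seq, payload))

    def buffered_ms(self) -> int:
        return len(self.heap) * self.unit_ms

    def pop_ready(self, network_delay_ms: int):
        """Release in-order packets only while enough audio is buffered
        to cover the network delay; otherwise playback stays paused."""
        out = []
        while (self.heap and self.buffered_ms() >= network_delay_ms
               and self.heap[0][0] == self.next_seq):
            out.append(heapq.heappop(self.heap)[1])
            self.next_seq += 1
        return out

buf = JitterBuffer(unit_ms=20)
for seq in (2, 0, 1):  # packets arrive out of order
    buf.push(seq, f"pkt{seq}".encode())
print(buf.pop_ready(network_delay_ms=20))  # [b'pkt0', b'pkt1', b'pkt2']
```

With a larger measured delay the same buffer would keep packets queued, pausing playback until enough audio has accumulated, which is the trade of real-time performance for fluency described above.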

Claims (1)

1. A method for processing audio-video synchronization when multiple channels of audio and video are played simultaneously, characterized in that the specific steps of the method are:
Step (1). Each user acquires its own audio and video data and compresses the audio and video separately; the captured audio data is divided into audio data units of 10 to 120 milliseconds each, each audio data unit is compressed into an audio compressed packet, and each audio compressed packet is marked with the client machine's timestamp at the moment of capture; each frame of video data is compressed into a video compressed packet, and each video compressed packet is marked with the client machine's timestamp at the moment of capture; all audio and video compressed packets are sent to the server;
Each user may acquire its audio and video data either by capturing from a device or by reading from a media file; when capturing from a device, the timestamp is the moment of capture; when reading from a media file, the media player or decompression component can set a timestamp for the data, this timestamp being relative to the start of the media file and converted into a timestamp based on the current computer time;
Step (2). The server decompresses the audio compressed packets received from each user and mixes them, records in the mixing result the timestamps of all audio compressed packets that participated in the mix, compresses the result into a mixed compressed packet, and sends it to the clients; video compressed packets are forwarded to the clients directly;
For N users U1 to UN, each user has one audio channel, giving N audio channels A1 to AN in total; the server must mix N+1 output channels:
Channel 0: contains all audio, denoted M0,
Channel 1: all audio except A1, denoted M1,
Channel 2: all audio except A2, denoted M2,
and so on,
Channel N: all audio except AN, denoted MN;
When generating each output channel, the timestamps of its N or N-1 source audio channels are written into that channel, so the channel carries N or N-1 timestamps, each corresponding to a source audio channel;
After these N+1 channels are generated, M0 is sent to all users who are not sending audio, M1 is sent to U1, M2 is sent to U2, and so on; the audio content sent to each user does not include that user's own audio;
Step (3). After receiving the mixed compressed packets and the video compressed packets, each client decompresses the mixed compressed packets, plays the decompressed audio data in order, and then displays the video frames in the corresponding video compressed packets according to the principle of audio-driven video;
Each client receives one channel of mixed compressed packets and the N channels of video compressed packets forwarded by the server; playback is audio-driven: each time an audio compressed packet is played, all timestamps (U, A) contained in that packet are recorded; when playing user X's video, the client takes the timestamp (U_x, V_x) of the next video frame to be played on that channel and the timestamp (U_x, A_x) of the most recently played audio frame of the same user, and compares V_x with A_x; if V_x is greater than or equal to A_x, the video content follows the audio content and the frame can be played; if V_x is less than A_x, then by the audio-driven-video principle this video frame's playing moment has not yet arrived, and the client waits for the next playback decision to determine whether to play it.
CN 201110327166 2011-10-25 2011-10-25 Method for processing audio and video synchronization in simultaneous playing of plurality of paths of audio and video Active CN102364952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110327166 CN102364952B (en) 2011-10-25 2011-10-25 Method for processing audio and video synchronization in simultaneous playing of plurality of paths of audio and video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110327166 CN102364952B (en) 2011-10-25 2011-10-25 Method for processing audio and video synchronization in simultaneous playing of plurality of paths of audio and video

Publications (2)

Publication Number Publication Date
CN102364952A CN102364952A (en) 2012-02-29
CN102364952B true CN102364952B (en) 2013-12-25

Family

ID=45691502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110327166 Active CN102364952B (en) 2011-10-25 2011-10-25 Method for processing audio and video synchronization in simultaneous playing of plurality of paths of audio and video

Country Status (1)

Country Link
CN (1) CN102364952B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103702013B (en) * 2013-11-28 2017-02-01 北京航空航天大学 Frame synchronization method for multiple channels of real-time videos
CN105187883B (en) * 2015-09-11 2018-05-29 广东威创视讯科技股份有限公司 A kind of data processing method and client device
US9979997B2 (en) 2015-10-14 2018-05-22 International Business Machines Corporation Synchronization of live audio and video data streams
CN105516090B (en) * 2015-11-27 2019-01-22 刘军 Media playing method, equipment and music lesson system
CN106658030B (en) * 2016-12-30 2019-07-30 上海寰视网络科技有限公司 A kind of playback method and equipment of the composite video comprising SCVF single channel voice frequency multi-channel video
CN107195308B (en) * 2017-04-14 2021-03-16 苏州科达科技股份有限公司 Audio mixing method, device and system of audio and video conference system
CN106941613A (en) * 2017-04-14 2017-07-11 武汉鲨鱼网络直播技术有限公司 A kind of compacting of audio frequency and video interflow and supplying system and method
CN108021675B (en) * 2017-12-07 2021-11-09 北京慧听科技有限公司 Automatic segmentation and alignment method for multi-equipment recording
CN109120974A (en) * 2018-07-25 2019-01-01 深圳市异度信息产业有限公司 A kind of method and device that audio-visual synchronization plays
CN109600649A (en) * 2018-08-01 2019-04-09 北京微播视界科技有限公司 Method and apparatus for handling data
CN109361886A (en) * 2018-10-24 2019-02-19 杭州叙简科技股份有限公司 A kind of conference video recording labeling system based on sound detection
CN111277885B (en) * 2020-03-09 2023-01-10 北京世纪好未来教育科技有限公司 Audio and video synchronization method and device, server and computer readable storage medium
CN113259762B (en) * 2021-04-07 2022-10-04 广州虎牙科技有限公司 Audio processing method and device, electronic equipment and computer readable storage medium
CN114760274B (en) * 2022-06-14 2022-09-02 北京新唐思创教育科技有限公司 Voice interaction method, device, equipment and storage medium for online classroom

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
PL183167B1 (en) * 1995-12-07 2002-05-31 Koninkl Philips Electronics Nv Method of receiving encoded non-pcm phonic bit streams and multiplechannel reproducing equipment incorporating an apparatus receiving encoded non-pcm phonic bit streams
CN100438634C (en) * 2006-07-14 2008-11-26 杭州国芯科技有限公司 Video-audio synchronization method
CN101232623A (en) * 2007-01-22 2008-07-30 李会根 System and method for transmitting stereo audio and video numerical coding based on transmission stream

Also Published As

Publication number Publication date
CN102364952A (en) 2012-02-29

Similar Documents

Publication Publication Date Title
CN102364952B (en) Method for processing audio and video synchronization in simultaneous playing of plurality of paths of audio and video
TW589892B (en) Instant video conferencing method, system and storage medium implemented in web game using A/V synchronization technology
CN102893542B (en) Method and apparatus for synchronizing data in a vehicle
CN102655584B (en) The method and system that media data sends and played in a kind of Telepresence
CN100579238C (en) Synchronous playing method for audio and video buffer
US20050062843A1 (en) Client-side audio mixing for conferencing
CN101271720A (en) Synchronization process for mobile phone stream media audio and video
CN104426832A (en) Multi-terminal multichannel independent playing method and device
EP1976290A1 (en) Apparatus, network device and method for transmitting video-audio signal
EP1675399A3 (en) Fast channel switching for digital TV
CN105491393A (en) Method for implementing multi-user live video business
JP2004525545A (en) Webcast method and system for synchronizing multiple independent media streams in time
CN109361945A (en) The meeting audiovisual system and its control method of a kind of quick transmission and synchronization
CN105992040A (en) Multichannel audio data transmitting method, audio data synchronization playing method and devices
CN101998174B (en) Quick access method, server, client and system of multicast RTP (real time protocol) session
US8385234B2 (en) Media stream setup in a group communication system
JP2011509543A5 (en)
WO2008028367A1 (en) A method for realizing multi-audio tracks for mobile mutilmedia broadcasting system
CN105791939A (en) Audio and video synchronization method and apparatus
JP2009284282A (en) Content server, information processing apparatus, network device, content distribution method, information processing method, and content distribution system
CN101516057B (en) Method for realizing streaming media through mobile terminal
CN110267064A (en) Audio broadcast state processing method, device, equipment and storage medium
CN1972407A (en) A video and audio synchronization playing method for mobile multimedia broadcasting
WO2017071670A1 (en) Audio and video synchronization method, device and system
JP2015070460A (en) System and method for video and voice distribution, and program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee

Owner name: ZHEJIANG WINUPON EDUCATIONAL TECHNOLOGY CO., LTD.

Free format text: FORMER NAME: ZHEJIANG WINUPON NETWORK TECHNOLOGY CO., LTD.

CP03 Change of name, title or address

Address after: Room 1406, Hangzhou Electronic Commerce Building, No. 118 Wensan West Road, Xihu District, Hangzhou City, Zhejiang Province, 310013

Patentee after: ZHEJIANG WANPENG EDUCATION SCIENCE AND TECHNOLOGY STOCK CO.,LTD.

Address before: 15th Floor, Electronic Commerce Building, No. 118 West Road, Hangzhou City, Zhejiang Province, 310013

Patentee before: ZHEJIANG WANPENG NETWORK TECHNOLOGY Co.,Ltd.

CP02 Change in the address of a patent holder

Address after: 310051 12 / F, building 8, No. 19, Jugong Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: ZHEJIANG WANPENG EDUCATION SCIENCE AND TECHNOLOGY STOCK Co.,Ltd.

Address before: Room 1406, Hangzhou Electronic Commerce Building, No. 118 Wensan West Road, Xihu District, Hangzhou City, Zhejiang Province, 310013

Patentee before: ZHEJIANG WANPENG EDUCATION SCIENCE AND TECHNOLOGY STOCK Co.,Ltd.

CP02 Change in the address of a patent holder
CP01 Change in the name or title of a patent holder

Address after: 12 / F, building 8, No. 19, Jugong Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province, 310051

Patentee after: Zhejiang Wanpeng Digital Intelligence Technology Co.,Ltd.

Address before: 12 / F, building 8, No. 19, Jugong Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province, 310051

Patentee before: ZHEJIANG WANPENG EDUCATION SCIENCE AND TECHNOLOGY STOCK CO.,LTD.

CP01 Change in the name or title of a patent holder