CN112584216B - Lip sound synchronization method and device - Google Patents


Info

Publication number: CN112584216B
Authority: CN (China)
Prior art keywords: video, audio, frame, synchronization parameter, parameter pair
Legal status: Active
Application number: CN201910937097.1A
Other languages: Chinese (zh)
Other versions: CN112584216A
Inventors: 黄凡夫, 辛安民
Current Assignee: Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee: Hangzhou Hikvision Digital Technology Co Ltd
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910937097.1A
Publication of CN112584216A (application), CN112584216B (grant)

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems

Abstract

The application provides a lip sound synchronization method and apparatus. The method includes: receiving a video frame and an audio frame of a first device, and receiving a video synchronization parameter pair and an audio synchronization parameter pair of the first device, where the video synchronization parameter pair includes a correspondence between the acquisition time of video data and the timestamp of a video frame, and the audio synchronization parameter pair includes a correspondence between the acquisition time of audio data and the timestamp of an audio frame; determining the acquisition time of a video frame of the first device based on the video synchronization parameter pair of the first device; determining the acquisition time of an audio frame of the first device based on the audio synchronization parameter pair of the first device; and synchronizing the video frame and the audio frame of the first device based on the acquisition time of the video frame and the acquisition time of the audio frame of the first device. The method can achieve lip sound synchronization in a video conference.

Description

Lip sound synchronization method and device
Technical Field
The present disclosure relates to video conferencing and real-time communication, and more particularly, to a lip synchronization method and apparatus.
Background
Video conferencing refers to a conference in which people at two or more locations hold a face-to-face conversation via communication devices and a network. By using a video conference system, participants can hear the sound of other conference sites, see the images, actions and expressions of the participants at those sites, and can also send electronic presentation content, giving the participants the feeling of being present in person.
A video conference system involves the transmission of audio and video. Because of network factors during transmission, the audio and the video of the same participant may reach the receiving end at different times, causing lip sound to be out of synchronization.
Disclosure of Invention
In view of the above, the present application provides a lip sound synchronization method and apparatus.
Specifically, the method is realized through the following technical scheme:
according to a first aspect of embodiments of the present application, there is provided a lip sound synchronization method, including:
receiving a video frame and an audio frame of a first device; and receiving a video synchronization parameter pair and an audio synchronization parameter pair of the first device; the video synchronization parameter pair comprises a corresponding relation between the acquisition time of video data and a time stamp of a video frame, and the audio synchronization parameter pair comprises a corresponding relation between the acquisition time of audio data and a time stamp of an audio frame;
determining the acquisition time of a video frame of the first device based on the video synchronization parameter pair of the first device; determining the acquisition time of the audio frame of the first device based on the audio synchronization parameter pair of the first device;
and synchronizing the video frame and the audio frame of the first device based on the acquisition time of the video frame and the acquisition time of the audio frame of the first device.
According to a second aspect of embodiments of the present application, there is provided a lip sound synchronization device, including:
a receiving unit for receiving a video frame and an audio frame of a first device; and receiving a video synchronization parameter pair and an audio synchronization parameter pair of the first device; the video synchronization parameter pair comprises a corresponding relation between the acquisition time of video data and a time stamp of a video frame, and the audio synchronization parameter pair comprises a corresponding relation between the acquisition time of audio data and a time stamp of an audio frame;
the determining unit is used for determining the acquisition time of the video frame of the first device based on the video synchronization parameter pair of the first device, and determining the acquisition time of the audio frame of the first device based on the audio synchronization parameter pair of the first device;
and the processing unit is used for synchronizing the video frame and the audio frame of the first device based on the acquisition time of the video frame and the acquisition time of the audio frame of the first device.
The lip sound synchronization method of the embodiment of the application receives a video frame and an audio frame of a first device, and receives a video synchronization parameter pair and an audio synchronization parameter pair of the first device; determines the acquisition time of a video frame of the first device based on the video synchronization parameter pair of the first device; determines the acquisition time of the audio frame of the first device based on the audio synchronization parameter pair of the first device; and then synchronizes the video frame and the audio frame of the first device based on the acquisition time of the video frame and the acquisition time of the audio frame of the first device, thereby realizing lip sound synchronization in the video conference.
Drawings
Fig. 1 is a schematic flow chart illustrating a lip synchronization method according to an exemplary embodiment of the present application;
FIG. 2 is a flow chart illustrating another lip synchronization method according to another exemplary embodiment of the present application;
FIG. 3 is a flow chart illustrating another lip synchronization method according to another exemplary embodiment of the present application;
FIG. 4 is an architectural diagram illustrating a video conferencing system in accordance with an exemplary embodiment of the present application;
fig. 5A to 5D are schematic diagrams of an audio/video data forwarding scenario shown in an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a lip synchronization device according to an exemplary embodiment of the present application;
fig. 7 is a schematic structural diagram of another lip sound synchronization apparatus according to still another exemplary embodiment of the present application;
fig. 8 is a schematic structural view of another lip sound synchronization apparatus according to still another exemplary embodiment of the present application;
fig. 9 is a schematic diagram of a hardware structure of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
In order to make the technical solutions provided in the embodiments of the present application better understood and make the above objects, features and advantages of the embodiments of the present application more comprehensible, the technical solutions in the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, which is a schematic flow chart of a lip sound synchronization method according to an embodiment of the present application, the lip sound synchronization method may include the following steps:
It should be noted that, in this embodiment of the application, the execution subject of step S100 to step S120 may be any video conference device (referred to as a target device herein) in a video conference system architecture, and the target device may include, but is not limited to, an intelligent integrated terminal applied to a video conference, a mobile terminal integrated with a video conference APP, a PC (Personal Computer) integrated with a video conference APP, an MCU (Multipoint Control Unit), an SFU (Selective Forwarding Unit), or the like.
Step S100, receiving a video frame and an audio frame of a first device; and receiving a video synchronization parameter pair and an audio synchronization parameter pair of the first device; the video synchronization parameter pair comprises the correspondence between the acquisition time of the video data and the timestamp of the video frame, and the audio synchronization parameter pair comprises the correspondence between the acquisition time of the audio data and the timestamp of the audio frame.
In this embodiment of the application, the first device may be any video conference device in the video conference architecture other than the target device, such as another video conference device directly connected to the target device, or another video conference device connected to the target device through a relay device.
It should be noted that, in this embodiment of the present application, a direct connection between video conference devices means that traffic between them is not relayed through another video conference terminal; switching devices such as routers and switches are still allowed to exist between two directly connected video conference terminals. This is not repeated in the rest of this embodiment of the present application.
In the embodiment of the application, consideration is given to the situation in which audio and video data are synthesized before being forwarded. Because synthesis and forwarding change the timestamps of the audio and video data, the timestamps of audio data and video data acquired at the same moment may differ.
For example, in the process of forwarding audio data or/and video data provided by a device A, the audio data or/and video data may be combined on a certain relay device with the audio data or/and video data provided by another device (e.g., device B) (audio data synthesized with audio data, video data synthesized with video data), re-encoded, repackaged, and forwarded after being stamped with a new timestamp. At this time, the timestamps (the timestamps obtained by repackaging on the relay device) corresponding to the audio data and the video data provided by device A and acquired at the same time may not be the same, and thus lip sound synchronization cannot be achieved directly according to the timestamps.
In addition, for audio and video data acquired by the same device at different times, the absolute time corresponding to the difference between the timestamps stamped when the audio and video data are packetized can generally be considered consistent with the difference between the acquisition times of the audio and video data.
For example, assuming that the audio data a1 and a2 are collected by device A at times T1 and T2, and that the timestamps stamped when a1 and a2 are packetized are TS1 and TS2, the absolute time corresponding to the difference between TS2 and TS1 is generally consistent with the difference between T2 and T1.
Wherein an absolute time corresponding to the difference of the timestamps may be determined based on the sampling frequency.
For example, for audio data, the ratio of the timestamp difference to the audio sampling frequency is the absolute time (in seconds) corresponding to that timestamp difference; since the acquisition time is usually expressed in milliseconds, after converting this ratio to milliseconds (i.e., multiplying by 1000), it is usually consistent with the difference between the acquisition times of the audio data.
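For illustration only (not part of the patent disclosure), the conversion described above can be checked numerically; the 48 kHz sampling frequency and the concrete timestamps and acquisition times below are assumed values:

```python
# Assumed values for illustration: a 48 kHz audio stream, two audio frames whose
# timestamps differ by 4800 samples and that were captured 100 ms apart.
AUDIO_SYSTEM_CLOCK = 48000            # audio sampling frequency in Hz (assumed)

TS_A1, TS_A2 = 100000, 104800         # timestamps stamped at packetization (assumed)
T1_MS, T2_MS = 5000, 5100             # acquisition times in milliseconds (assumed)

# Absolute time corresponding to the timestamp difference, converted to milliseconds.
delta_ms = 1000 * (TS_A2 - TS_A1) / AUDIO_SYSTEM_CLOCK   # -> 100.0

# For the same device, this matches the difference between the acquisition times.
assert abs(delta_ms - (T2_MS - T1_MS)) < 1e-6
```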
Based on this, for any device, by maintaining a correspondence between an acquisition time of the device's audio data (or video data) and the timestamp stamped at that same time, the acquisition time of the audio data (or video data) in any audio data packet (or video data packet) of the device can be determined from the timestamp in that packet, so that lip sound synchronization can be achieved based on the acquisition times of the audio data and the video data.
Correspondingly, in order to realize lip sound synchronization in a video conference, any video conference device may generate video frames and audio frames from the collected video data and audio data and send them to other video conference devices. It may also generate a video synchronization parameter pair containing the correspondence between the acquisition time of the video data and the timestamp stamped when the video data is encoded and encapsulated, generate an audio synchronization parameter pair containing the correspondence between the acquisition time of the audio data and the timestamp stamped when the audio data is encoded and encapsulated, and send the video synchronization parameter pair and the audio synchronization parameter pair to the other video conference devices.
For example, for video data acquired by any video conference device, when the acquisition time of the video data acquired by the video conference device is T1, and the timestamp in the video frame obtained by encoding and encapsulating the acquired video data by the video conference device is T2, the video conference device may generate a video synchronization parameter pair including the correspondence between T1 and T2.
The audio synchronization parameter pair may be generated similarly.
For example, a video conference device may send the audio data and video data it acquires, together with its video synchronization parameter pair and audio synchronization parameter pair, directly to a directly connected video conference device; for a video conference device that is not directly connected, the acquired audio data and video data, the video synchronization parameter pair and the audio synchronization parameter pair may be sent to a relay device, and the relay device forwards the audio and video data, the video synchronization parameter pair and the audio synchronization parameter pair.
Accordingly, the target device receiving the video frame, the audio frame, and the video synchronization parameter pair and the audio synchronization parameter pair of the first device may include receiving the video frame, the audio frame, and the video synchronization parameter pair and the audio synchronization parameter pair transmitted by the first device; alternatively, receiving the video frame, the audio frame, and the video synchronization parameter pair and the audio synchronization parameter pair of the first device forwarded by the relay device between the target device and the first device may also be included.
It should be noted that the first device does not need to generate a video synchronization parameter pair (or audio synchronization parameter pair) every time it generates a video frame or an audio frame from the captured video data (or audio data); for example, the first device may periodically generate a corresponding video synchronization parameter pair (or audio synchronization parameter pair) based on the acquisition time of the video data (or audio data) and the timestamp stamped when the video data (or audio data) is encoded and encapsulated. Accordingly, the first device may also periodically transmit the video synchronization parameter pair (or audio synchronization parameter pair).
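As a minimal, hypothetical sketch of the sender side described above (the type and helper names are not from the patent), a synchronization parameter pair simply binds an acquisition time to the timestamp stamped on the corresponding encoded frame, and the pair is emitted at most once per period rather than once per frame:

```python
import time
from dataclasses import dataclass

@dataclass
class SyncParameterPair:
    """Correspondence between an acquisition time (ms) and the timestamp stamped
    when the corresponding frame was encoded and encapsulated."""
    capture_time_ms: int
    timestamp: int

class PeriodicPairSender:
    """Sends a synchronization parameter pair at most once per period,
    rather than once per frame (hypothetical helper; the period is chosen freely)."""

    def __init__(self, send_fn, period_s: float = 1.0):
        self.send_fn = send_fn          # e.g. a signalling channel to peer devices
        self.period_s = period_s
        self._last_sent = float("-inf")

    def on_frame(self, capture_time_ms: int, frame_timestamp: int) -> None:
        now = time.monotonic()
        if now - self._last_sent >= self.period_s:
            self._last_sent = now
            self.send_fn(SyncParameterPair(capture_time_ms, frame_timestamp))
```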
Step S110, determining the acquisition time of a video frame of the first device based on the video synchronization parameter pair of the first device; and determining the acquisition time of the audio frame of the first device based on the audio synchronization parameter pair of the first device.
In this embodiment of the application, when receiving the video synchronization parameter pair and the audio synchronization parameter pair of the first device, the target device may store the video synchronization parameter pair and the audio synchronization parameter pair of the first device, determine the acquisition time of the received video data of the first device based on the video synchronization parameter pair of the first device, and determine the acquisition time of the received audio data of the first device based on the audio synchronization parameter pair of the first device.
In one example, for a received target video frame of a first device, determining an acquisition time of the video frame of the first device based on the video synchronization parameter pair of the first device may include:
for a received target video frame of the first device, determining the acquisition time of the target video frame by the following formula:
NTP_VN=NTP_V0+1000*(TS_VN-TS_V0)/VIDEO_SYSTEM_CLOCK
the NTP_V0 is a collection time included in the video synchronization parameter pair of the first device, the TS_V0 is a timestamp included in the video synchronization parameter pair of the first device, the NTP_VN is a collection time of the target video frame, the TS_VN is a timestamp of the target video frame, and the VIDEO_SYSTEM_CLOCK is a video sampling frequency of the first device;
or/and,
the determining the capturing time of the audio frame of the first device based on the audio synchronization parameter pair of the first device includes:
for a received target audio frame of the first device, determining a capture moment of the target audio frame by the following formula:
NTP_AN=NTP_A0+1000*(TS_AN-TS_A0)/AUDIO_SYSTEM_CLOCK
the NTP_A0 is a collection time included in the audio synchronization parameter pair of the first device, the TS_A0 is a timestamp included in the audio synchronization parameter pair of the first device, the NTP_AN is a collection time of the target audio frame, the TS_AN is a timestamp of the target audio frame, and the AUDIO_SYSTEM_CLOCK is an audio sampling frequency of the first device.
Illustratively, take as an example the case where the target device determines the collection time of the video data of the first device and the collection time is expressed as NTP time.
For any video frame of the first device received by the target device (referred to herein as the target video frame), the target device may determine the difference between the timestamp of the target video frame (i.e. TS_VN in the above formula) and the timestamp in the stored video synchronization parameter pair of the first device (i.e. TS_V0 in the above formula), that is, TS_VN-TS_V0, and convert the difference into an NTP time difference, i.e. 1000*(TS_VN-TS_V0)/VIDEO_SYSTEM_CLOCK. Since, for the same device, the difference between the acquisition times of video data acquired at different times is consistent with the absolute time corresponding to the difference between the timestamps in the video frames corresponding to that video data, the target device may determine the acquisition time of the target video frame (i.e. the acquisition time of the video data that was encoded and encapsulated into the target video frame) from the acquisition time included in the video synchronization parameter pair of the first device and this difference.
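A minimal sketch of this computation (not the patent's implementation; the helper name and the example values are assumptions) is:

```python
def capture_time_ms(frame_ts, pair_capture_ms, pair_ts, system_clock):
    """NTP_XN = NTP_X0 + 1000 * (TS_XN - TS_X0) / SYSTEM_CLOCK, as in the formulas above."""
    return pair_capture_ms + 1000 * (frame_ts - pair_ts) / system_clock

# Assumed example: video synchronization parameter pair (NTP_V0 = 5000 ms, TS_V0 = 90000),
# a 90 kHz video clock, and a target video frame with timestamp TS_VN = 180000.
print(capture_time_ms(180000, 5000, 90000, 90000))   # -> 6000.0 ms
```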
It should be noted that the collection time of the audio data (or video data) is usually based on the system time of the device, and when the device runs for a long time, the system time may gradually drift. Therefore, in order to optimize the lip sound synchronization effect, any video conference device may periodically update its video synchronization parameter pair and audio synchronization parameter pair and send the updated pairs to the other video conference devices; the other video conference devices update the locally stored video synchronization parameter pair and audio synchronization parameter pair of that video conference device according to the most recently received pairs sent by it, so as to ensure the timeliness of the locally stored video synchronization parameter pair and audio synchronization parameter pair, thereby optimizing the lip sound synchronization effect obtained for the audio and video data of the corresponding video conference device based on the locally stored video synchronization parameter pair and audio synchronization parameter pair.
Accordingly, based on the above synchronization parameter pairs, the first device may periodically update and transmit the video synchronization parameter pair (or the audio synchronization parameter pair), and the target device may determine the acquisition time of the video frame (or the audio frame) of the first device based on the most recently received video synchronization parameter pair (or audio synchronization parameter pair), and perform lip sound synchronization based on the determined acquisition time of the video frame (or audio frame).
Step S120, synchronizing the video frame and the audio frame of the first device based on the acquisition time of the video frame and the acquisition time of the audio frame of the first device.
In this embodiment of the application, when the target device determines the capturing time of the video frame and the capturing time of the audio frame of the first device according to the above manner, the video frame and the audio frame of the first device may be synchronized based on the capturing time of the video frame and the capturing time of the audio frame of the first device, so as to ensure that the capturing times of the video frame and the audio frame of the first device for synthesis and rendering are consistent.
Illustratively, the acquisition times of the video frame and the audio frame being consistent may allow a deviation between the acquisition time of the video frame and the acquisition time of the audio frame that does not exceed a preset threshold (which may be set in advance according to the actual scenario).
In one example, the synchronizing the video frame and the audio frame of the first device based on the capturing time of the video frame and the capturing time of the audio frame of the first device may include:
adjusting the size of a video frame buffer space and the size of an audio frame buffer space for the first device according to the acquisition time of the video frame and the acquisition time of the audio frame of the first device;
the video frame and the audio frame of the first device are cached, so that the video frame and the audio frame of the first device participating in synthesis and rendering are ensured to be consistent in acquisition time when the video frame and the audio frame of the first device are synthesized and rendered.
For example, when the target device determines the capturing time of the video frame and the capturing time of the audio frame of the first device in the above manner, the size of the video frame buffer space and the size of the audio frame buffer space for the first device may be adjusted according to the capturing time of the video frame and the capturing time of the audio frame of the first device.
For example, take the case where the acquisition time of the video frame of the first device determined by the target device is earlier than the acquisition time of the audio frame, that is, of the video data and audio data acquired at the same moment, the video data reaches the target device first. In order to ensure that the video data and the audio data of the first device can be synchronized on the target device and to avoid video data loss, the target device needs to ensure that the video frame buffer space for the first device is large enough to buffer the video frames corresponding to the difference between the acquisition time of the video frame and the acquisition time of the audio frame.
In addition, considering that network delay is variable, the difference between the acquisition times of the video frames and audio frames of the first device received by the target device may also change dynamically; therefore, the target device needs to dynamically adjust the size of the video frame buffer space and the size of the audio frame buffer space for the first device based on the determined difference between the acquisition times of the video frames and audio frames of the first device.
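The following sketch (hypothetical names; the 40 ms tolerance is an assumed threshold, not a value from the patent) shows one way a receiver might buffer the two streams and release only frames whose acquisition times are consistent:

```python
from collections import deque

SYNC_THRESHOLD_MS = 40   # assumed tolerance between video and audio acquisition times

class LipSyncBuffer:
    """Per-source buffers of (acquisition_time_ms, frame). A video/audio pair is
    released for composition and rendering only when their acquisition times
    differ by no more than the threshold; the deques grow and shrink with the
    (dynamically changing) gap between the two streams."""

    def __init__(self):
        self.video = deque()
        self.audio = deque()

    def push_video(self, capture_ms, frame):
        self.video.append((capture_ms, frame))

    def push_audio(self, capture_ms, frame):
        self.audio.append((capture_ms, frame))

    def pop_synced(self):
        while self.video and self.audio:
            (vt, vf), (at, af) = self.video[0], self.audio[0]
            if abs(vt - at) <= SYNC_THRESHOLD_MS:
                self.video.popleft()
                self.audio.popleft()
                return vf, af
            # The older frame can no longer be matched; discard it and retry.
            (self.video if vt < at else self.audio).popleft()
        return None
```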
As can be seen, in the flow of the method shown in fig. 1, a video conference device sends its video synchronization parameter pair and audio synchronization parameter pair to other video conference devices, so that the other video conference devices that receive its video data and audio data can determine the acquisition times of that video data and audio data based on the video synchronization parameter pair and the audio synchronization parameter pair, and can then achieve lip sound synchronization based on those acquisition times.
Referring to fig. 2, as a possible implementation manner, the lip synchronization method may further include the following steps:
step S200, receiving an acquisition request aiming at the video frame and the audio frame of the first equipment, which is sent by the second equipment.
Step S210, sending the video frame and the audio frame of the first device to the second device; and sending the video synchronization parameter pair and the audio synchronization parameter pair of the first device to the second device, so that the second device synchronizes the video frame and the audio frame of the first device based on the video synchronization parameter pair and the audio synchronization parameter pair of the first device.
Illustratively, the second device is any video conference device in the video conference system except the target device and the first device.
In this embodiment, in order to improve the scalability of the devices, each video conference device in the video conference system can, in addition to acquiring and sending its own audio and video data and obtaining the audio and video data of other video conference devices, also act as a relay device that forwards the audio and video data of other video conference devices.
When the target device receives an acquisition request for a video frame and an audio frame of the first device, which is sent by the second device, the target device may send the video frame and the audio frame of the first device to the second device, on the one hand, and on the other hand, the target device may also send a video synchronization parameter pair and an audio synchronization parameter pair of the first device to the second device.
When the second device receives the video synchronization parameter pair and the audio synchronization parameter pair of the first device, the video frame and the audio frame of the first device may be synchronized based on the video synchronization parameter pair and the audio synchronization parameter pair, and specific implementation thereof may refer to the related description above, which is not described herein again in this embodiment of the present application.
Further, during a video conference, a video conference device may request, as needed, the video data and audio data of a single other video conference device, or the video data and audio data of multiple other video conference devices; when the video data and audio data of multiple video conference devices are requested, it may also request that the video data or/and audio data of those video conference devices be synthesized (video data synthesized with video data, audio data synthesized with audio data).
In one example, the transmitting the video frame of the first device to the second device and the transmitting the video synchronization parameter pair of the first device to the second device may include:
when the acquisition request comprises a forwarding instruction for the video frame of the first device, sending the video frame of the first device to the second device, and sending the video synchronization parameter pair of the first device to the second device;
when the acquisition request comprises a synthesis instruction for the video frame of the first device, synthesizing the video frame of the first device with other video frames participating in synthesis to generate a composite video frame, and adding a new timestamp to the composite video frame;
and sending the composite video frame to the second device, replacing the timestamp included in the video synchronization parameter pair of the first device with the new timestamp, and sending the video synchronization parameter pair of the first device after the timestamp replacement to the second device.
In this example, when the target device receives an acquisition request for a video frame of the first device sent by the second device, and the acquisition request includes a forwarding instruction for the video frame of the first device, the target device may send the received video frame of the first device to the second device on the one hand, and may send a video synchronization parameter pair of the first device to the second device on the other hand.
It should be noted that, when the first device periodically sends the video synchronization parameter pair to the target device, the target device may also periodically send the video synchronization parameter pair of the first device to the second device.
For example, the period for the target device to send the video synchronization parameter pair of the first device to the second device may be the same as or different from the period for the first device to send the video synchronization parameter pair to the target device.
Illustratively, when the target device transmits the video synchronization parameter pair of the first device to the second device, it transmits the most recently received video synchronization parameter pair of the first device.
In this example, when the target device receives an acquisition request sent by the second device for a video frame of the first device, and the acquisition request includes a composition instruction for the video frame of the first device, the target device may combine the video frame of the first device with other video frames participating in composition, generate a composite video frame, and add a new timestamp to the composite video frame.
For example, the second device may indicate which devices' video frames need to participate in the composition by carrying the corresponding device identifiers in the acquisition request.
For example, the second device may carry the device identifiers of the first device, the third device, and the fourth device in the acquisition request, and when receiving the acquisition request, the target device may synthesize the received video frames of the first device, the third device, and the fourth device, that is, decode the received video frames of the first device, the third device, and the fourth device, respectively, then perform synthesis encoding, and package the synthesized frames, generate a new video frame (synthesized video frame), and stamp a new timestamp.
Considering that the timestamps of the synthesized frames change when the relay device performs composite encoding of the audio and video data, the lip sound synchronization effect cannot be guaranteed if lip sound synchronization is still performed according to the video synchronization parameter pair and the audio synchronization parameter pair provided by the source device of the audio and video data (the video conference device providing the audio and video data). Therefore, in a scenario where composite encoding exists (composite encoding of video data or/and composite encoding of audio data), the relay device may update the video synchronization parameter pair or/and the audio synchronization parameter pair provided by the source device of the audio and video data according to the new timestamps obtained after the composite encoding.
In addition, for audio and video data acquired by the same device at different moments, the absolute time corresponding to the difference between the new timestamps stamped by the relay device when performing composite encoding is generally consistent with the difference between the acquisition moments of the audio and video data.
For example, assuming that the audio data a1 and a2 are captured by device A at times T1 and T2, that the new timestamp stamped by the relay device (for example, device B) when composite-encoding the audio data a1 is TS3, and that the new timestamp stamped by the relay device when composite-encoding the audio data a2 is TS4, the absolute time corresponding to the difference between TS4 and TS3 is generally consistent with the difference between T2 and T1.
It should be noted that, in this embodiment of the present application, for each video conference device participating in the composite encoding, the relay device may update the video synchronization parameter pair and the audio synchronization parameter pair provided by that video conference device in the above manner, and when the device playing the audio and video data needs to perform lip sound synchronization, it may perform lip sound synchronization on the audio and video data provided by the corresponding video conference device according to the updated video synchronization parameter pair and audio synchronization parameter pair sent by the relay device.
For example, assuming that a relay device (e.g., device B) performs composite encoding on the video data of device A and device C, device B may respectively update the video synchronization parameter pair provided by device A (replacing the timestamp in the video synchronization parameter pair of device A with the new timestamp stamped after the composite encoding) and the video synchronization parameter pair provided by device C (replacing the timestamp in the video synchronization parameter pair of device C with the new timestamp stamped after the composite encoding).
When the device D receives the synthesized video frame provided by the device B, if lip synchronization needs to be performed on the audio and video data of the device a, the acquisition time of the video data of the device a may be determined based on the updated pair of video synchronization parameters of the device a; similarly, if lip synchronization needs to be performed on the audio and video data of the device C, the acquisition time of the video data of the device C may be determined based on the updated video synchronization parameter pair of the device C.
Accordingly, in this example, after the target device stamps a new timestamp on the composite video frame, the timestamp included in the video synchronization parameter pair of the first device may be replaced with the new timestamp, and the video synchronization parameter pair of the first device after the timestamp replacement may be sent to the second device.
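A minimal sketch of this update on the relay (hypothetical names and values; not the patent's implementation) is:

```python
def update_sync_pair(source_pair, new_timestamp):
    """The relay keeps the acquisition time provided by the source device and
    replaces the timestamp with the one stamped on the composite frame."""
    capture_time_ms, _old_timestamp = source_pair
    return (capture_time_ms, new_timestamp)

# Assumed values: the first device's pair (acquisition time 5000 ms, timestamp 90000)
# becomes (5000, 123456) after its video frame is composited and stamped with the
# new timestamp 123456.
print(update_sync_pair((5000, 90000), 123456))   # -> (5000, 123456)
```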
In an example, the transmitting the audio frame of the first device to the second device and the transmitting the audio synchronization parameter pair of the first device to the second device may include:
when the acquisition request comprises a forwarding instruction for the audio frame of the first device, sending the audio frame of the first device to the second device, and sending the audio synchronization parameter pair of the first device to the second device;
when the acquisition request comprises a synthesis instruction for the audio frame of the first device, synthesizing the audio frame of the first device with other audio frames participating in synthesis to generate a composite audio frame, and adding a new timestamp to the composite audio frame;
and sending the composite audio frame to the second device, replacing the timestamp included in the audio synchronization parameter pair of the first device with the new timestamp, and sending the audio synchronization parameter pair of the first device after the timestamp replacement to the second device.
For example, the specific implementation of the target device for synthesizing the audio frame may refer to the above description related to synthesizing the video frame, which is not described in this embodiment of the present application again.
Referring to fig. 3, as a possible implementation manner, the lip synchronization method may further include the following steps:
and step S300, collecting video data and audio data.
Step S310, generating a video frame based on the collected video data, and generating a video synchronization parameter pair based on the collection time of the video data and the timestamp of the video frame; and generating an audio frame based on the acquired audio data, and generating an audio synchronization parameter pair based on the acquisition time of the audio data and the time stamp of the audio frame.
Step S320, sending the video frame and the audio frame; and transmitting the video synchronization parameter pair and the audio synchronization parameter pair.
In this embodiment, in addition to acting as the requesting device for audio and video data described in fig. 1 and as the relay device for audio and video data described in fig. 2, the target device may also serve as a device that provides audio and video data to other video conference devices.
Illustratively, the target device may capture video data and audio data in real-time.
Take the processing of the captured video data by the target device as an example.
Illustratively, the target device may encode and encapsulate the captured video data and timestamp the captured video data to generate video frames.
The target device may generate a video synchronization parameter pair including a correspondence of the capture time and a timestamp based on the capture time of the video data and the timestamp in a video frame generated based on the video data.
Furthermore, the target device may send the video frame to other video conference devices on the one hand, and may send the video synchronization parameter pair to other video conference devices on the other hand.
It should be noted that, the target device does not need to generate a corresponding video synchronization parameter pair each time a video frame is generated, but may periodically generate a video synchronization parameter pair and periodically transmit the video synchronization parameter pair to other video conference devices.
For example, the period at which the target device generates the video synchronization parameter pair and the period at which it sends the video synchronization parameter pair may be the same or different.
In one example, the generating a video frame based on the collected video data and generating a video synchronization parameter pair based on the collection time of the video data and a timestamp of the video frame; and generating an audio frame based on the acquired audio data, and generating an audio synchronization parameter pair based on the acquisition time of the audio data and the time stamp of the audio frame, which may include:
generating video frames based on the acquired video data, and periodically generating video synchronization parameter pairs based on the acquisition time of the video data and the timestamps of the video frames; and generating audio frames based on the collected audio data, and periodically generating audio synchronization parameter pairs based on the acquisition time of the audio data and the timestamps of the audio frames.
For example, the period of generating the video synchronization parameter pair and the period of generating the audio synchronization parameter pair by the target device may be the same or different.
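For illustration, a rough sketch of steps S300 to S320 on the sending device is given below; the 90 kHz video clock, 48 kHz audio clock, helper names, and the choice of deriving timestamps from the capture clock are assumptions for the sketch, not requirements of the patent:

```python
import time

VIDEO_SYSTEM_CLOCK = 90000   # assumed video sampling frequency (Hz)
AUDIO_SYSTEM_CLOCK = 48000   # assumed audio sampling frequency (Hz)
PAIR_PERIOD_S = 1.0          # assumed period for sending synchronization parameter pairs

def now_ms() -> int:
    return int(time.time() * 1000)

def run_sender(capture_video, capture_audio, encode, send_frame, send_pair):
    """capture_*/encode/send_* are hypothetical callbacks standing in for the
    device's capture, encoding/encapsulation and network layers; the capture
    cadence is simplified for illustration."""
    base_ms = now_ms()
    last_pair = {"video": float("-inf"), "audio": float("-inf")}
    while True:
        for kind, capture, clock in (("video", capture_video, VIDEO_SYSTEM_CLOCK),
                                     ("audio", capture_audio, AUDIO_SYSTEM_CLOCK)):
            data = capture()                              # step S300: acquire data
            capture_ms = now_ms()
            # Timestamp in sampling-clock units, stamped when encoding/encapsulating.
            ts = (capture_ms - base_ms) * clock // 1000
            send_frame(kind, encode(kind, data, ts))      # step S320: send the frame
            # Steps S310/S320: the parameter pair is generated and sent only periodically.
            if time.monotonic() - last_pair[kind] >= PAIR_PERIOD_S:
                last_pair[kind] = time.monotonic()
                send_pair(kind, (capture_ms, ts))
```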
In order to enable those skilled in the art to better understand the technical solutions provided in the embodiments of the present application, the following describes the technical solutions provided in the embodiments of the present application with reference to specific application scenarios.
Referring to fig. 4, an architecture schematic diagram of a video conference system provided in an embodiment of the present application is shown in fig. 4, where the video conference system includes a video conference device 1, a video conference device 2, and a video conference device 3 (the video conference system may further include other video conference devices not shown in fig. 4), and the video conference device 1, the video conference device 2, and the video conference device 3 are sequentially connected in a cascade manner.
In this embodiment, the case where the video conference device 3 requests the audio and video data of the video conference device 1 is taken as an example.
First, audio and video frame synthesis scene
Assuming that a synthesis instruction is carried in the acquisition request, sent by the video conference device 3, for the video frame and the audio frame of the video conference device 1, requesting that the video frame of the video conference device 1 be synthesized with the video frame of a video conference device 4 (not shown in fig. 4) and that the audio frame of the video conference device 1 be synthesized with the audio frame of the video conference device 4, the specific implementation flow is as follows:
001. the video conference device 1 joins the video conference managed by the video conference device 2, collects audio data and video data in real time, encodes and encapsulates the audio data and the video data, and sends the audio data and the video data to the video conference device 2 through a network after a timestamp is printed on the audio data and the video data.
For example, assuming that the capture time of the video frame VD10 is NTP_V10 and the corresponding timestamp is TS_V10, the video conference device 1 sends the video synchronization parameter pair including the correspondence between NTP_V10 and TS_V10 to the video conference device 2 through signaling.
Illustratively, the video conference device 1 need not send this information to the video conference device 2 for each video frame, but rather sends it periodically, such as once every 1 second or 2 seconds.
For example, to improve the compatibility of the scheme, the video conference device 1 may send the video synchronization parameter pair through an RTP (Real-time Transport Protocol) Protocol or an RTCP (Real-time Transport Control Protocol) Protocol.
The video synchronization parameter pair is transmitted, for example, through SR (Sender Report) of RTCP protocol.
Similarly, assuming that the acquisition time of the audio frame AD10 is NTP_A10 and the corresponding timestamp is TS_A10, the video conference device 1 may send the audio synchronization parameter pair including the correspondence between NTP_A10 and TS_A10 to the video conference device 2 through signaling.
Illustratively, the video conference device 1 transmits the video synchronization parameter pairs at the same period as the audio synchronization parameter pairs.
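For illustration, one way the (NTP_V10, TS_V10) correspondence mentioned in 001 could be carried in the SR (Sender Report) of the RTCP protocol is sketched below; the packing helper is an assumption, not the patent's implementation, and only fills the NTP timestamp and RTP timestamp fields defined by RFC 3550:

```python
import struct

def build_rtcp_sr(ssrc, capture_time_ms, rtp_timestamp, pkt_count=0, octet_count=0):
    """Minimal RTCP Sender Report: the (NTP timestamp, RTP timestamp) fields play
    the role of the (acquisition time, timestamp) synchronization parameter pair."""
    ntp_sec = capture_time_ms // 1000 + 2208988800        # Unix epoch -> NTP epoch
    ntp_frac = int((capture_time_ms % 1000) / 1000 * (1 << 32))
    header = struct.pack("!BBH", 0x80, 200, 6)            # V=2, RC=0, PT=SR, length=6 words
    body = struct.pack("!IIIIII", ssrc, ntp_sec, ntp_frac,
                       rtp_timestamp, pkt_count, octet_count)
    return header + body
```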
002. When receiving the video frame VD10 sent by the video conference device 1, the video conference device 2 decodes the video frame VD10, performs composite encoding with the (decoded) video frame sent by the video conference device 4, repackages the result to generate a synthesized video frame VD11, and stamps it with a new timestamp TS_V11.
Illustratively, when generating the synthesized video frame VD11, the video conference device 2 may record the change of timestamp from the video frame VD10 to the synthesized video frame VD11, and update the correspondence between NTP_V10 and TS_V10 included in the video synchronization parameter pair of the video conference device 1 to the correspondence between NTP_V10 and TS_V11.
003. Similarly, when receiving the audio frame AD10 sent by the video conference device 1, the video conference device 2 decodes the audio frame AD10, performs composite encoding with the (decoded) audio frame sent by the video conference device 4, repackages the result to generate a synthesized audio frame AD11, and stamps it with a new timestamp TS_A11.
Illustratively, when generating the synthesized audio frame AD11, the video conference device 2 may record the change of timestamp from the audio frame AD10 to the synthesized audio frame AD11, and update the correspondence between NTP_A10 and TS_A10 included in the audio synchronization parameter pair of the video conference device 1 to the correspondence between NTP_A10 and TS_A11.
The video conference device 2 may send the composite video frame VD11 and the composite audio frame AD11 to the video conference device 3, and send the updated video synchronization parameter pair and audio synchronization parameter pair to the video conference device 3, which may be schematically illustrated in fig. 5A.
004. When receiving the synthesized video frame VD11 and the synthesized audio frame AD11, the video conference device 3 determines the capturing time of the video frame and the capturing time of the audio frame of the video conference device 1 according to the video synchronization parameter pair (NTP_V10, TS_V11) and the audio synchronization parameter pair (NTP_A10, TS_A11) of the device 1 that are received most recently.
For example, if the video synchronization parameter pair for the video conference device 1 most recently received by the video conference device 3 is (NTP_V10, TS_V11), and the timestamp of the currently received video frame of the video conference device 1 is TS_V1N, the capture time of the currently received video frame of the video conference device 1 is:
NTP_V = NTP_V10 + 1000*(TS_V1N - TS_V11)/VIDEO_SYSTEM_CLOCK
Similarly, if the audio synchronization parameter pair for the video conference device 1 most recently received by the video conference device 3 is (NTP_A10, TS_A11), and the timestamp of the currently received audio frame of the video conference device 1 is TS_A1N, the capture time of the currently received audio frame of the video conference device 1 is:
NTP_A = NTP_A10 + 1000*(TS_A1N - TS_A11)/AUDIO_SYSTEM_CLOCK
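As a purely illustrative numeric check (the clock rates and all values below are assumptions, not taken from the patent), frames of the two streams that were captured at the same moment come out with equal acquisition times even though their timestamps use different clocks:

```python
# Assumed values: a 90 kHz video clock and a 48 kHz audio clock, with the
# synchronization parameter pairs of video conference device 1 as updated by device 2.
VIDEO_SYSTEM_CLOCK, AUDIO_SYSTEM_CLOCK = 90000, 48000

NTP_V10, TS_V11 = 5000, 90000       # video pair: acquisition time (ms), timestamp
NTP_A10, TS_A11 = 5000, 48000       # audio pair: acquisition time (ms), timestamp

TS_V1N, TS_A1N = 180000, 96000      # currently received video / audio frame timestamps

NTP_V = NTP_V10 + 1000 * (TS_V1N - TS_V11) / VIDEO_SYSTEM_CLOCK   # -> 6000.0 ms
NTP_A = NTP_A10 + 1000 * (TS_A1N - TS_A11) / AUDIO_SYSTEM_CLOCK   # -> 6000.0 ms

# NTP_V == NTP_A, so these two frames were captured at the same moment and can be
# composed and rendered together in step 005.
```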
005. The video conference device 3 dynamically adjusts the size of the video frame buffer space (VideoBuff) and the size of the audio frame buffer space (AudioBuff) for the video conference device 1 based on the video frame acquisition time (denoted NTP_V) and the audio frame acquisition time (denoted NTP_A) of the video conference device 1; a schematic diagram may be as shown in fig. 5B. When synthesizing and rendering the audio and video of the video conference device 1, it is necessary to ensure that the NTP_V of the video frame and the NTP_A of the audio frame participating in the synthesis and rendering are consistent, so as to achieve lip sound synchronization.
Second, audio and video frame forwarding scene
Assuming that a forwarding instruction is carried in an acquisition request for a video frame and an audio frame of the video conference device 1 sent by the video conference device 3, the video frame and the audio frame of the video conference device 1 are requested, and the specific implementation flow is as follows:
006. the same as 001.
007. When receiving the video frame VD10 sent by the video conference device 1, the video conference device 2 sends the video frame VD10 to the video conference device 3; upon receiving the video synchronization parameter pair (NTP_V10, TS_V10) transmitted by the video conference device 1, the video conference device 2 transmits the video synchronization parameter pair to the video conference device 3.
008. Similarly, when receiving the audio frame AD10 sent by the video conference device 1, the video conference device 2 sends the audio frame AD10 to the video conference device 3; upon receiving the audio synchronization parameter pair (NTP_A10, TS_A10) transmitted by the video conference device 1, the video conference device 2 transmits the audio synchronization parameter pair to the video conference device 3.
As an example, a schematic diagram of the video conference device 2 forwarding the video frames and the audio frames of the video conference device 1 may be as shown in fig. 5C.
009. When receiving the video frame VD10 and the audio frame AD10, the video conference device 3 determines the capturing time of the video frame and the capturing time of the audio frame of the video conference device 1 according to the video synchronization parameter pair (NTP_V10, TS_V10) and the audio synchronization parameter pair (NTP_A10, TS_A10) of the device 1 that are received most recently.
010. The same as 005.
Third, video frame synthesis and audio frame forwarding scene
Assuming that a synthesis instruction is carried in the acquisition request, sent by the video conference device 3, for the video frame of the video conference device 1, requesting that the video frame of the video conference device 1 be synthesized with the video frame of the video conference device 4, and that a forwarding instruction is carried in the acquisition request for the audio frame of the video conference device 1, requesting the audio frame of the video conference device 1, the specific implementation flow is as follows:
011. the same as 001.
012. The same as 002.
013. The same as 008.
014. When receiving the composite video frame VD11 and the audio frame AD10, the video conference device 3 determines the capturing time of the video frame and the capturing time of the audio frame of the video conference device 1 according to the video synchronization parameter pair (NTP_V10, TS_V11) and the audio synchronization parameter pair (NTP_A10, TS_A10) of the device 1 that are received most recently.
For example, a schematic diagram of the video conference device 2 synthesizing video frames of the video conference device 1 and forwarding audio frames of the video conference device 1 may be as shown in fig. 5D.
015. The same as 005.
It should be noted that for the lip sound synchronization process in the scenes of audio frame synthesis and video frame forwarding, reference may be made to the relevant description in the above process, and details are not repeated herein in the embodiments of the present application.
In the embodiment of the application, a video frame and an audio frame of a first device are received; receiving a video synchronization parameter pair and an audio synchronization parameter pair of the first device, and determining the acquisition time of a video frame of the first device based on the video synchronization parameter pair of the first device; determining the acquisition time of the audio frame of the first equipment based on the audio synchronization parameter pair of the first equipment; furthermore, the video frame and the audio frame of the first device are synchronized based on the acquisition time of the video frame and the acquisition time of the audio frame of the first device, so that lip sound synchronization in the video conference is realized.
The methods provided herein are described above. The following describes the apparatus provided in the present application:
referring to fig. 6, which is a schematic structural diagram of a lip sound synchronizer provided in an embodiment of the present application, as shown in fig. 6, the lip sound synchronizer may include:
a receiving unit 610 for receiving a video frame and an audio frame of a first device; and receiving a video synchronization parameter pair and an audio synchronization parameter pair of the first device; the video synchronization parameter pair comprises a corresponding relation between the acquisition time of video data and a time stamp of a video frame, and the audio synchronization parameter pair comprises a corresponding relation between the acquisition time of audio data and a time stamp of an audio frame;
a determining unit 620, configured to determine, based on the video synchronization parameter pair of the first device, a capture time of a video frame of the first device; determining the acquisition time of the audio frame of the first device based on the audio synchronization parameter pair of the first device;
a processing unit 630, configured to synchronize the video frame and the audio frame of the first device based on the capturing time of the video frame and the capturing time of the audio frame of the first device.
As a possible implementation manner, the determining unit 620 is specifically configured to, for a received target video frame of the first device, determine a capture time of the target video frame according to the following formula:
NTP_VN=NTP_V0+1000*(TS_VN-TS_V0)/VIDEO_SYSTEM_CLOCK
wherein NTP_V0 is the capture time included in the video synchronization parameter pair of the first device, TS_V0 is the timestamp included in the video synchronization parameter pair of the first device, NTP_VN is the capture time of the target video frame, TS_VN is the timestamp of the target video frame, and VIDEO_SYSTEM_CLOCK is the video sampling frequency of the first device;
and/or,
the determining unit 620 is specifically configured to determine, for a received target audio frame of the first device, a capture time of the target audio frame according to the following formula:
NTP_AN=NTP_A0+1000*(TS_AN-TS_A0)/AUDIO_SYSTEM_CLOCK
the NTP _ a0 is a collection time included in the AUDIO synchronization parameter pair of the first device, the TS _ a0 is a timestamp included in the AUDIO synchronization parameter pair of the first device, the NTP _ AN is a collection time of the target AUDIO frame, the TS _ VN is a timestamp of the target AUDIO frame, and the AUDIO _ SYSTEM _ CLOCK is AN AUDIO sampling frequency of the first device.
As a possible implementation manner, the processing unit 630 is specifically configured to adjust the size of the video frame buffer space and the size of the audio frame buffer space for the first device according to the capturing time of the video frame and the capturing time of the audio frame of the first device;
and to cache the video frame and the audio frame of the first device, so that when the video frames and audio frames of the first device are synthesized and rendered, the frames participating in synthesis and rendering have consistent acquisition times.
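As a rough illustration only (the application does not mandate a specific buffering policy), the buffer handling described above could be realized along the following lines; the tolerance value is an assumption, and the buffers are deques of (capture time, frame) tuples as in the earlier sketch.

```python
# Illustrative sketch: pop one video frame and one audio frame whose capture times
# agree within a tolerance, dropping whichever stream has run ahead of the other.
SYNC_TOLERANCE_MS = 40  # assumed value, roughly one video frame at 25 fps

def pop_synchronized(video_buf, audio_buf):
    while video_buf and audio_buf:
        vt, vframe = video_buf[0]
        at, aframe = audio_buf[0]
        if abs(vt - at) <= SYNC_TOLERANCE_MS:
            video_buf.popleft()
            audio_buf.popleft()
            return vframe, aframe            # consistent capture times: render together
        if vt < at:
            video_buf.popleft()              # video head is too old relative to audio
        else:
            audio_buf.popleft()              # audio head is too old relative to video
    return None                              # wait for more data before rendering
```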
As a possible implementation, as shown in fig. 7, the apparatus further includes:
a first sending unit 640, configured to send a video frame and an audio frame of a first device to a second device when the receiving unit 610 receives an acquisition request, sent by the second device, for the video frame and the audio frame of the first device; and to send the video synchronization parameter pair and the audio synchronization parameter pair of the first device to the second device, so that the second device synchronizes the video frame and the audio frame of the first device based on the video synchronization parameter pair and the audio synchronization parameter pair of the first device.
As a possible implementation manner, the first sending unit 640 is specifically configured to, when the acquisition request includes a forwarding instruction for a video frame of the first device, send the video frame of the first device to the second device, and send the video synchronization parameter pair of the first device to the second device;
the processing unit 630 is further configured to, when the acquisition request includes a synthesis instruction for the video frame of the first device, synthesize the video frame of the first device with other video frames participating in synthesis to generate a composite video frame, add a new timestamp to the composite video frame, send the composite video frame to the second device, and replace the timestamp included in the video synchronization parameter pair of the first device with the new timestamp;
the first sending unit 640 is further configured to send the video synchronization parameter pair of the first device after the timestamp replacement to the second device.
As a possible implementation manner, the first sending unit 640 is specifically configured to, when the acquisition request includes a forwarding instruction for an audio frame of the first device, send the audio frame of the first device to the second device, and send the audio synchronization parameter pair of the first device to the second device;
the processing unit 630 is further configured to, when the acquisition request includes a synthesis instruction for the audio frame of the first device, synthesize the audio frame of the first device with other audio frames participating in synthesis to generate a synthesized audio frame, add a new timestamp to the synthesized audio frame, send the synthesized audio frame to the second device, and replace the timestamp included in the audio synchronization parameter pair of the first device with the new timestamp;
the first sending unit 640 is further configured to send the audio synchronization parameter pair of the first device after the timestamp replacement to the second device.
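The timestamp replacement for the synthesis case can be pictured with the following sketch; mix_video stands in for whatever compositor is actually used, and the dictionary layout of the synchronization parameter pair is an assumption of this illustration rather than part of the application.

```python
# Illustrative sketch: the composite frame receives a new timestamp, and only the
# timestamp half of the forwarded synchronization parameter pair is replaced, so a
# receiver can still map the new timestamp back to the original capture time.
def compose_and_restamp(frames, sync_pair, new_timestamp, mix_video):
    """frames: frames to synthesize; sync_pair: {'capture_ms': ..., 'timestamp': ...};
    mix_video: caller-supplied compositor (placeholder, not defined here)."""
    composite = mix_video(frames)
    composite["timestamp"] = new_timestamp
    updated_pair = {"capture_ms": sync_pair["capture_ms"], "timestamp": new_timestamp}
    return composite, updated_pair
```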
As a possible implementation, as shown in fig. 8, the apparatus further includes:
a collecting unit 650 for collecting video data and audio data;
a generating unit 660, configured to generate a video frame based on the acquired video data, and generate a video synchronization parameter pair based on the acquisition time of the video data and a timestamp of the video frame; generating an audio frame based on the collected audio data, and generating an audio synchronization parameter pair based on the collection time of the audio data and the time stamp of the audio frame;
a second transmitting unit 670 for transmitting the video frame and the audio frame; and transmitting the video synchronization parameter pair and the audio synchronization parameter pair.
As a possible implementation manner, the generating unit 660 is specifically configured to periodically generate a video frame based on the acquired video data, and generate a video synchronization parameter pair based on the acquisition time of the video data and a timestamp of the video frame; periodically generating an audio frame based on the acquired audio data, and generating an audio synchronization parameter pair based on the acquisition time of the audio data and the time stamp of the audio frame;
the second sending unit 670 is specifically configured to send the video synchronization parameter pair periodically, and send the audio synchronization parameter pair periodically.
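On the sending side, the periodic generation and transmission of synchronization parameter pairs could be sketched as follows; capture_video_frame and send are assumed stand-ins for the device's encoder and transport, not APIs defined by this application.

```python
# Illustrative sketch of the sending side for the video stream: every frame is sent
# as usual, while a video synchronization parameter pair is emitted once per period.
import time

def video_sender_loop(capture_video_frame, send, period_s=1.0):
    next_pair_at = 0.0
    while True:
        capture_ms = int(time.time() * 1000)   # acquisition time of the raw video data
        frame = capture_video_frame()          # assumed to encode, encapsulate and stamp the frame
        send("video_frame", frame)
        if time.time() >= next_pair_at:
            # pair the acquisition time with the timestamp written at encoding time
            send("video_sync_pair", {"capture_ms": capture_ms,
                                     "timestamp": frame["timestamp"]})
            next_pair_at = time.time() + period_s
```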
Fig. 9 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. The electronic device may include a processor 901, a communication interface 902, a memory 903, and a communication bus 904. The processor 901, the communication interface 902, and the memory 903 communicate with each other via the communication bus 904. The memory 903 stores a computer program, and the processor 901 can execute the lip sound synchronization method described above by executing the computer program stored in the memory 903.
The memory 903 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the memory 903 may be: RAM, volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard disk drive), a solid state drive, any type of storage disk (e.g., an optical disk or DVD), a similar storage medium, or a combination thereof.
An embodiment of the present application further provides a machine-readable storage medium, such as the memory 903 in fig. 9, storing a computer program, where the computer program is executable by the processor 901 in the electronic device shown in fig. 9 to implement the lip synchronization method described above.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims (16)

1. A lip sound synchronization method is characterized by comprising the following steps:
receiving a video frame and an audio frame of a first device; and receiving a video synchronization parameter pair and an audio synchronization parameter pair of the first device; the video synchronization parameter pair comprises a corresponding relation between the acquisition time of video data and a time stamp of a video frame, and the audio synchronization parameter pair comprises a corresponding relation between the acquisition time of audio data and a time stamp of an audio frame; the video synchronization parameter pair is generated by the first device based on the acquisition time of the video data and the timestamp added when the video data is encoded and encapsulated; the audio synchronization parameter pair is generated by the first device based on the acquisition time of the audio data and the timestamp added when the audio data is encoded and encapsulated;
determining the acquisition time of the video frame of the first device based on the video synchronization parameter pair of the first device; determining the acquisition time of the audio frame of the first equipment based on the audio synchronization parameter pair of the first equipment;
and synchronizing the video frame and the audio frame of the first device based on the acquisition time of the video frame and the acquisition time of the audio frame of the first device.
2. The method of claim 1, wherein determining the capture time of the video frame of the first device based on the video synchronization parameter pair of the first device comprises:
for a received target video frame of a first device, determining a capture moment of the target video frame by the following formula:
NTP_VN=NTP_V0+1000*(TS_VN-TS_V0)/VIDEO_SYSTEM_CLOCK
wherein NTP_V0 is the capture time included in the video synchronization parameter pair of the first device, TS_V0 is the timestamp included in the video synchronization parameter pair of the first device, NTP_VN is the capture time of the target video frame, TS_VN is the timestamp of the target video frame, and VIDEO_SYSTEM_CLOCK is the video sampling frequency of the first device;
and/or,
the determining the acquisition time of the audio frame of the first device based on the audio synchronization parameter pair of the first device includes:
for a received target audio frame of a first device, determining a capture time of the target audio frame by the following formula:
NTP_AN=NTP_A0+1000*(TS_AN-TS_A0)/AUDIO_SYSTEM_CLOCK
the NTP _ a0 is a collection time included in the AUDIO synchronization parameter pair of the first device, the TS _ a0 is a timestamp included in the AUDIO synchronization parameter pair of the first device, the NTP _ AN is a collection time of the target AUDIO frame, the TS _ VN is a timestamp of the target AUDIO frame, and the AUDIO _ SYSTEM _ CLOCK is AN AUDIO sampling frequency of the first device.
3. The method of claim 1, wherein synchronizing the video frames and the audio frames of the first device based on the capturing time of the video frames and the capturing time of the audio frames of the first device comprises:
adjusting the size of a video frame buffer space and the size of an audio frame buffer space aiming at the first equipment according to the acquisition time of the video frame and the acquisition time of the audio frame of the first equipment;
and caching the video frame and the audio frame of the first device, so that when the video frames and audio frames of the first device are synthesized and rendered, the frames participating in synthesis and rendering have consistent acquisition times.
4. The method according to any one of claims 1-3, further comprising:
when receiving an acquisition request aiming at a video frame and an audio frame of a first device, which is sent by a second device, sending the video frame and the audio frame of the first device to the second device; and sending the video synchronization parameter pair and the audio synchronization parameter pair of the first device to the second device, so that the second device synchronizes the video frame and the audio frame of the first device based on the video synchronization parameter pair and the audio synchronization parameter pair of the first device.
5. The method of claim 4, wherein sending the video frame of the first device to the second device and sending the video synchronization parameter pair of the first device to the second device comprises:
when the acquisition request comprises a forwarding instruction for the video frame of the first device, sending the video frame of the first device to the second device, and sending the video synchronization parameter pair of the first device to the second device;
when the acquisition request comprises a synthesis instruction of the video frame of the first device, synthesizing the video frame of the first device with other video frames participating in synthesis to generate a synthesized video frame, and adding a new timestamp for the synthesized video frame;
and sending the synthesized video frame to the second device, replacing the timestamp included in the video synchronization parameter pair of the first device with the new timestamp, and sending the video synchronization parameter pair of the first device after the timestamp replacement to the second device.
6. The method of claim 4, wherein sending the audio frame of the first device to the second device and sending the audio synchronization parameter pair of the first device to the second device comprises:
when the acquisition request comprises a forwarding instruction for the audio frame of the first device, sending the audio frame of the first device to the second device, and sending the audio synchronization parameter pair of the first device to the second device;
when the acquisition request comprises a synthesis instruction for the audio frame of the first device, synthesizing the audio frame of the first device with other audio frames participating in synthesis to generate a synthesized audio frame, and adding a new time stamp for the synthesized audio frame;
and sending the synthesized audio frame to the second device, replacing the timestamp included in the audio synchronization parameter pair of the first device with the new timestamp, and sending the audio synchronization parameter pair of the first device after the timestamp replacement to the second device.
7. The method according to any one of claims 1-3, further comprising:
collecting video data and audio data;
generating a video frame based on the acquired video data, and generating a video synchronization parameter pair based on the acquisition time of the video data and the timestamp of the video frame; generating an audio frame based on the acquired audio data, and generating an audio synchronization parameter pair based on the acquisition time of the audio data and the time stamp of the audio frame;
transmitting the video frame and the audio frame; and transmitting the video synchronization parameter pair and the audio synchronization parameter pair.
8. The method of claim 7, wherein the generating a video frame based on the acquired video data and generating a video synchronization parameter pair based on the acquisition time of the video data and the timestamp of the video frame, and the generating an audio frame based on the acquired audio data and generating an audio synchronization parameter pair based on the acquisition time of the audio data and the time stamp of the audio frame, comprise:
periodically generating video frames based on the acquired video data, and generating video synchronization parameter pairs based on the acquisition time of the video data and the time stamps of the video frames; periodically generating an audio frame based on the acquired audio data, and generating an audio synchronization parameter pair based on the acquisition time of the audio data and the time stamp of the audio frame;
the sending the video synchronization parameter pair and the audio synchronization parameter pair includes:
the video synchronization parameter pair is transmitted periodically and the audio synchronization parameter pair is transmitted periodically.
9. A lip sound synchronizer, comprising:
a receiving unit for receiving a video frame and an audio frame of a first device; and receiving a video synchronization parameter pair and an audio synchronization parameter pair of the first device; the video synchronization parameter pair comprises a corresponding relation between the acquisition time of video data and a time stamp of a video frame, and the audio synchronization parameter pair comprises a corresponding relation between the acquisition time of audio data and a time stamp of an audio frame; the video synchronization parameter pair is generated by the first device based on the acquisition time of the video data and the timestamp added when the video data is encoded and encapsulated; the audio synchronization parameter pair is generated by the first device based on the acquisition time of the audio data and the timestamp added when the audio data is encoded and encapsulated;
the determining unit is used for determining the acquisition time of the video frame of the first equipment based on the video synchronization parameter pair of the first equipment; determining the acquisition time of the audio frame of the first device based on the audio synchronization parameter pair of the first device;
and the processing unit is used for synchronizing the video frame and the audio frame of the first equipment based on the acquisition time of the video frame and the acquisition time of the audio frame of the first equipment.
10. The apparatus of claim 9,
the determining unit is specifically configured to determine, for a received target video frame of the first device, a capture time of the target video frame according to the following formula:
NTP_VN=NTP_V0+1000*(TS_VN-TS_V0)/VIDEO_SYSTEM_CLOCK
wherein NTP_V0 is the capture time included in the video synchronization parameter pair of the first device, TS_V0 is the timestamp included in the video synchronization parameter pair of the first device, NTP_VN is the capture time of the target video frame, TS_VN is the timestamp of the target video frame, and VIDEO_SYSTEM_CLOCK is the video sampling frequency of the first device;
and/or,
the determining unit is specifically configured to determine, for a received target audio frame of the first device, a capture time of the target audio frame according to the following formula:
NTP_AN=NTP_A0+1000*(TS_AN-TS_A0)/AUDIO_SYSTEM_CLOCK
the NTP _ a0 is a collection time included in the AUDIO synchronization parameter pair of the first device, the TS _ a0 is a timestamp included in the AUDIO synchronization parameter pair of the first device, the NTP _ AN is a collection time of the target AUDIO frame, the TS _ VN is a timestamp of the target AUDIO frame, and the AUDIO _ SYSTEM _ CLOCK is AN AUDIO sampling frequency of the first device.
11. The apparatus of claim 9,
the processing unit is specifically configured to adjust a video frame buffer space size and an audio frame buffer space size for the first device according to the acquisition time of the video frame and the acquisition time of the audio frame of the first device;
and caching the video frame and the audio frame of the first device, so that when the video frames and audio frames of the first device are synthesized and rendered, the frames participating in synthesis and rendering have consistent acquisition times.
12. The apparatus according to any one of claims 9-11, further comprising:
the first sending unit is used for sending the video frame and the audio frame of the first equipment to the second equipment when the receiving unit receives an acquisition request aiming at the video frame and the audio frame of the first equipment, which is sent by the second equipment; and sending the video synchronization parameter pair and the audio synchronization parameter pair of the first device to the second device, so that the second device synchronizes the video frame and the audio frame of the first device based on the video synchronization parameter pair and the audio synchronization parameter pair of the first device.
13. The apparatus of claim 12,
the first sending unit is specifically configured to send the video frame of the first device to the second device and send the video synchronization parameter pair of the first device to the second device when the acquisition request includes a forwarding instruction for the video frame of the first device;
the processing unit is further configured to, when the acquisition request includes a synthesis instruction for the video frame of the first device, synthesize the video frame of the first device with other video frames participating in synthesis to generate a synthesized video frame, add a new timestamp to the synthesized video frame, send the synthesized video frame to the second device, and replace the timestamp included in the video synchronization parameter pair of the first device with the new timestamp;
the first sending unit is further configured to send the video synchronization parameter pair of the first device after the timestamp replacement to the second device.
14. The apparatus of claim 12,
the first sending unit is specifically configured to send the audio frame of the first device to the second device and send the audio synchronization parameter pair of the first device to the second device when the acquisition request includes a forwarding instruction for the audio frame of the first device;
the processing unit is further configured to, when the acquisition request includes a synthesis instruction for the audio frame of the first device, synthesize the audio frame of the first device with other audio frames participating in synthesis to generate a synthesized audio frame, add a new timestamp to the synthesized audio frame, send the synthesized audio frame to the second device, and replace the timestamp included in the audio synchronization parameter pair of the first device with the new timestamp;
the first sending unit is further configured to send the audio synchronization parameter pair of the first device after the timestamp replacement to the second device.
15. The apparatus according to any one of claims 9-11, further comprising:
the acquisition unit is used for acquiring video data and audio data;
the generating unit is used for generating a video frame based on the acquired video data and generating a video synchronization parameter pair based on the acquisition time of the video data and the timestamp of the video frame; generating an audio frame based on the acquired audio data, and generating an audio synchronization parameter pair based on the acquisition time of the audio data and the time stamp of the audio frame;
a second transmitting unit for transmitting the video frame and the audio frame; and transmitting the video synchronization parameter pair and the audio synchronization parameter pair.
16. The apparatus of claim 15,
the generating unit is specifically configured to periodically generate a video frame based on the acquired video data, and generate a video synchronization parameter pair based on the acquisition time of the video data and a timestamp of the video frame; periodically generating an audio frame based on the acquired audio data, and generating an audio synchronization parameter pair based on the acquisition time of the audio data and the time stamp of the audio frame;
the second sending unit is specifically configured to send the video synchronization parameter pair periodically, and send the audio synchronization parameter pair periodically.
CN201910937097.1A 2019-09-29 2019-09-29 Lip sound synchronization method and device Active CN112584216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910937097.1A CN112584216B (en) 2019-09-29 2019-09-29 Lip sound synchronization method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910937097.1A CN112584216B (en) 2019-09-29 2019-09-29 Lip sound synchronization method and device

Publications (2)

Publication Number Publication Date
CN112584216A CN112584216A (en) 2021-03-30
CN112584216B true CN112584216B (en) 2022-09-30

Family

ID=75111177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910937097.1A Active CN112584216B (en) 2019-09-29 2019-09-29 Lip sound synchronization method and device

Country Status (1)

Country Link
CN (1) CN112584216B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113596549B (en) * 2020-10-13 2023-09-22 杭州涂鸦信息技术有限公司 Audio and video synchronization method and device based on different reference clocks and computer equipment
CN114007064B (en) * 2021-11-01 2023-03-21 腾讯科技(深圳)有限公司 Special effect synchronous evaluation method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102014262A (en) * 2010-10-27 2011-04-13 杭州海康威视软件有限公司 Hard disk video recorder and system and method for converting multimedia formats
CN103414957A (en) * 2013-07-30 2013-11-27 广东工业大学 Method and device for synchronization of audio data and video data
CN104581202B (en) * 2013-10-25 2018-04-27 腾讯科技(北京)有限公司 Audio and video synchronization method and system and encoding apparatus and decoding apparatus
CN105516090B (en) * 2015-11-27 2019-01-22 刘军 Media playing method, equipment and music lesson system
CN109348247B (en) * 2018-11-23 2021-03-30 广州酷狗计算机科技有限公司 Method and device for determining audio and video playing time stamp and storage medium

Also Published As

Publication number Publication date
CN112584216A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
US8208460B2 (en) Method and system for in-band signaling of multiple media streams
US7843974B2 (en) Audio and video synchronization
EP1398931B1 (en) Synchronous play-out of media data packets
CN108206833B (en) Audio and video data transmission method and system
CN103210656B (en) Image dispensing device, image sending method, video receiver and image method of reseptance
CN101827271B (en) Audio and video synchronized method and device as well as data receiving terminal
EP2728830A1 (en) Method and system for synchronizing audio and video streams in media relay conferencing
JP2002344880A (en) Contents distribution system
CN107005729A (en) The coffret transmitted for multimedia and file
US10334293B2 (en) Multiplexing apparatus, receiving apparatus, multiplexing method, and delay adjustment method
CN103546662A (en) Audio and video synchronizing method in network monitoring system
CN112584216B (en) Lip sound synchronization method and device
US10554704B2 (en) Low latency media mixing in packet networks
JP2004509491A (en) Synchronization of audio and video signals
CN106791271B (en) A kind of audio and video synchronization method
CN109565466A (en) More equipment room labial synchronization method and apparatus
US11553025B2 (en) System and method for interleaved media communication and conversion
KR20110098830A (en) Method for transport stream synchronizing in a multiplexer comprising an external coprocessor
JP2013058986A (en) Communication system, transmission device, reception device, transmission method, reception method, and program
JP2018074480A (en) Reception terminal and program
JP2020005063A (en) Processing device and control method thereof, output device, synchronization control system, and program
CN112564837B (en) Multi-path data flow synchronization method and multi-path data flow synchronization step-by-step transmission system
JP2007020095A (en) Information combination apparatus, information combination system, information synchronizing method and program
JP2015046708A (en) Communication system, communication method, transmission-side synchronous signal distribution device, transmission-side synchronous control device, reception-side synchronous signal distribution device, reception-side synchronous control device and program
CN115174979B (en) Streaming media transmission network, transmission control method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant