CN116546251A - Audio/video synchronization method, device, electronic equipment, medium and program product - Google Patents


Info

Publication number
CN116546251A
Authority
CN
China
Prior art keywords
packet
audio
video
buffer
time stamp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310446047.XA
Other languages
Chinese (zh)
Inventor
沈磊
章军海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310446047.XA priority Critical patent/CN116546251A/en
Publication of CN116546251A publication Critical patent/CN116546251A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43072Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/433Content storage operation, e.g. storage operation in response to a pause request, caching operations
    • H04N21/4331Caching operations, e.g. of an advertisement for later insertion during playback
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8547Content authoring involving timestamps for synchronizing content

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Embodiments of this specification disclose an audio/video synchronization method, apparatus, electronic device, medium, and program product. The method includes the following steps: acquiring audio/video data, where the audio/video data includes audio packets and video packets; updating the display timestamps corresponding to the audio packets and the video packets respectively by using absolute timestamps; inserting the audio packets into an audio buffer and the video packets into a video buffer according to a first preset rule; fetching a target audio packet from the audio buffer according to a second preset rule; and fetching the corresponding target video packet from the video buffer based on the absolute timestamp of the target audio packet.

Description

Audio/video synchronization method, device, electronic equipment, medium and program product
Technical Field
The present disclosure relates to the field of computer technology, and in particular to an audio/video synchronization method, apparatus, electronic device, medium, and program product.
Background
With the rapid development of real-time audio and video applications such as video chat and video conferencing, demand for real-time audio and video has grown greatly. UDP is generally used to obtain faster transmission speeds. However, because UDP transmission is unreliable, the audio and video data are prone to packet loss and disorder, so packet-loss retransmission and reordering logic must be added when UDP is used.
At present, there are two main audio/video synchronization schemes for real-time audio and video: one extracts the NetEQ module of WebRTC to ensure that audio is output at a stable frame rate, and then aligns video to audio to achieve real-time audio/video synchronization; the other extends the RTP protocol header fields by adding an NTP timestamp and aligns on absolute timestamps at the receiving end.
Therefore, there is a need for an audio/video synchronization scheme that is simpler to implement and offers better compatibility.
Disclosure of Invention
Embodiments of this specification provide an audio/video synchronization method, apparatus, electronic device, medium, and program product, which achieve audio/video synchronization through an audio buffer and a video buffer and by aligning video to audio, so that synchronization is simpler to implement and more compatible. The technical scheme is as follows:
in a first aspect, an embodiment of the present disclosure provides an audio and video synchronization method, including:
acquiring audio and video data; the audio and video data include an audio packet and a video packet;
updating display timestamps corresponding to the audio packet and the video packet respectively by using absolute timestamps;
inserting the audio packet into an audio buffer and inserting the video packet into a video buffer according to a first preset rule;
fetching a target audio packet from the audio buffer according to a second preset rule;
and fetching a corresponding target video packet from the video buffer based on the absolute timestamp of the target audio packet.
In one possible implementation manner, before updating the display time stamps corresponding to the audio packets and the video packets with absolute time stamps, the method further includes:
acquiring a sender report (SR) packet corresponding to the audio and video data;
the updating of the display timestamps corresponding to the audio packet and the video packet by using absolute timestamps includes:
updating the display timestamps corresponding to the audio packet and the video packet respectively by using absolute timestamps based on the SR packet.
In one possible implementation, the SR packet includes a first absolute timestamp and a first display timestamp corresponding to audio, and a second absolute timestamp and a second display timestamp corresponding to video;
the updating of the display time stamp corresponding to each of the audio packet and the video packet by using the absolute time stamp based on the SR packet includes:
determining a first correspondence between an absolute timestamp corresponding to the audio packet and a display timestamp corresponding to the audio packet based on the first absolute timestamp and the first display timestamp corresponding to the audio, and determining a second correspondence between an absolute timestamp corresponding to the video packet and a display timestamp corresponding to the video packet based on the second absolute timestamp and the second display timestamp corresponding to the video;
and updating the display time stamp corresponding to the audio packet by using the absolute time stamp corresponding to the audio packet based on the first corresponding relation, and updating the display time stamp corresponding to the video packet by using the absolute time stamp corresponding to the video packet based on the second corresponding relation.
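As a concrete illustration of the correspondence described above, the sketch below maps a packet's display timestamp to an absolute time using the (absolute, display) anchor pair carried in the SR packet. This is a hedged sketch, not the patent's implementation: the function name, the millisecond units, and the 48 kHz / 90 kHz clock rates are assumptions.

```python
def absolute_timestamp(pts, sr_pts, sr_ntp_ms, clock_rate):
    """Map a display timestamp (in clock ticks) to absolute time in ms.

    (sr_pts, sr_ntp_ms) is the anchor pair taken from the SR packet; the
    correspondence is linear, so any packet's absolute time is the anchor
    plus the display-timestamp delta converted from ticks to milliseconds.
    """
    return sr_ntp_ms + (pts - sr_pts) * 1000 // clock_rate

# Audio and video keep separate anchors: audio packets use the audio pair
# (first correspondence), video packets the video pair (second correspondence).
audio_abs = absolute_timestamp(48480, sr_pts=48000, sr_ntp_ms=1_000_000,
                               clock_rate=48000)   # 10 ms after the anchor
video_abs = absolute_timestamp(93600, sr_pts=90000, sr_ntp_ms=1_000_000,
                               clock_rate=90000)   # 40 ms after the anchor
```

Remapping both streams onto one absolute clock is what later allows the video buffer to be drained by comparing against the target audio packet's absolute timestamp.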
In one possible implementation manner, the inserting the audio packet into the audio buffer and the video packet into the video buffer according to the first preset rule includes:
inserting the data packet into the tail of the buffer zone corresponding to the data packet when the display time stamp of the data packet is larger than the display time stamp of the tail packet in the buffer zone corresponding to the data packet and no hole exists between the display time stamp of the data packet and the display time stamp of the tail packet;
wherein the data packet is the audio packet and the buffer area corresponding to the data packet is the audio buffer; or
the data packet is the video packet and the buffer area corresponding to the data packet is the video buffer.
In one possible implementation manner, before inserting the data packet into the buffer area corresponding to the data packet, the method further includes:
judging whether the display time stamp of the data packet is smaller than the display time stamp of the tail packet in the buffer area corresponding to the data packet;
if yes, deleting the data packet.
In one possible implementation manner, after the determining whether the display timestamp of the data packet is less than the display timestamp of the tail packet in the buffer corresponding to the data packet, the method further includes:
if not, judging whether a hole exists between the display time stamp of the data packet and the display time stamp of the tail packet in the buffer area corresponding to the data packet;
if yes, filling a corresponding target data packet into the buffer area corresponding to the data packet based on the hole.
In one possible implementation, when the data packet is an audio packet, the target data packet is a mute packet;
when the data packet is a video packet, the target data packet is a tail packet in the video buffer area.
In one possible implementation manner, after the determining whether a hole exists between the display timestamp of the data packet and the display timestamp of the tail packet in the buffer corresponding to the data packet, the method further includes:
if not, removing timed-out packets from the buffer area corresponding to the data packet; a timed-out packet is a packet whose time interval from the tail packet in the buffer area corresponding to the data packet exceeds the preset buffer duration of that buffer area.
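The insertion rule just described (drop packets whose display timestamp is not greater than the tail's, fill holes before appending, then evict timed-out packets) can be sketched as follows. The `Packet` shape, the fixed `frame_ms` spacing used to detect holes, and the helper names are illustrative assumptions, not part of the patent.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "pts payload")  # pts: display timestamp in ms

def insert_packet(buffer, packet, frame_ms, make_filler, max_buffer_ms):
    if buffer and packet.pts <= buffer[-1].pts:
        return  # display timestamp not greater than the tail packet's: drop
    while buffer and packet.pts - buffer[-1].pts > frame_ms:
        # Hole between the tail and the new packet: fill it. For audio the
        # filler would be a mute packet; for video, a copy of the current tail.
        buffer.append(make_filler(buffer[-1].pts + frame_ms))
    buffer.append(packet)
    while buffer[-1].pts - buffer[0].pts > max_buffer_ms:
        buffer.pop(0)  # evict packets older than the preset buffer duration
```

In a production buffer a deque or ring buffer would avoid the O(n) `pop(0)`; a plain list keeps the sketch short.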
In one possible implementation manner, before inserting the audio packet into the audio buffer and inserting the video packet into the video buffer according to the first preset rule, the method further includes:
and inserting invalid packets covering the preset buffer duration into the audio buffer.
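That prefill step might look like the sketch below; the `Packet` shape and the use of `None` as the "invalid" payload are assumptions for illustration.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "pts payload")
INVALID = None  # placeholder payload marking a not-yet-received packet

def prefill_audio_buffer(frame_ms, buffer_ms):
    # Seed the audio buffer with invalid packets spanning the preset buffer
    # duration, so real packets accumulate behind them and playback only
    # begins once the jitter buffer has been built up.
    return [Packet(pts=i * frame_ms, payload=INVALID)
            for i in range(buffer_ms // frame_ms)]
```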
In one possible implementation manner, the extracting the target audio packet from the audio buffer according to the second preset rule includes:
and taking out the target audio packet from the audio buffer at regular intervals based on the duration of the audio packet in the audio buffer.
In one possible implementation manner, the step of periodically fetching the target audio packet from the audio buffer at intervals based on the duration of the audio packet in the audio buffer includes:
judging, at regular intervals, whether the packet-fetching time has been reached;
if yes, judging whether the first packet at the head of the buffer queue in the audio buffer is valid, provided that the buffer queue in the audio buffer is not empty;
if yes, judging whether the last fetched packet in the audio buffer is valid or not;
if not, taking out the first packet of the buffer queue head in the audio buffer area as the target audio packet under the condition that the buffer queue is full.
In one possible implementation manner, after the determining, at regular intervals, whether the packet taking time is reached, the method further includes:
and filling the buffer queue with the mute packet when the buffer queue in the audio buffer is empty.
In one possible implementation manner, after the determining whether the first packet of the buffer queue head in the audio buffer is valid, the method further includes:
if not, the first packet of the buffer queue head in the audio buffer is taken out.
In one possible implementation manner, after the determining whether the packet fetched in the audio buffer is valid, the method further includes:
if not, the buffer queue is filled with the mute packet when the buffer queue is not filled.
In one possible implementation manner, after the determining whether the packet fetched in the audio buffer is valid, the method further includes:
if yes, judging whether the first packet of the buffer queue head in the audio buffer area is continuous with the packet which is taken out from the last one in the audio buffer area;
if yes, taking the first packet of the buffer queue head in the audio buffer area as a target audio packet;
if not, filling mute packets of the corresponding duration before the first packet at the head of the buffer queue in the audio buffer.
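One fetch tick of the second preset rule can be condensed into the sketch below. For brevity it folds the patent's separate validity and continuity checks into a single continuity test against the last fetched packet; the names and the `-frame_ms` initial value of `last_pts` are illustrative assumptions.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "pts payload")  # pts: display timestamp in ms
MUTE = b"\x00"  # stands in for one generated frame of silence

def fetch_audio(queue, last_pts, frame_ms):
    """One fetch tick; returns (payload, new_last_pts).

    Start with last_pts = -frame_ms so the first packet (pts 0) counts as
    continuous with the (nonexistent) previous fetch.
    """
    if not queue:
        return MUTE, last_pts + frame_ms   # queue empty: play silence
    head = queue[0]
    if head.pts == last_pts + frame_ms:
        queue.pop(0)
        return head.payload, head.pts      # continuous with the last fetch
    return MUTE, last_pts + frame_ms       # hole before the head: pad silence

# A pts-60 packet following pts 0 and 20 yields one mute tick at pts 40
# before the pts-60 payload is played.
```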
In one possible implementation manner, the extracting the corresponding target video packet from the video buffer based on the absolute timestamp of the target audio packet includes:
judging whether the difference between the absolute time stamp corresponding to the first packet in the video buffer area and the absolute time stamp corresponding to the target audio packet is smaller than a threshold value;
if yes, fetching the first packet from the video buffer as the target video packet.
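The audio-aligned video fetch can be sketched as below. The `VideoPacket` shape and the 40 ms default threshold are assumptions; the patent leaves the threshold value open.

```python
from collections import namedtuple

VideoPacket = namedtuple("VideoPacket", "abs_ts payload")  # abs_ts in ms

def fetch_video(video_buffer, audio_abs_ts, threshold_ms=40):
    # Pop and return the head packet when its absolute timestamp is within
    # threshold_ms of the target audio packet's absolute timestamp;
    # otherwise return None (head not yet due, or buffer empty).
    if video_buffer and abs(video_buffer[0].abs_ts - audio_abs_ts) < threshold_ms:
        return video_buffer.pop(0)
    return None
```

Because both streams were remapped to the same absolute clock earlier, this single comparison is all the cross-stream alignment that remains at fetch time.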
In a second aspect, embodiments of the present disclosure provide an audio/video synchronization device, including:
the first acquisition module is used for acquiring audio and video data; the audio and video data comprise an audio packet and a video packet;
the updating module is used for updating the display timestamps corresponding to the audio packet and the video packet respectively by using absolute timestamps;
the first inserting module is used for inserting the audio packets into the audio buffer area and inserting the video packets into the video buffer area according to a first preset rule;
the first packet taking module is used for taking out the target audio packet from the audio buffer area according to a second preset rule;
and the second packet fetching module is used for fetching the corresponding target video packet from the video buffer zone based on the absolute timestamp of the target audio packet.
In one possible implementation manner, the audio and video synchronization device further includes:
the second acquisition module is used for acquiring the SR packets corresponding to the audio and video data;
the updating module is specifically used for:
and updating display time stamps corresponding to the audio packet and the video packet respectively by using absolute time stamps based on the SR packet.
In one possible implementation, the SR packet includes a first absolute timestamp and a first display timestamp corresponding to audio, and a second absolute timestamp and a second display timestamp corresponding to video;
the update module includes:
a determining unit configured to determine a first correspondence between an absolute timestamp corresponding to the audio packet and a display timestamp corresponding to the audio packet based on the first absolute timestamp and the first display timestamp corresponding to the audio, and determine a second correspondence between an absolute timestamp corresponding to the video packet and a display timestamp corresponding to the video packet based on the second absolute timestamp and the second display timestamp corresponding to the video;
and the updating unit is used for updating the display timestamp corresponding to the audio packet by using the absolute timestamp corresponding to the audio packet based on the first correspondence, and updating the display timestamp corresponding to the video packet by using the absolute timestamp corresponding to the video packet based on the second correspondence.
In one possible implementation manner, the first inserting module is specifically configured to:
inserting the data packet into the tail of the buffer zone corresponding to the data packet when the display time stamp of the data packet is larger than the display time stamp of the tail packet in the buffer zone corresponding to the data packet and no hole exists between the display time stamp of the data packet and the display time stamp of the tail packet;
wherein the data packet is the audio packet and the buffer area corresponding to the data packet is the audio buffer; or
the data packet is the video packet and the buffer area corresponding to the data packet is the video buffer.
In one possible implementation manner, the audio and video synchronization device further includes:
the first judging module is used for judging whether the display time stamp of the data packet is smaller than the display time stamp of the tail packet in the buffer area corresponding to the data packet;
and the first deleting module is used for deleting the data packet if yes.
In one possible implementation manner, the audio and video synchronization device further includes:
the second judging module is used for judging whether a hole exists between the display time stamp of the data packet and the display time stamp of the tail packet in the buffer area corresponding to the data packet if not;
and the first filling module is used for filling, if yes, a corresponding target data packet into the buffer area corresponding to the data packet based on the hole.
In one possible implementation, when the data packet is an audio packet, the target data packet is a mute packet;
when the data packet is a video packet, the target data packet is a tail packet in the video buffer area.
In one possible implementation manner, the audio and video synchronization device further includes:
the second deleting module is used for removing, if not, timed-out packets from the buffer area corresponding to the data packet; a timed-out packet is a packet whose time interval from the tail packet in the buffer area corresponding to the data packet exceeds the preset buffer duration of that buffer area.
In one possible implementation manner, the audio and video synchronization device further includes:
And the second inserting module is used for inserting invalid packets corresponding to the preset buffer duration in the audio buffer.
In one possible implementation manner, the first packet fetching module is specifically configured to:
and taking out the target audio packet from the audio buffer at regular intervals based on the duration of the audio packet in the audio buffer.
In one possible implementation manner, the first packet fetching module includes:
the first judging unit is used for judging whether the packet taking time is reached or not at regular intervals;
the second judging unit is used for judging whether the first packet of the first buffer queue in the audio buffer is valid or not under the condition that the buffer queue in the audio buffer is not empty;
a third judging unit, configured to judge, if yes, whether the last packet fetched from the audio buffer is valid;
and the first packet fetching unit is used for taking out the first packet of the buffer queue head in the audio buffer area as a target audio packet if the buffer queue is full.
In one possible implementation manner, the audio and video synchronization device further includes:
and the second filling module is used for filling the buffer queue into the mute packet when the buffer queue in the audio buffer area is empty.
In one possible implementation manner, the audio and video synchronization device further includes:
and the third packet fetching module is used for fetching the first packet of the buffer queue head in the audio buffer area if not.
In one possible implementation manner, the audio and video synchronization device further includes:
and a third filling module, configured to fill the buffer queue with the mute packet if the buffer queue is not filled.
In one possible implementation manner, the audio and video synchronization device further includes:
a third judging module, configured to judge, if yes, whether the first packet at the head of the buffer queue in the audio buffer is continuous with the last packet fetched from the audio buffer;
the fourth packet fetching module is used for fetching the first packet of the buffer queue head in the audio buffer area as a target audio packet if yes;
and a fourth filling module, configured to fill, if not, mute packets of the corresponding duration before the first packet at the head of the buffer queue in the audio buffer.
In one possible implementation manner, the second packet fetching module includes:
a fourth judging unit, configured to judge whether a difference between an absolute timestamp corresponding to the first packet in the video buffer and an absolute timestamp corresponding to the target audio packet is smaller than a threshold;
and the second packet fetching unit is used for fetching, if yes, the first packet from the video buffer as the target video packet.
In a third aspect, embodiments of the present disclosure provide an electronic device, including: a processor and a memory;
the processor is connected with the memory;
the memory is used for storing executable program codes;
the processor executes a program corresponding to the executable program code stored in the memory by reading the executable program code for performing the method provided by the first aspect of the embodiments of the present specification or any one of the possible implementations of the first aspect.
In a fourth aspect, embodiments of the present specification provide a computer storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to carry out the method provided by the first aspect of embodiments of the present specification or any one of the possible implementations of the first aspect.
In a fifth aspect, embodiments of the present specification provide a computer program product comprising instructions which, when run on a computer or a processor, cause the computer or the processor to perform the method of audio video synchronization provided by the first aspect of embodiments of the present specification or any one of the possible implementations of the first aspect.
In the embodiments of this specification, after audio and video data including audio packets and video packets are acquired, the display timestamps corresponding to the audio packets and the video packets are updated using absolute timestamps. The audio packets are then inserted into an audio buffer and the video packets into a video buffer according to a first preset rule, a target audio packet is fetched from the audio buffer according to a second preset rule, and the corresponding target video packet is fetched from the video buffer based on the absolute timestamp of the target audio packet. Audio/video synchronization is thus achieved through the audio and video buffers and by aligning video to audio, so that synchronization is simpler to implement and more compatible.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present description, the drawings that are required in the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present description, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic architecture diagram of an audio/video synchronization system according to an exemplary embodiment of the present disclosure;
Fig. 2 is a flowchart of an audio/video synchronization method according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating an implementation process for updating display time stamps corresponding to an audio packet and a video packet according to an exemplary embodiment of the present disclosure;
FIG. 4 is a flowchart of an embodiment of inserting audio packets into an audio buffer and inserting video packets into a video buffer according to an exemplary embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a specific implementation of extracting a target audio packet from an audio buffer according to an exemplary embodiment of the present disclosure;
fig. 6 is a schematic diagram of an implementation flow of extracting a corresponding target video packet from a video buffer based on an absolute timestamp of the target audio packet according to an exemplary embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an audio/video synchronization device according to an exemplary embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification.
The terms "first", "second", "third", and the like in the description, the claims, and the above drawings are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements, but may include other steps or elements not listed or inherent to such a process, method, system, article, or apparatus.
It should be noted that, information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals according to the embodiments of the present disclosure are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions. For example, audio-video data, absolute time stamps, and the like referred to in this specification are all acquired with sufficient authorization.
Next, please refer to fig. 1, which is a schematic diagram illustrating an audio/video synchronization system according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the audio/video synchronization system includes: a transmitting end 110 and a receiving end 120. Wherein:
the transmitting end 110 may include one or more terminals corresponding to users. User version software can be installed in the transmitting end 110 to realize functions of collecting audio and video data and the like. The transmitting end 110 may establish a data relationship with a network, and establish a data connection relationship with the receiving end 120 through the network, for example, send collected audio and video data to the receiving end 120. Any of the transmitting terminals 110 may be, but not limited to, a mobile phone, a tablet computer, a notebook computer, etc. with user software installed.
The receiving end 120 may be a server capable of providing various audio/video synchronization services, and may receive audio and video data from the transmitting end 110 through the network, where the audio and video data include audio packets and video packets. After the receiving end 120 obtains the audio and video data, it may update the display timestamps corresponding to the audio packets and video packets using absolute timestamps, then insert the audio packets into the audio buffer and the video packets into the video buffer according to the first preset rule, fetch the target audio packet from the audio buffer according to the second preset rule, and fetch the corresponding target video packet from the video buffer based on the absolute timestamp of the target audio packet. It is understood that the receiving end 120 may be, but is not limited to, a hardware server, a virtual server, or a cloud server. The receiving end 120 is also not limited to a server and may be any terminal, which is not limited in the embodiments of the present disclosure.
Optionally, after the receiving end 120 takes out the target audio packet and the target video packet corresponding to the target audio packet, the target audio packet and the target video packet corresponding to the target audio packet may be rendered and played, or the target audio packet and the target video packet corresponding to the target audio packet may be sent to other devices for rendering and playing.
The network may be a medium that provides a communication link between the receiving end 120 and any one of the transmitting ends 110, or may be the internet that includes network devices and transmission media, but is not limited thereto. The transmission medium may be a wired link, such as, but not limited to, coaxial cable, fiber optic and digital subscriber lines (digital subscriber line, DSL), etc., or a wireless link, such as, but not limited to, wireless internet (wireless fidelity, WIFI), bluetooth, a mobile device network, etc.
It should be understood that the number of the transmitting end 110 and the receiving end 120 in the audio and video synchronization system shown in fig. 1 is merely an example, and in a specific implementation, any number of transmitting ends and receiving ends may be included in the audio and video synchronization system, which is not specifically limited in the embodiments of the present disclosure. For example, but not limited to, the sender 110 may be a sender cluster composed of multiple senders, and the receiver 120 may be a receiver cluster composed of multiple receivers.
Next, referring to fig. 1, taking the example that the receiving end 120 performs audio/video synchronization, the audio/video synchronization method provided in the embodiment of the present disclosure will be described. Referring to fig. 2, a flowchart of an audio/video synchronization method according to an exemplary embodiment of the present disclosure is shown. As shown in fig. 2, the audio/video synchronization method includes the following steps:
s202, acquiring audio and video data, wherein the audio and video data comprises an audio packet and a video packet.
Specifically, in various real-time streaming media scenarios such as online video, video conferencing and live broadcast, the sending end collects audio and video data and then sends the data to the receiving end through a network. After the receiving end obtains the audio and video data from the sending end through the network, it generally needs to perform audio and video synchronization on the data to ensure the smoothness and display effect of the subsequent audio and video.
S204, updating the display time stamp corresponding to each of the audio packet and the video packet by using the absolute time stamp.
Specifically, the absolute (NTP) timestamp may be the network time at which the sending end collected the audio and video data, expressed as the number of seconds elapsed since 00:00:00 on 1 January 1900. The display time stamp may be the playing time corresponding to the audio/video data. To ensure the accuracy of the display time stamps of the audio packets and video packets, after the audio and video data are acquired, the display time stamps may be updated with the absolute time stamps, thereby correcting errors in the display time stamps collected by the sending end. The display time stamp may also be the sending time or the collection time corresponding to the audio/video data; the embodiments of the present disclosure are not limited in this respect. Before updating the display time stamps of the audio packets and video packets with the absolute time stamps, the receiving end also obtains the SR (sender report) packet corresponding to the audio/video data. The display time stamps of the audio packets and video packets can then be updated with the absolute time stamps directly on the basis of the SR packet.
Further, the SR packet includes a first absolute timestamp and a first display timestamp corresponding to the audio, and a second absolute timestamp and a second display timestamp corresponding to the video. The implementation of updating the display time stamps of the audio packets and video packets with the absolute time stamps based on the SR packet is shown in fig. 3: a first correspondence between the absolute time stamp and the display time stamp of the audio packet is determined based on the first absolute time stamp and the first display time stamp corresponding to the audio; a second correspondence between the absolute time stamp and the display time stamp of the video packet is determined based on the second absolute time stamp and the second display time stamp corresponding to the video; the display time stamp of the audio packet is then updated using the absolute time stamp of the audio packet based on the first correspondence, and the display time stamp of the video packet is updated using the absolute time stamp of the video packet based on the second correspondence.
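The correspondence described above is essentially a linear mapping: an SR packet anchors one (absolute time, display time) pair per stream, and any later packet's display timestamp can be projected onto the absolute clock by offsetting from that anchor at the stream's clock rate. A minimal sketch, assuming all names (`pts_to_ntp`, the 48 kHz audio clock rate) are illustrative and not taken from the disclosure:

```python
def pts_to_ntp(pts: int, sr_ntp: float, sr_pts: int, clock_rate: int) -> float:
    """Map a packet's display timestamp (PTS, in clock-rate ticks) onto the
    absolute (NTP) clock, in seconds, using the SR packet's anchor pair."""
    return sr_ntp + (pts - sr_pts) / clock_rate

# Example: an audio SR anchors NTP time 1000.0 s at PTS 48000; a later packet
# with PTS 96000 at a 48 kHz clock rate maps to NTP time 1001.0 s.
audio_ntp = pts_to_ntp(96000, sr_ntp=1000.0, sr_pts=48000, clock_rate=48000)
```

The video stream gets its own anchor (the second absolute/display timestamp pair), so audio and video packets end up on the same absolute clock even though their display timestamps run on different clock rates.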
S206, inserting the audio packet into the audio buffer and inserting the video packet into the video buffer according to the first preset rule.
Specifically, the receiving end creates an audio buffer and a video buffer, updates display time stamps corresponding to the audio packet and the video packet by using absolute time stamps, and inserts the audio packet into the audio buffer and inserts the video packet into the video buffer according to a first preset rule based on the updated display time stamps corresponding to the audio packet and the video packet.
Specifically, as shown in fig. 4, in S206, according to the first preset rule, the specific implementation process of inserting the audio packet into the audio buffer and inserting the video packet into the video buffer may include:
s402, judging whether the display time stamp of the data packet is smaller than the display time stamp of the tail packet in the buffer area corresponding to the data packet.
Specifically, after updating the display time stamp corresponding to each of the audio packet and the video packet by using the absolute time stamp, in order to ensure continuity of the display time stamp of each data packet buffered in the buffer area corresponding to the data packet, it may be first determined whether the display time stamp of the data packet is smaller than the display time stamp of the tail packet in the buffer area corresponding to the data packet. When the data packet is an audio packet, the buffer area corresponding to the data packet is an audio buffer area. When the data packet is a video packet, the buffer area corresponding to the data packet is a video buffer area.
S404, if yes, deleting the data packet.
Specifically, if the display time stamp of the data packet currently acquired by the receiving end is smaller than the display time stamp of the tail packet in the corresponding buffer, the current data packet should have been displayed before that tail packet; it is therefore expired and can be deleted directly.
S406, if not, judging whether a hole exists between the display time stamp of the data packet and the display time stamp of the tail packet in the buffer area corresponding to the data packet.
Specifically, after determining in S402 whether the display timestamp of the data packet is smaller than the display timestamp of the tail packet in the corresponding buffer, if it is not smaller, then, to ensure the continuity of the display timestamps of the data packets cached in that buffer, it may be further determined whether a hole exists between the display timestamp of the data packet and the display timestamp of the tail packet, that is, whether the two display timestamps are continuous.
And S408, if so, filling the corresponding target data packet in the buffer area corresponding to the data packet based on the hole.
Specifically, if a hole exists between the display timestamp of the data packet and the display timestamp of the tail packet in the corresponding buffer, the display timestamp of the currently acquired data packet is considered discontinuous with that of the tail packet. Target data packets covering the corresponding duration, i.e., the time interval between the display timestamp of the currently acquired data packet and that of the tail packet, can then be filled in directly, ensuring the continuity of the display timestamps of the data packets in the buffers (both the video buffer and the audio buffer).
Further, when the data packet is an audio packet, the target data packet is a mute packet. That is, when an audio packet is inserted into the audio buffer and a hole exists between its display timestamp and that of the tail packet in the audio buffer, mute packets of the corresponding duration are filled in according to the time interval between the two display timestamps, ensuring the continuity of the display timestamps of the audio packets cached in the audio buffer and facilitating the subsequent alignment of video packets to audio packets for audio/video synchronization. When the data packet is a video packet, the target data packet is the tail packet in the video buffer. That is, when a video packet is inserted into the video buffer and a hole exists between its display timestamp and that of the tail packet, copies of the tail packet covering the corresponding duration are filled in according to the time interval between the two display timestamps, likewise ensuring the continuity of the display timestamps of the video packets cached in the video buffer.
If not, in S406, after determining whether a hole exists between the display timestamp of the data packet and the display timestamp of the tail packet in the buffer area corresponding to the data packet, the method further includes:
s410, if not, removing the overtime packet in the buffer zone corresponding to the data packet.
Specifically, if no hole exists between the display timestamp of the data packet and that of the tail packet in the corresponding buffer, the timed-out packets in that buffer are removed. A timed-out packet is a packet whose target time interval from the tail packet exceeds the preset buffer duration of the buffer, where the target time interval is the interval between the display time stamp of a data packet in the buffer and the display time stamp of the tail packet. The preset buffer duration may be, but is not limited to, 3 s, 2 s, etc.
It can be understood that real-time streaming does not need to retain audio and video data for very long. Under abnormal conditions, for example when the network is congested and no audio and video data is received for a long time, after which a large amount of data arrives suddenly, the expired timed-out packets cached in the video buffer and the audio buffer can be deleted directly. This avoids spending a long time displaying stale audio or video instead of the current state.
Next, please refer to fig. 4, as shown in fig. 4, if not, in S406, after determining whether there is a hole between the display timestamp of the data packet and the display timestamp of the tail packet in the buffer corresponding to the data packet, the method further includes:
and S412, if not, inserting the data packet into the tail of the buffer area corresponding to the data packet.
Specifically, when the display timestamp of the data packet is greater than the display timestamp of the tail packet in the buffer area corresponding to the data packet, and no hole exists between the display timestamp of the data packet and the display timestamp of the tail packet, the display timestamp of the data packet currently acquired by the receiving end can be considered to be continuous with the display timestamp of the tail packet in the buffer area corresponding to the data packet, and then the data packet can be directly inserted into the tail of the buffer area corresponding to the data packet. When the data packet is an audio packet, the buffer area corresponding to the data packet is an audio buffer area; when the data packet is a video packet, the buffer area corresponding to the data packet is a video buffer area.
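The branches of S402 through S412 can be collected into one insertion routine. The sketch below is an assumed implementation, not text from the disclosure: `FRAME_MS` and `MAX_BUFFER_MS` stand in for the packet duration and preset buffer duration, and `make_filler` stands in for "mute packet" (audio) or "copy of the tail packet" (video):

```python
from collections import deque

FRAME_MS = 20          # assumed packet duration in ms
MAX_BUFFER_MS = 3000   # assumed preset buffer duration (e.g. 3 s)

def insert_packet(buffer: deque, packet: dict, make_filler) -> None:
    """Insert `packet` (with an updated display timestamp 'pts', in ms)
    following the first preset rule: delete expired packets, fill holes
    with filler packets, and evict timed-out packets."""
    if buffer and packet["pts"] < buffer[-1]["pts"]:
        return  # expired: it would display before the current tail packet
    if buffer:
        gap = packet["pts"] - buffer[-1]["pts"]
        if gap > FRAME_MS:
            # hole: fill with mute packets (audio) or tail-packet copies (video)
            pts = buffer[-1]["pts"] + FRAME_MS
            while pts < packet["pts"]:
                buffer.append(make_filler(pts))
                pts += FRAME_MS
        else:
            # no hole: remove packets older than the preset buffer duration
            while buffer and packet["pts"] - buffer[0]["pts"] > MAX_BUFFER_MS:
                buffer.popleft()
    buffer.append(packet)

# Inserting packets at 0 ms, 20 ms and 100 ms yields two fillers at 40-80 ms:
buf = deque()
for pts in (0, 20, 100):
    insert_packet(buf, {"pts": pts}, lambda p: {"pts": p, "filler": True})
```

Under these assumptions the buffer afterwards holds display timestamps 0, 20, 40, 60, 80, 100, so every cached packet is exactly one frame apart, which is the continuity property the first preset rule is designed to guarantee.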
Optionally, to ensure that each piece of audio/video data acquired from the network is taken out only after remaining in the buffer for a period of time, so that the audio packets in the audio buffer and the video packets in the video buffer can be synchronized, invalid packets corresponding to the preset buffer duration of the audio buffer may also be inserted into the audio buffer before any audio packets are inserted according to the first preset rule.
Next, please refer to fig. 2, as shown in fig. 2, the audio/video synchronization method further includes:
s208, the target audio packet is taken out from the audio buffer according to a second preset rule.
Specifically, since each audio packet contains one frame of audio and each audio frame has a constant duration, taking the target audio packet out of the audio buffer according to the second preset rule may be done at fixed intervals based on the audio packet duration; that is, a timer may be set so that target audio packets are taken out of the audio buffer at intervals equal to the audio frame duration. The interval for taking target audio packets out of the audio buffer may thus be the duration of an audio frame, i.e., the duration of a target audio packet. To avoid situations in which audio cannot be fetched steadily due to sudden network congestion, the interval may also be set smaller than the audio frame duration.
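A timer of this kind is usually scheduled against absolute deadlines rather than with a fixed sleep, so that the time spent inside each fetch does not accumulate as drift. A minimal sketch, with `fetch_once` standing in for the second-preset-rule fetch and `FRAME_MS` an assumed 20 ms frame duration:

```python
import time

FRAME_MS = 20  # assumed audio frame duration in milliseconds

def run_fetch_loop(fetch_once, ticks: int) -> None:
    """Invoke `fetch_once` every FRAME_MS milliseconds for `ticks` ticks.
    Sleeping until an absolute deadline absorbs the jitter of each fetch."""
    deadline = time.monotonic()
    for _ in range(ticks):
        deadline += FRAME_MS / 1000.0
        fetch_once()
        remaining = deadline - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
```

Shortening `FRAME_MS` slightly (say to 18 ms) implements the variant mentioned above, in which the fetch interval is smaller than the audio frame duration to ride out momentary congestion.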
It can be appreciated that in a real-time streaming scenario, step S208 and step S206 may be performed concurrently; that is, while the receiving end inserts the currently acquired audio packets and video packets into their corresponding buffers, it also takes target audio packets out of the audio buffer. The embodiments of the present disclosure are not limited in this respect.
Specifically, as shown in fig. 5, the specific implementation process of extracting the target audio packet from the audio buffer according to the second preset rule in S208 above may include:
s502, judging whether the packet taking time is reached at regular intervals.
Specifically, a timer may be set with the audio frame duration as its interval to determine whether the packet-fetching time has been reached.
For example, if the audio frame duration is 20 ms, then to ensure the continuity of the audio, a packet-fetch operation may be performed on the audio buffer every 20 ms. It is then determined whether 20 ms has elapsed since the previous packet fetch; if so, the packet-fetching time has been reached, and if not, it has not.
S504, if the packet fetching time is reached, judging whether the buffer queue in the audio buffer is empty.
S506, in the case where the buffer queue in the audio buffer is empty, the buffer queue is filled with mute packets.
Specifically, if the packet-fetching time has been reached, then, to ensure a constant audio output, the buffer queue is filled with mute packets when it is empty, for example when a sending end that is muted in the scene sends no audio frames. Since video packets may still be transmitted and played in that case, the buffer queue needs to be filled with mute packets so that, after the video is aligned to the audio, the video frames can still be output continuously and stably.
S508, if the buffer queue in the audio buffer is not empty, judging whether the first packet of the buffer queue head in the audio buffer is valid.
Specifically, if the packet fetching time is reached and the buffer queue in the audio buffer is not empty, it may be further determined whether the first packet of the buffer queue head in the audio buffer is valid.
S510, if the first packet of the buffer queue head in the audio buffer is invalid, the first packet of the buffer queue head in the audio buffer is fetched.
Specifically, if the first packet of the buffer queue head in the audio buffer is not the audio packet acquired by the receiving end through the network, but is an invalid packet inserted in advance, the first packet of the buffer queue head in the audio buffer is considered invalid, and the invalid packet of the buffer queue head in the audio buffer can be directly taken out.
S512, if the first packet of the buffer queue head in the audio buffer area is valid, judging whether the last fetched packet in the audio buffer area is valid or not.
Specifically, if the first packet at the head of the buffer queue in the audio buffer is an audio packet that the receiving end acquired through the network, rather than a pre-inserted invalid packet, it is considered valid, and it must be further determined whether the last fetched packet in the audio buffer is valid.
S514, if the last fetched packet in the audio buffer is invalid, it is determined whether the buffer queue is full.
Specifically, if the first packet of the first buffer queue in the audio buffer is valid and the last packet taken out in the audio buffer is invalid, it is further determined whether the buffer queue is full.
S516, if the buffer queue is full, taking out the first packet of the buffer queue head in the audio buffer area as the target audio packet.
Specifically, if the first packet at the head of the buffer queue in the audio buffer is valid while the last fetched packet is invalid and the buffer queue is full, the head packet is an audio packet obtained from the network and the buffer queue contains no hole, so the head packet can be taken out directly as the target audio packet.
S518, if the buffer queue is not filled, the buffer queue is filled with mute packets.
Specifically, if the first packet at the head of the buffer queue in the audio buffer is valid but the last fetched packet is invalid and the buffer queue is not full, the buffer queue may still contain a hole. To avoid unstable fetching of audio frames affecting audio/video alignment, the buffer queue in the audio buffer is filled with mute packets.
Next, please refer to fig. 5, in S512, if the first packet of the buffer queue head in the audio buffer is valid, after determining whether the last fetched packet in the audio buffer is valid, the method further includes:
s520, if the last fetched packet in the audio buffer is valid, it is determined whether the first packet of the buffer queue head in the audio buffer is continuous with the last fetched packet in the audio buffer.
S522, if the first packet of the buffer queue head in the audio buffer is continuous with the last extracted packet in the audio buffer, the first packet of the buffer queue head in the audio buffer is extracted as the target audio packet.
Specifically, if the last extracted packet in the audio buffer is valid, and the first packet in the audio buffer that buffers the head of the queue is continuous with the last extracted packet in the audio buffer, which indicates that there is no hole between the first packet in the audio buffer that buffers the head of the queue and the last extracted packet in the audio buffer, the first packet in the audio buffer that buffers the head of the queue may be directly extracted as the target audio packet.
S524, if the first packet of the buffer queue head in the audio buffer area is discontinuous with the last extracted packet in the audio buffer area, filling the mute packet with the corresponding duration before the first packet of the buffer queue head in the audio buffer area.
Specifically, if the last fetched packet in the audio buffer is valid but the first packet at the head of the buffer queue is discontinuous with it, there is a hole between the two. To ensure the continuity of the fetched audio packets, mute packets of the corresponding duration must be filled in before the first packet at the head of the buffer queue.
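The decision tree of S502 through S524 can be condensed into a single per-tick fetch function. The sketch below is an assumed, simplified implementation: packet fields (`pts`, `valid`, `mute`) are illustrative names, and where the flow fills the queue with mute packets this sketch instead returns a mute packet directly for that tick, which is one plausible reading of the disclosure rather than its literal procedure:

```python
from collections import deque
from typing import Optional

FRAME_MS = 20  # assumed audio frame duration in ms

def mute(pts: int) -> dict:
    """Build a mute filler packet at the given display timestamp."""
    return {"pts": pts, "valid": True, "mute": True}

def fetch_target_audio(queue: deque, last: Optional[dict],
                       capacity: int) -> dict:
    """One packet-fetch tick following the second preset rule; `last` is the
    packet returned by the previous tick."""
    if not queue:  # empty queue: output mute so audio never stalls
        queue.append(mute(last["pts"] + FRAME_MS if last else 0))
    head = queue[0]
    if not head["valid"]:
        return queue.popleft()  # pre-inserted invalid packet: just remove it
    if last is None or not last["valid"]:
        if len(queue) >= capacity:
            return queue.popleft()  # queue full, head valid: start real output
        return mute(head["pts"] - FRAME_MS)  # queue not yet full: mute filler
    if head["pts"] - last["pts"] <= FRAME_MS:
        return queue.popleft()  # head continuous with the last fetched packet
    return mute(last["pts"] + FRAME_MS)  # hole before the head: mute filler
```

Under these assumptions the function returns exactly one packet per timer tick, either a real audio packet or a mute one, which is the invariant the video-to-audio alignment in the next step relies on.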
Next, please refer to fig. 2, as shown in fig. 2, after the target audio packet is fetched from the audio buffer according to the second preset rule in S208, the audio/video synchronization method further includes:
s210, the corresponding target video packet is fetched from the video buffer based on the absolute timestamp of the target audio packet.
Specifically, as shown in fig. 6, the implementation process of extracting the corresponding target video packet from the video buffer based on the absolute timestamp of the target audio packet may include:
s602, judging whether the difference between the absolute time stamp corresponding to the first packet in the video buffer area and the absolute time stamp corresponding to the target audio packet is smaller than a threshold value.
Specifically, after the target audio packet is taken out from the audio buffer according to the second preset rule, it may be first determined whether the difference between the absolute timestamp corresponding to the first packet in the video buffer and the absolute timestamp corresponding to the target audio packet is smaller than a threshold, i.e., whether the absolute timestamp corresponding to the first packet in the video buffer is synchronous with the absolute timestamp corresponding to the target audio packet.
And S604, if yes, taking the first packet in the video buffer area out of the video buffer area as a target video packet.
Specifically, if the difference between the absolute timestamp of the first packet in the video buffer and that of the target audio packet is smaller than the threshold, the two are considered synchronous. The first packet can then be taken out of the video buffer as the target video packet corresponding to the target audio packet, and the target audio packet and its target video packet can be rendered and played, or sent to other terminals for rendering and playing.
S606, if not, judging whether the absolute time stamp corresponding to the first packet in the video buffer is smaller than the absolute time stamp corresponding to the target audio packet.
Specifically, if the difference between the absolute timestamp corresponding to the first packet in the video buffer and the absolute timestamp corresponding to the target audio packet is not smaller than the threshold, it may be considered that the absolute timestamp corresponding to the first packet in the video buffer is not synchronous with the absolute timestamp corresponding to the target audio packet, and it is further required to determine whether the absolute timestamp corresponding to the first packet in the video buffer is smaller than the absolute timestamp corresponding to the target audio packet.
And S608, if the absolute time stamp corresponding to the first packet in the video buffer is smaller than the absolute time stamp corresponding to the target audio packet, deleting the first packet in the video buffer.
Specifically, if the difference between the absolute timestamp of the first packet in the video buffer and that of the target audio packet is not smaller than the threshold, and the first packet's absolute timestamp is smaller than that of the target audio packet, the first packet is an early video packet. To avoid invalid display of video and to keep the video and audio synchronized in real time, the first packet in the video buffer is deleted and S602 is executed again to determine whether the difference between the absolute timestamp of the new first packet and that of the target audio packet is smaller than the threshold.
And S610, if the absolute time stamp corresponding to the first packet in the video buffer is greater than the absolute time stamp corresponding to the target audio packet, taking the last packet of the first packet in the video buffer as the target video packet out of the video buffer.
Specifically, if the difference between the absolute timestamp of the first packet in the video buffer and that of the target audio packet is not smaller than the threshold, and the first packet's absolute timestamp is greater than that of the target audio packet, the first packet corresponds to an audio packet that comes after the target audio packet. To keep the subsequent audio and video synchronized, the packet preceding the first packet is taken as the target video packet; that is, the previously played video frame is played again.
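Steps S602 through S610 amount to a small alignment loop over the head of the video buffer. A sketch under stated assumptions: `SYNC_THRESHOLD` and the `ntp` field are illustrative names, and `prev_frame` tracks the previously played video packet across calls (the disclosure does not specify these details):

```python
from collections import deque
from typing import Optional

SYNC_THRESHOLD = 0.1  # assumed threshold, in seconds, for "synchronous"

def fetch_target_video(video_buf: deque, audio_ntp: float,
                       prev_frame: Optional[dict]) -> Optional[dict]:
    """Pick the target video packet for a target audio packet whose absolute
    (NTP) timestamp is `audio_ntp`; `prev_frame` is the previously played
    video packet."""
    while video_buf:
        diff = video_buf[0]["ntp"] - audio_ntp
        if abs(diff) < SYNC_THRESHOLD:
            return video_buf.popleft()  # synchronous: play the head packet
        if diff < 0:
            video_buf.popleft()         # early video packet: delete, re-check
        else:
            return prev_frame           # head is for later audio: repeat frame
    return prev_frame
```

Because the audio side emits one packet per tick without gaps, this loop either catches the buffer up (dropping early frames), plays the matching frame, or freezes on the previous frame until the audio clock reaches the head packet.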
In summary, after audio and video data including audio packets and video packets are acquired, the display time stamps of the audio packets and video packets are updated with absolute time stamps; the audio packets are then inserted into the audio buffer and the video packets into the video buffer according to the first preset rule; a target audio packet is taken out of the audio buffer according to the second preset rule; and the corresponding target video packet is taken out of the video buffer based on the absolute time stamp of the target audio packet. Audio/video synchronization is thus achieved by aligning the audio buffer with the video buffer and aligning the video to the audio, making the synchronization simpler and more compatible.
Next, please refer to fig. 7, which is a schematic diagram of an audio/video synchronization device according to an exemplary embodiment of the present disclosure. As shown in fig. 7, the audio/video synchronization device 700 includes:
a first obtaining module 710, configured to obtain audio and video data; the audio and video data comprise an audio packet and a video packet;
an updating module 720, configured to update display timestamps corresponding to the audio packets and the video packets with absolute timestamps;
A first inserting module 730, configured to insert the audio packet into an audio buffer and insert the video packet into a video buffer according to a first preset rule;
the first packet fetching module 740 is configured to fetch a target audio packet from the audio buffer according to a second preset rule;
the second packet fetching module 750 is configured to fetch the corresponding target video packet from the video buffer based on the absolute timestamp of the target audio packet.
In one possible implementation manner, the audio/video synchronization device 700 further includes:
the second acquisition module is used for acquiring the SR packets corresponding to the audio and video data;
the update module 720 is specifically configured to:
and updating display time stamps corresponding to the audio packet and the video packet respectively by using absolute time stamps based on the SR packet.
In one possible implementation, the SR packet includes a first absolute timestamp and a first display timestamp corresponding to audio, and a second absolute timestamp and a second display timestamp corresponding to video;
the update module 720 includes:
a determining unit configured to determine a first correspondence between an absolute timestamp corresponding to the audio packet and a display timestamp corresponding to the audio packet based on the first absolute timestamp and the first display timestamp corresponding to the audio, and determine a second correspondence between an absolute timestamp corresponding to the video packet and a display timestamp corresponding to the video packet based on the second absolute timestamp and the second display timestamp corresponding to the video;
And the updating unit is used for updating the display time stamp corresponding to the audio packet by using the absolute time stamp corresponding to the audio packet based on the first corresponding relation, and updating the display time stamp corresponding to the video packet by using the absolute time stamp corresponding to the video packet based on the second corresponding relation.
In one possible implementation manner, the first inserting module 730 is specifically configured to:
inserting the data packet into the tail of the buffer zone corresponding to the data packet when the display time stamp of the data packet is larger than the display time stamp of the tail packet in the buffer zone corresponding to the data packet and no hole exists between the display time stamp of the data packet and the display time stamp of the tail packet;
wherein the data packet is the audio packet and the buffer corresponding to the data packet is the audio buffer; or the data packet is the video packet and the buffer corresponding to the data packet is the video buffer.
In one possible implementation manner, the audio/video synchronization device 700 further includes:
the first judging module is used for judging whether the display time stamp of the data packet is smaller than the display time stamp of the tail packet in the buffer area corresponding to the data packet;
And the first deleting module is used for deleting the data packet if yes.
In one possible implementation manner, the audio/video synchronization device 700 further includes:
the second judging module is used for judging whether a hole exists between the display time stamp of the data packet and the display time stamp of the tail packet in the buffer area corresponding to the data packet if not;
and the first filling module is used for, if so, filling the corresponding target data packet in the buffer corresponding to the data packet based on the hole.
In one possible implementation, when the data packet is an audio packet, the target data packet is a mute packet;
when the data packet is a video packet, the target data packet is a tail packet in the video buffer area.
In one possible implementation manner, the audio/video synchronization device 700 further includes:
the second deleting module is used for removing the overtime packet in the buffer area corresponding to the data packet if not; the overtime packet is a packet with a target time interval between the overtime packet and the tail packet in the corresponding buffer zone of the data packet being longer than the preset buffer time of the corresponding buffer zone of the data packet.
In one possible implementation manner, the audio/video synchronization device 700 further includes:
And the second inserting module is used for inserting invalid packets corresponding to the preset buffer duration in the audio buffer.
In one possible implementation manner, the first packet fetching module 740 is specifically configured to:
and taking out the target audio packet from the audio buffer at regular intervals based on the duration of the audio packet in the audio buffer.
In one possible implementation manner, the first packet fetching module 740 includes:
the first judging unit is used for judging, at regular intervals, whether the packet fetching time has been reached;
the second judging unit is used for judging whether the first packet at the head of the buffer queue in the audio buffer is valid, in the case that the buffer queue in the audio buffer is not empty;
a third judging unit, configured to judge, if so, whether the packet previously fetched from the audio buffer is valid;
and the first packet fetching unit is used for, if not, taking out the first packet at the head of the buffer queue in the audio buffer as the target audio packet when the buffer queue is full.
In one possible implementation manner, the audio/video synchronization device 700 further includes:
and the second filling module is used for filling the buffer queue into the mute packet when the buffer queue in the audio buffer area is empty.
In one possible implementation manner, the audio/video synchronization device 700 further includes:
and the third packet fetching module is used for fetching the first packet of the buffer queue head in the audio buffer area if not.
In one possible implementation manner, the audio/video synchronization device 700 further includes:
and a third filling module, configured to fill the buffer queue with the mute packet if the buffer queue is not filled.
In one possible implementation manner, the audio/video synchronization device 700 further includes:
a third judging module, configured to judge, if so, whether the first packet at the head of the buffer queue in the audio buffer is continuous with the packet previously fetched from the audio buffer;
the fourth packet fetching module is used for taking out the first packet at the head of the buffer queue in the audio buffer as the target audio packet if so;
and a fourth filling module, configured to fill, if not, mute packets of the corresponding duration before the first packet at the head of the buffer queue in the audio buffer.
In one possible implementation manner, the second packet fetching module 750 includes:
a fourth judging unit, configured to judge whether the difference between the absolute timestamp corresponding to the first packet in the video buffer and the absolute timestamp corresponding to the target audio packet is smaller than a threshold;
and the second packet fetching unit is used for taking the first packet in the video buffer out of the video buffer as the target video packet if so.
The division of the modules in the above audio and video synchronization device is for illustration only; in other embodiments, the audio and video synchronization device may be divided into different modules as needed to complete all or part of the functions of the device. Each module in the audio and video synchronization apparatus provided in the embodiments of the present specification may be implemented in the form of a computer program. The computer program may run on a terminal or a server, and its program modules may be stored in the memory of the terminal or server. When executed by a processor, the computer program implements all or part of the steps of the audio and video synchronization method described in the embodiments of the present specification.
Next, please refer to fig. 8, which is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure. As shown in fig. 8, the electronic device 800 may include: at least one processor 810, at least one communication bus 820, a user interface 830, at least one network interface 840, a memory 850.
Wherein the communication bus 820 may be used to facilitate the coupled communication of the various components described above.
The user interface 830 may include a Display screen (Display) and a Camera (Camera), and the user interface 830 may also include a standard wired interface and a wireless interface.
The network interface 840 may optionally include a Bluetooth module, a near field communication (Near Field Communication, NFC) module, a wireless fidelity (Wireless Fidelity, Wi-Fi) module, and the like.
Processor 810 may include one or more processing cores. The processor 810 connects various parts of the electronic device 800 using various interfaces and lines, and performs various functions of the electronic device 800 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 850 and invoking data stored in the memory 850. Alternatively, the processor 810 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 810 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU renders and draws the content to be displayed by the display screen; and the modem handles wireless communication. It will be appreciated that the modem may also not be integrated into the processor 810 and may instead be implemented by a separate chip.
The memory 850 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). Optionally, the memory 850 includes a non-transitory computer-readable medium. The memory 850 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 850 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (e.g., an acquisition function, an update function, a packet fetching function, etc.), instructions for implementing the various method embodiments described above, and the like; the stored data area may store the data referred to in the above method embodiments. The memory 850 may optionally also be at least one storage device located remotely from the processor 810. As shown in fig. 8, the memory 850, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and program instructions.
In particular, the processor 810 may be configured to invoke the program instructions stored in the memory 850 and to perform in particular the following operations:
acquiring audio and video data; the audio and video data includes an audio packet and a video packet.
And updating the display time stamp corresponding to each of the audio packet and the video packet by using the absolute time stamp.
According to a first preset rule, the audio packet is inserted into an audio buffer area, and the video packet is inserted into a video buffer area.
And taking out the target audio packet from the audio buffer according to a second preset rule.
And taking out the corresponding target video packet from the video buffer zone based on the absolute timestamp of the target audio packet.
In some possible embodiments, before the processor 810 performs updating the display time stamps corresponding to the audio packets and the video packets with absolute time stamps, the method further includes:
and acquiring the SR packet corresponding to the audio and video data.
The processor 810 is specifically configured to, when executing updating the display time stamps corresponding to the audio packet and the video packet with absolute time stamps, execute:
and updating display time stamps corresponding to the audio packet and the video packet respectively by using absolute time stamps based on the SR packet.
In some possible embodiments, the SR packet includes a first absolute timestamp and a first display timestamp corresponding to audio, and a second absolute timestamp and a second display timestamp corresponding to video;
The processor 810 is specifically configured to, when executing updating display time stamps corresponding to the audio packet and the video packet with absolute time stamps based on the SR packet:
determining a first correspondence between an absolute timestamp corresponding to the audio packet and a display timestamp corresponding to the audio packet based on the first absolute timestamp corresponding to the audio and the first display timestamp, and determining a second correspondence between an absolute timestamp corresponding to the video packet and a display timestamp corresponding to the video packet based on the second absolute timestamp corresponding to the video and the second display timestamp.
And updating the display time stamp corresponding to the audio packet by using the absolute time stamp corresponding to the audio packet based on the first corresponding relation, and updating the display time stamp corresponding to the video packet by using the absolute time stamp corresponding to the video packet based on the second corresponding relation.
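These two per-stream correspondences can be sketched as follows. This is a minimal illustrative reconstruction, not the patented implementation: the function names, the seconds-based absolute timestamp, and the example media clock rates (48 kHz audio, 90 kHz video) are assumptions.

```python
def make_pts_to_abs(sr_abs_ts, sr_pts, clock_rate):
    """Build a per-stream mapping from display (presentation) timestamps,
    in media-clock ticks, to absolute wall-clock time in seconds,
    anchored at the timestamp pair carried in one SR packet."""
    def pts_to_abs(pts):
        # Offset from the SR anchor, converted from ticks to seconds.
        return sr_abs_ts + (pts - sr_pts) / clock_rate
    return pts_to_abs

# Audio and video run on independent media clocks, so each stream gets
# its own anchor: the "first" and "second" correspondences in the text.
audio_to_abs = make_pts_to_abs(sr_abs_ts=1000.0, sr_pts=48_000, clock_rate=48_000)
video_to_abs = make_pts_to_abs(sr_abs_ts=1000.0, sr_pts=90_000, clock_rate=90_000)
```

With a shared absolute time base, an audio packet and a video packet can be compared directly even though their display timestamps come from unrelated clocks.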
In some possible embodiments, the processor 810 is configured to insert the audio packet into the audio buffer and the video packet into the video buffer according to a first preset rule, and is specifically configured to perform:
and inserting the data packet into the tail of the buffer area corresponding to the data packet when the display time stamp of the data packet is larger than the display time stamp of the tail packet in the buffer area corresponding to the data packet and no hole exists between the display time stamp of the data packet and the display time stamp of the tail packet.
Wherein, the data packet is the audio packet, and the buffer area corresponding to the data packet is an audio buffer area; or (b)
The data packet is the video packet, and the buffer area corresponding to the data packet is a video buffer area.
In some possible embodiments, before the processor 810 performs inserting the data packet into a buffer corresponding to the data packet, the method further includes:
and judging whether the display time stamp of the data packet is smaller than the display time stamp of the tail packet in the buffer area corresponding to the data packet.
If yes, deleting the data packet.
In some possible embodiments, after the processor 810 performs determining whether the display timestamp of the data packet is less than the display timestamp of the tail packet in the buffer corresponding to the data packet, the method is further performed by:
if not, judging whether a hole exists between the display time stamp of the data packet and the display time stamp of the tail packet in the buffer area corresponding to the data packet.
If yes, filling a corresponding target data packet in a buffer area corresponding to the data packet based on the cavity.
In some possible embodiments, when the data packet is an audio packet, the target data packet is a mute packet;
When the data packet is a video packet, the target data packet is a tail packet in the video buffer area.
In some possible embodiments, after the processor 810 performs determining whether there is a hole between the display timestamp of the data packet and the display timestamp of the tail packet in the buffer corresponding to the data packet, the method is further used to perform:
if not, removing any timed-out packet from the buffer corresponding to the data packet; a timed-out packet is a packet whose time interval to the tail packet of the corresponding buffer is longer than the preset buffer duration of that buffer.
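The insertion rule described above (drop late packets, fill holes with target packets, evict packets that fall outside the preset buffer duration) can be sketched as follows. The dict-based packet representation, the fixed per-packet timestamp step, and performing eviction after the append are simplifying assumptions, not the patented implementation.

```python
from collections import deque

def insert_packet(buf, pkt, step, max_span, make_filler):
    """Insert pkt at the tail of buf per the first preset rule (sketch).

    buf: deque of packets ordered by display timestamp ('pts')
    step: pts increment between consecutive packets
    max_span: preset buffer duration, in pts units
    make_filler: builds the target packet for each missing pts
                 (a mute packet for audio; a copy of the tail
                 packet for video)
    """
    if buf and pkt["pts"] < buf[-1]["pts"]:
        return  # display timestamp behind the tail packet: delete it
    if buf and pkt["pts"] > buf[-1]["pts"] + step:
        # hole between the tail packet and pkt: fill it
        for pts in range(buf[-1]["pts"] + step, pkt["pts"], step):
            buf.append(make_filler(pts))
    buf.append(pkt)
    # evict timed-out packets: farther than max_span from the new tail
    while buf[0]["pts"] < buf[-1]["pts"] - max_span:
        buf.popleft()
```

The same routine serves both streams; only `make_filler` differs between the audio buffer and the video buffer.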
In some possible embodiments, before the processor 810 executes inserting the audio packet into the audio buffer and the video packet into the video buffer according to the first preset rule, the method further comprises:
and inserting invalid packets corresponding to the preset buffer duration in the audio buffer.
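The pre-insertion of invalid packets can be read as a start-up pre-fill, sketched below; the packet representation and the millisecond units are assumptions for illustration only.

```python
def prefill_audio_buffer(buf, preset_ms, packet_ms):
    """Pre-fill the audio buffer with invalid placeholder packets that
    together cover the preset buffer duration; real packets then queue
    up behind them, so valid data is played only after the buffer has
    had that long to fill."""
    for _ in range(preset_ms // packet_ms):
        buf.append({"valid": False, "placeholder": True})

buf = []
prefill_audio_buffer(buf, preset_ms=200, packet_ms=20)
```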
In some possible embodiments, the processor 810 is specifically configured to, when executing the fetching of the target audio packet from the audio buffer according to the second preset rule:
and taking out the target audio packet from the audio buffer at regular intervals based on the duration of the audio packet in the audio buffer.
In some possible embodiments, the processor 810 is configured to, when executing the process of periodically fetching the target audio packet from the audio buffer at intervals based on the duration of the audio packet in the audio buffer, execute the process specifically:
and judging whether the packet taking time is reached or not at regular intervals.
If yes, judging whether the first packet of the buffer queue head in the audio buffer is valid or not under the condition that the buffer queue in the audio buffer is not empty.
If yes, judging whether the last fetched packet in the audio buffer is valid.
If not, taking out the first packet of the buffer queue head in the audio buffer area as the target audio packet under the condition that the buffer queue is full.
In some possible embodiments, after the processor 810 determines at regular intervals whether the packet fetching time has been reached, the processor 810 is further configured to perform:
and filling the buffer queue with the mute packet when the buffer queue in the audio buffer is empty.
In some possible embodiments, after the processor 810 performs determining whether the first packet of the buffer queue head in the audio buffer is valid, the method further includes:
if not, the first packet of the buffer queue head in the audio buffer is taken out.
In some possible embodiments, after the processor 810 performs the determining whether the packet fetched last in the audio buffer is valid, the method further comprises:
if not, the buffer queue is filled with the mute packet when the buffer queue is not filled.
In some possible embodiments, after the processor 810 performs the determining whether the packet fetched last in the audio buffer is valid, the method further comprises:
if yes, judging whether the first packet of the buffer queue head in the audio buffer area is continuous with the packet fetched from the last one in the audio buffer area.
If yes, taking the first packet of the buffer queue head in the audio buffer area as a target audio packet.
If not, filling mute packets of the corresponding duration before the first packet at the head of the buffer queue in the audio buffer.
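One tick of this audio-fetch decision tree can be sketched as follows. The dict packet representation, the `FULL` threshold, and returning a mute packet instead of mutating the queue when padding are all illustrative assumptions, not the patented implementation.

```python
FULL = 5  # assumed "buffer queue is full" threshold, in packets

def fetch_audio_packet(queue, last, step, mute):
    """One packet-fetch tick over the audio buffer's head queue (a list
    of dict packets with 'pts' and 'valid', oldest first). `last` is
    the packet fetched on the previous tick, or None. Returns the
    packet to hand to the audio player."""
    if not queue:
        return mute()             # empty queue: pad with silence
    head = queue[0]
    if not head.get("valid", True):
        return queue.pop(0)       # invalid head packet: just take it out
    if last is None or not last.get("valid", True):
        if len(queue) >= FULL:
            return queue.pop(0)   # queue is full: start playing from head
        return mute()             # not full yet: keep padding
    if head["pts"] == last["pts"] + step:
        return queue.pop(0)       # continuous with the last fetch
    return mute()                 # gap: pad silence before playing head
```

Calling this at a fixed interval reproduces the equally spaced packet-fetch timing described above.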
In some possible embodiments, the processor 810 is specifically configured to, when executing the fetching of the corresponding target video packet from the video buffer based on the absolute timestamp of the target audio packet:
and judging whether the difference between the absolute time stamp corresponding to the first packet in the video buffer area and the absolute time stamp corresponding to the target audio packet is smaller than a threshold value.
If yes, taking the first packet in the video buffer area as a target video packet out of the video buffer area.
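The video-side fetch reduces to a single threshold comparison on absolute timestamps, sketched below; the 40 ms window and the dict packet representation are assumptions, since the specification leaves the threshold value open.

```python
THRESHOLD = 0.040  # assumed sync window, in seconds

def fetch_video_for_audio(video_buf, audio_abs_ts):
    """Pop and return the head video packet when its absolute timestamp
    is within THRESHOLD of the target audio packet's absolute timestamp;
    otherwise leave the buffer untouched and return None."""
    if video_buf and abs(video_buf[0]["abs_ts"] - audio_abs_ts) < THRESHOLD:
        return video_buf.pop(0)
    return None
```

Because both timestamps were rebased to the same absolute clock earlier, this one comparison is what keeps the two streams in sync.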
The present description also provides a computer-readable storage medium having instructions stored therein, which when executed on a computer or processor, cause the computer or processor to perform one or more steps of the above embodiments. The above-mentioned constituent modules of the audio-video synchronization apparatus may be stored in the above-mentioned computer-readable storage medium if implemented in the form of software functional units and sold or used as independent products.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product described above includes one or more computer instructions. When the computer program instructions described above are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present specification are all or partially produced. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (Digital Subscriber Line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage media may be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy Disk, a hard Disk, a magnetic tape), an optical medium (e.g., a digital versatile Disk (Digital Versatile Disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
Those skilled in the art will appreciate that implementing all or part of the above-described embodiment methods may be accomplished by way of a computer program, which may be stored in a computer-readable storage medium, instructing relevant hardware, and which, when executed, may comprise the embodiment methods as described above. And the aforementioned storage medium includes: various media capable of storing program code, such as ROM, RAM, magnetic or optical disks. The technical features in the present examples and embodiments may be arbitrarily combined without conflict.
The above-described embodiments are merely preferred embodiments of the present disclosure, and do not limit the scope of the disclosure, and various modifications and improvements made by those skilled in the art to the technical solution of the disclosure should fall within the scope of protection defined by the claims.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims and description may be performed in an order different from that in the embodiments recited in the description and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Claims (20)

1. An audio-video synchronization method, the method comprising:
acquiring audio and video data; the audio and video data comprise an audio packet and a video packet;
updating display time stamps corresponding to the audio packets and the video packets respectively by using absolute time stamps;
inserting the audio packet into an audio buffer area and inserting the video packet into a video buffer area according to a first preset rule;
taking out a target audio packet from the audio buffer area according to a second preset rule;
and taking out the corresponding target video packet from the video buffer zone based on the absolute timestamp of the target audio packet.
2. The method of claim 1, the method further comprising, prior to updating the respective display timestamps of the audio and video packets with absolute timestamps:
acquiring an SR packet corresponding to the audio and video data;
the updating the display time stamp corresponding to each of the audio packet and the video packet by using the absolute time stamp comprises the following steps:
and updating display time stamps corresponding to the audio packet and the video packet respectively by using absolute time stamps based on the SR packet.
3. The method of claim 2, the SR packet comprising a first absolute timestamp and a first display timestamp corresponding to audio, and a second absolute timestamp and a second display timestamp corresponding to video;
The updating the display time stamp corresponding to each of the audio packet and the video packet by using the absolute time stamp based on the SR packet comprises:
determining a first correspondence between an absolute timestamp corresponding to the audio packet and a display timestamp corresponding to the audio packet based on the first absolute timestamp and the first display timestamp corresponding to the audio, and determining a second correspondence between an absolute timestamp corresponding to the video packet and a display timestamp corresponding to the video packet based on the second absolute timestamp and the second display timestamp corresponding to the video;
and updating the display time stamp corresponding to the audio packet by using the absolute time stamp corresponding to the audio packet based on the first corresponding relation, and updating the display time stamp corresponding to the video packet by using the absolute time stamp corresponding to the video packet based on the second corresponding relation.
4. The method of claim 1, wherein inserting the audio packets into an audio buffer and inserting the video packets into a video buffer according to a first predetermined rule comprises:
inserting the data packet into the tail of the buffer zone corresponding to the data packet when the display time stamp of the data packet is larger than the display time stamp of the tail packet in the buffer zone corresponding to the data packet and no hole exists between the display time stamp of the data packet and the display time stamp of the tail packet;
The data packet is the audio packet, and the buffer area corresponding to the data packet is an audio buffer area; or (b)
The data packet is the video packet, and the buffer area corresponding to the data packet is a video buffer area.
5. The method of claim 4, wherein before inserting the data packet into the buffer corresponding to the data packet, the method further comprises:
judging whether the display time stamp of the data packet is smaller than the display time stamp of the tail packet in the buffer area corresponding to the data packet;
if yes, deleting the data packet.
6. The method of claim 5, wherein after determining whether the display timestamp of the data packet is less than the display timestamp of the tail packet in the buffer corresponding to the data packet, the method further comprises:
if not, judging whether a hole exists between the display time stamp of the data packet and the display time stamp of the tail packet in the buffer area corresponding to the data packet;
if yes, filling a corresponding target data packet in a buffer area corresponding to the data packet based on the hole.
7. The method of claim 6, wherein when the data packet is an audio packet, the target data packet is a mute packet;
when the data packet is a video packet, the target data packet is a tail packet in the video buffer zone.
8. The method of claim 6, wherein after the determining whether a hole exists between the display timestamp of the data packet and the display timestamp of the tail packet in the buffer corresponding to the data packet, the method further comprises:
if not, removing any timed-out packet from the buffer zone corresponding to the data packet; a timed-out packet is a packet whose time interval to the tail packet of the corresponding buffer zone is longer than the preset buffer duration of that buffer zone.
9. The method of claim 1, wherein before the inserting the audio packet into an audio buffer and the inserting the video packet into a video buffer according to the first preset rule, the method further comprises:
inserting, into the audio buffer, invalid packets corresponding to a preset buffer duration.
10. The method of any of claims 1-9, the retrieving the target audio packet from the audio buffer according to a second preset rule, comprising:
and taking out the target audio packet from the audio buffer at regular intervals based on the duration of the audio packet in the audio buffer.
11. The method of claim 10, the periodically retrieving a target audio packet from the audio buffer at regular intervals based on a duration of an audio packet in the audio buffer, comprising:
Judging whether the packet taking time is reached at regular intervals;
if yes, judging whether a first packet of the first buffer queue in the audio buffer is valid or not under the condition that the buffer queue in the audio buffer is not empty;
if yes, judging whether the last fetched packet in the audio buffer area is valid or not;
if not, taking out the first packet of the buffer queue head in the audio buffer area as a target audio packet under the condition that the buffer queue is filled.
12. The method of claim 11, wherein after the judging at regular intervals whether the packet fetching time is reached, the method further comprises:
and filling the buffer queue with the mute packet when the buffer queue in the audio buffer is empty.
13. The method of claim 11, after said determining whether the first packet of the buffer queue head in the audio buffer is valid, the method further comprising:
and if not, taking out the first packet of the buffer queue head in the audio buffer area.
14. The method of claim 11, after said determining whether a last fetched packet in the audio buffer is valid, the method further comprising:
if not, the buffer queue is filled with the mute packet under the condition that the buffer queue is not filled.
15. The method of claim 11, after said determining whether a last fetched packet in the audio buffer is valid, the method further comprising:
if yes, judging whether the first packet at the head of the buffer queue in the audio buffer is continuous with the packet last fetched from the audio buffer;
if yes, taking out the first packet at the head of the buffer queue in the audio buffer as the target audio packet;
if not, filling mute packets of the corresponding duration before the first packet at the head of the buffer queue in the audio buffer.
16. The method of claim 1, the retrieving the corresponding target video packet from the video buffer based on the absolute timestamp of the target audio packet, comprising:
judging whether the difference between the absolute time stamp corresponding to the first packet in the video buffer area and the absolute time stamp corresponding to the target audio packet is smaller than a threshold value;
and if so, taking the first packet in the video buffer area out of the video buffer area as a target video packet.
17. An audio video synchronization device, the device comprising:
the first acquisition module is used for acquiring audio and video data; the audio and video data comprise an audio packet and a video packet;
The updating module is used for updating the display time stamp corresponding to each of the audio packet and the video packet by using the absolute time stamp;
the first inserting module is used for inserting the audio packet into the audio buffer area and inserting the video packet into the video buffer area according to a first preset rule;
the first packet taking module is used for taking out a target audio packet from the audio buffer area according to a second preset rule;
and the second packet fetching module is used for fetching the corresponding target video packet from the video buffer zone based on the absolute timestamp of the target audio packet.
18. An electronic device, comprising: a processor and a memory;
the processor is connected with the memory;
the memory is used for storing executable program codes;
the processor runs a program corresponding to executable program code stored in the memory by reading the executable program code for performing the method according to any one of claims 1-16.
19. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any of claims 1-16.
20. A computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to perform the audio video synchronization method of any of claims 1-16.
CN202310446047.XA 2023-04-19 2023-04-19 Audio/video synchronization method, device, electronic equipment, medium and program product Pending CN116546251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310446047.XA CN116546251A (en) 2023-04-19 2023-04-19 Audio/video synchronization method, device, electronic equipment, medium and program product


Publications (1)

Publication Number Publication Date
CN116546251A true CN116546251A (en) 2023-08-04

Family

ID=87455275



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination