WO2023236767A1 - Audio and video processing method and apparatus, and storage medium

Info

Publication number: WO2023236767A1
Application number: PCT/CN2023/095554
Authority: WO
Inventors: 郑万鹏
Original assignee: 中兴通讯股份有限公司
Priority date: 2022-06-06
Filing date: 2023-05-22
Publication date: 2023-12-14
Also published as: CN117241080A

Abstract

Provided in the present application are an audio and video processing method and apparatus, and a storage medium. The audio and video processing method comprises: acquiring a video data frame and an audio data frame (S100); determining a video serial number jump value according to a timestamp serial number of the currently received video data frame and timestamp serial numbers of two previously received video data frames (S200); then, determining an expected video playback time by using the video serial number jump value (S300); on the basis of the same principle, determining an audio serial number jump value (S400), and determining an expected audio playback time (S500); and finally, completing audio and video synchronization by using the correlation between the expected audio playback time and the expected video playback time (S600).

Description

Audio and video processing method and device and storage medium

Cross-references to related applications

This application is filed based on a Chinese patent application with application number 202210631499.0 and a filing date of June 6, 2022, and claims the priority of the Chinese patent application. The entire content of the Chinese patent application is hereby incorporated into this application as a reference.

Technical field

Embodiments of the present application relate to, but are not limited to, the field of communication technology, and in particular, to an audio and video processing method, device, and storage medium thereof.

Background

Many audio and video synchronization techniques exist, but they generally synchronize the video stream and the audio stream by dropping frames or repeating frames. In practice, the following scenario arises: when a camera is connected over another network, the RTP (Real-time Transport Protocol) timestamp of the camera on one side may change at some moment because of a restart or network jitter, even though the picture or sound remains continuous in time with what preceded the jump. If an existing method is still used for audio and video synchronization in this scenario, frames will be repeated or dropped for a long time, seriously degrading the user's viewing experience.

Summary of the invention

Embodiments of the present application provide an audio and video processing method, device, and storage medium.

In a first aspect, embodiments of the present application provide an audio and video processing method. The method includes: obtaining a video data frame and an audio data frame; determining a video sequence number jump value according to the timestamp sequence number of the video data frame received this time and the timestamp sequence numbers of the two previously received video data frames; determining an expected video playback time according to the video sequence number jump value, where the expected video playback time represents the playback time corresponding to the video data frame; determining an audio sequence number jump value according to the timestamp sequence number of the audio data frame received this time and the timestamp sequence numbers of the two previously received audio data frames; determining an expected audio playback time according to the audio sequence number jump value, where the expected audio playback time represents the playback time corresponding to the audio data frame and is consistent with the expected video playback time; and synchronizing the audio data frame and the video data frame according to the expected audio playback time and the expected video playback time.

In a second aspect, embodiments of the present application also provide an audio and video processing device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above audio and video processing method when executing the computer program.

In a third aspect, embodiments of the present application also provide a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to execute the above audio and video processing method.

Description of the drawings

Figure 1 is a flow chart of an audio and video processing method provided by an embodiment of the present application;

Figure 2 is a flow chart for determining the jump value of a video sequence number provided by an embodiment of the present application;

Figure 3 is a flow chart for determining the audio sequence number jump value provided by an embodiment of the present application.

Detailed description

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application will be described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not used to limit the present application.

It should be noted that although a logical sequence is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that in the flowchart. The terms "first", "second", etc. in the description, claims, and above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

Embodiments of the present application provide an audio and video processing method, device, and storage medium. When a video data frame is received, its timestamp sequence number can be used to determine the video sequence number jump value, which reveals whether the video is continuous or a jump has occurred. The video sequence number jump value then serves as the basis for determining the expected video playback time of the video data frame received this time. This avoids long runs of repeated or dropped frames during playback and allows the video data frame to be played accurately at its expected video playback time. Similarly, when an audio data frame is received, its timestamp sequence number can be used to determine the audio sequence number jump value, which reveals whether the audio is continuous or a jump has occurred. The audio sequence number jump value then serves as the basis for determining the expected audio playback time of the audio data frame received this time, which avoids long runs of repeated or dropped frames and allows the audio data frame to be played accurately at its expected audio playback time. Finally, because the expected audio playback time and the expected video playback time are consistent, the received audio data frames and video data frames correspond accurately in time, achieving synchronization of the audio data and the video data.

As shown in Figure 1, Figure 1 is a flow chart of an audio and video processing method provided by an embodiment of the present application.

As shown in Figure 1, the audio and video processing method includes step S100, step S200, step S300, step S400, step S500 and step S600.

Step S100: Obtain video data frames and audio data frames;

Step S200: Determine the video sequence number jump value based on the timestamp sequence number of the video data frame received this time and the timestamp sequence number of the video data frames received twice before;

Step S300: Determine the expected playback time of the video according to the jump value of the video sequence number. The expected playback time of the video represents the playback time corresponding to the video data frame;

Step S400: Determine the audio sequence number jump value based on the timestamp sequence number of the audio data frame received this time and the timestamp sequence number of the audio data frames received twice before;

Step S500: Determine the expected audio playback time according to the audio sequence number jump value. The expected audio playback time represents the playback time corresponding to the audio data frame, where the expected audio playback time is consistent with the expected video playback time;

Step S600: Synchronize the audio data frame and the video data frame according to the expected audio playback time and the expected video playback time.

In this embodiment, when a video data frame or audio data frame is received, the timestamp sequence number it carries can be obtained. The timestamp sequence number of the frame received this time and the timestamp sequence numbers of the two previously received video data frames or audio data frames can therefore be used directly to determine the video sequence number jump value or the audio sequence number jump value, and hence whether a timestamp sequence number jump has occurred. The expected video playback time or expected audio playback time is then determined from the corresponding jump value, so that a large change in the timestamp sequence number does not cause prolonged frame dropping or frame filling. At the same time, because the expected video playback time and the expected audio playback time are consistent in time, it is only necessary to ensure that each received video data frame corresponds to its expected video playback time and each received audio data frame corresponds to its expected audio playback time for the audio and video to be synchronized.

In some embodiments, in streaming media transmission scenarios such as real-time conferences or live broadcasts, the received video data frames carry timestamp sequence numbers. Because the sending end sends video data frames continuously, the timestamp sequence numbers of the video data frames should normally be continuous; if they are discontinuous, there is a problem with the video stream transmission. A timestamp sequence number jump, in particular, causes a large change in the timestamp sequence number. The timestamp sequence numbers of three consecutively received video data frames are therefore used to calculate the video sequence number jump value, and the jump value is analyzed to determine whether a jump has occurred. In the jump state, the video sequence number jump value is then used to determine the expected video playback time corresponding to the video data frame, ensuring that the video data frame can be played on time.

In the same way, the audio data frames received by the audio and video data processing end also carry timestamp sequence numbers. Because the sending end sends audio data frames continuously, the timestamp sequence numbers of the audio data frames should normally be continuous as well; if they are discontinuous, there is a problem with the audio stream transmission. A timestamp sequence number jump, in particular, causes a large change in the timestamp sequence number. The timestamp sequence numbers of three consecutively received audio data frames are therefore used to calculate the audio sequence number jump value, and the jump value is analyzed to determine whether a jump has occurred. In the jump state, the audio sequence number jump value is then used to determine the expected audio playback time corresponding to the audio data frame, ensuring that the audio data frame can be played on time.

To better explain the synchronization principle of the audio and video processing method of this application, a brief description follows. Each received video data frame corresponds to an expected video playback time, and after processing it is played on time at that expected video playback time. This is equivalent to placing different video data frames at different moments on a timeline, so playback only needs to follow that placement. In the same way, different audio data frames can be placed at different moments on a timeline. It is then only necessary to ensure that the expected audio playback time corresponds to the expected video playback time for the two to correspond on the timeline. Therefore, once the expected audio playback time of the first audio data frame and the expected video playback time of the first video data frame have been determined, the method only needs to set the corresponding expected playback time for each subsequent audio data frame or video data frame as described in this application.

As shown in Figure 2, which is a flow chart for determining the video sequence number jump value provided by an embodiment of the present application, step S200 includes but is not limited to step S210, step S220 and step S230.

Step S210: Determine the first sequence number difference based on the timestamp sequence numbers of the two previously received video data frames;

Step S220: Determine the second sequence number difference based on the timestamp sequence number of the video data frame received this time and the timestamp sequence number of the video data frame received last time;

Step S230: Calculate the video sequence number jump value based on the second sequence number difference and the first sequence number difference.

The two preceding timestamp sequence numbers are introduced here to jointly judge the degree of the timestamp sequence number jump. The timestamp sequence numbers of three received video data frames are used to determine the video sequence number jump value, that is, the ratio between the first sequence number difference and the second sequence number difference. This value directly and effectively indicates the degree of the jump and facilitates the subsequent judgment of whether a timestamp sequence number jump has occurred.

In some embodiments, taking the timestamp sequence numbers of three video data frames as x_{k-1}, x_k, and x_{k+1} as an example, the constraint formula for calculating the video sequence number jump value u can refer to the following formula:

In some embodiments, step S300 includes but is not limited to the following step: when the video sequence number jump value is greater than a preset video sequence number jump threshold, adding a first time interval to the expected video playback time corresponding to the last received video data frame to determine the expected video playback time, thereby obtaining a first expected video playback time, where the first time interval represents the time interval between two successively received video data frames.

Because the timestamp sequence number changes by a large amount during a jump, the video sequence number jump threshold also needs to be set large, so that a jump can be distinguished from the short-term loss of timestamp sequence numbers caused by packet loss. After the video sequence number jump value of the video data frame received this time is determined, it is compared with the video sequence number jump threshold. When the jump value is greater than the threshold, a timestamp sequence number jump is determined to have occurred. In that case, using the timestamp sequence number directly to determine the expected video playback time would easily make the expected video playback time discontinuous. In some embodiments, considering that video data frames are still sent normally when the timestamp sequence number jumps, the expected video playback time of the video data frame received this time can be determined from the first time interval: once a jump is detected, the first time interval is simply added to the expected video playback time corresponding to the last received video data frame. This expected video playback time is also recorded as the first expected video playback time, which is subsequently used to update the initial video timestamp sequence number and the initial expected video playback time.
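The threshold comparison and the jump branch can be sketched as follows. The threshold value of 50, the millisecond unit, and both function names are hypothetical and chosen only to illustrate why a large threshold separates a jump from ordinary packet loss.

```python
JUMP_THRESHOLD = 50.0  # hypothetical value; the source only says it should be large


def is_timestamp_jump(jump_value: float, threshold: float = JUMP_THRESHOLD) -> bool:
    """A jump is declared only when the jump value exceeds the threshold."""
    return jump_value > threshold


def expected_time_after_jump(last_expected_ms: int, frame_interval_ms: int) -> int:
    """On a jump, ignore the new timestamp and advance by one frame interval."""
    return last_expected_ms + frame_interval_ms


# Two lost frames make the gap three times the usual one: not a jump.
loss_case = is_timestamp_jump(3.0)
# A restart makes the gap tens of thousands of times larger: a jump.
restart_case = is_timestamp_jump(90000.0)
```

In this sketch `loss_case` is False and `restart_case` is True, so only the restart takes the branch that adds one frame interval to the previous expected playback time.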

In some embodiments, the audio and video processing method further includes: when the video sequence number jump value is less than the video sequence number jump threshold, determining the expected video playback time according to the initial video timestamp sequence number, the initial expected video playback time, and the timestamp sequence number of the video data frame, thereby obtaining a second expected video playback time, where the initial video timestamp sequence number is obtained from the timestamp sequence number of a received video data frame, and the initial expected video playback time is obtained from the expected video playback time of that video data frame.

When no timestamp sequence number jump occurs, the timestamp sequence number does not change much. During normal playback the timestamp sequence numbers are continuous; when network fluctuations cause packet loss and video data frames go missing, only a few timestamp sequence numbers are missing. In these cases the timestamp sequence number of the video data frame received this time can be used directly to determine the expected video playback time. In some embodiments, if the video sequence number jump value corresponding to the video data frame received this time is less than the preset video sequence number jump threshold, the frame was received either in the normal playback state or in a packet loss state, and no jump has occurred. The expected video playback time of this frame can then be determined quickly from the initial video timestamp sequence number, the initial expected video playback time, and the timestamp sequence number of the video data frame received this time. In some embodiments, the playback time interval between every two video data frames is fixed, so it is only necessary to determine the difference between the timestamp sequence number of the video data frame received this time and the initial video timestamp sequence number, and then, starting from the initial expected video playback time, use the playback time interval to determine the expected video playback time of this video data frame.

In some embodiments, the difference between the timestamp sequence number of the current video data frame and the initial video timestamp sequence number gives the number of intervening timestamp sequence numbers. Multiplying this difference by the playback time interval between every two video data frames gives the video playback time difference relative to the initial expected video playback time. Finally, adding this video playback time difference to the initial expected video playback time gives the expected video playback time of the video data frame received this time.
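The subtract, multiply, and add sequence described above can be written as a single expression. This is a minimal sketch; the function name and the use of integer milliseconds are assumptions.

```python
def expected_time_no_jump(x_start: int, t_start_ms: int,
                          x_cur: int, frame_interval_ms: int) -> int:
    """Expected playback time when no timestamp jump has occurred.

    x_start:            initial timestamp sequence number
    t_start_ms:         initial expected playback time
    x_cur:              timestamp sequence number of the current frame
    frame_interval_ms:  fixed playback interval between two frames
    """
    interval_count = x_cur - x_start                 # intervening sequence numbers
    time_diff_ms = interval_count * frame_interval_ms  # playback time difference
    return t_start_ms + time_diff_ms
```

For example, with an initial sequence number of 1000, an initial time of 0 ms and a 40 ms interval, frame 1003 maps to 120 ms. The result is the same whether or not frames 1001 and 1002 were lost, which is why packet loss does not disturb the timeline.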

In some embodiments, the initial video timestamp sequence number and the initial expected video playback time are obtained as follows: when the video data frame is the first video data frame received, the timestamp sequence number of the first video data frame is determined as the initial video timestamp sequence number, and the expected video playback time corresponding to the first video data frame is determined as the initial expected video playback time.

Initialization begins after the first video data frame is received. The timestamp sequence number of the first video data frame is used to initialize the initial video timestamp sequence number, and the expected video playback time of the first video data frame is used to initialize the initial expected video playback time.

In some embodiments, the audio and video processing method further includes:

When the video sequence number jump value is greater than the video sequence number jump threshold, the initial video timestamp sequence number is updated according to the timestamp sequence number of the video data frame received this time;

The initial video expected play time is updated according to the first video expected play time.

When the timestamp sequence number of a video data frame jumps, the timestamp sequence numbers of all subsequently received video data frames continue from the value after the jump. The initial video timestamp sequence number and initial expected video playback time corresponding to the first video data frame can therefore no longer be used to calculate subsequent expected video playback times, and after each jump the previous initial values cannot be used again. In some embodiments, when a timestamp sequence number jump occurs, the timestamp sequence number of the video data frame received this time and the first expected video playback time are determined first, and these two values are then used directly to update the initial video timestamp sequence number and the initial expected video playback time. Subsequent expected video playback times are calculated from the updated initial values, which ensures accurate and smooth playback of all video data frames on the timeline.
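The update of the two initial values after a jump can be sketched as re-basing an anchor. The `Anchor` type and the function `rebase_on_jump` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Anchor:
    start_seq: int      # initial timestamp sequence number
    start_time_ms: int  # initial expected playback time


def rebase_on_jump(last_expected_ms: int, frame_interval_ms: int,
                   x_cur: int) -> Tuple[Anchor, int]:
    """Compute the first expected playback time and the updated anchor."""
    # Continue the timeline by one frame interval instead of trusting x_cur.
    first_expected_ms = last_expected_ms + frame_interval_ms
    # The frame that jumped becomes the new reference point.
    return Anchor(start_seq=x_cur, start_time_ms=first_expected_ms), first_expected_ms


# The stream jumps to 500000 after a frame that was scheduled at 120 ms.
anchor, jumped_ms = rebase_on_jump(120, 40, 500000)
# The next frame, 500001, is placed relative to the new anchor.
next_ms = anchor.start_time_ms + (500001 - anchor.start_seq) * 40
```

Here the frame that jumped is scheduled at 160 ms and the following frame at 200 ms, so the timeline continues at the regular 40 ms spacing despite the discontinuity in sequence numbers.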

In some embodiments, when the video data frame is the first video data frame received, the audio and video processing method further includes: when the time of receiving the first video data frame is earlier than or equal to the time of receiving the first audio data frame, setting the expected video playback time to a preset time value.

When the initial video timestamp sequence number and the initial expected video playback time are initialized, the initial video timestamp sequence number can be obtained directly from the first video data frame, but the initial expected video playback time cannot. A preset time value can therefore be defined as the starting time and used as the initial expected video playback time. In some embodiments, the preset time value can be set directly to 0 seconds.

In some embodiments, when the video data frame is the first video data frame received, the audio and video processing method further includes: when the time of receiving the first video data frame is later than the time of receiving the first audio data frame, determining the time interval between receiving the first video data frame and receiving the first audio data frame as the expected video playback time.

In actual operation, for various reasons there may be a time interval between the first video data frame and the first audio data frame when they are first sent. To preserve the correspondence between audio data frames and video data frames in this case, the two must keep a fixed time offset from the start. Therefore, when the first video data frame arrives later than the first audio data frame, its expected video playback time cannot simply be set to 0 seconds; a delay equal to that time interval must be maintained.
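The two first-frame cases can be sketched as one function. The function name, the millisecond unit, and the treatment of an audio stream that has not yet arrived (handled like the "video first" case) are assumptions; the preset value of 0 follows the embodiment above.

```python
from typing import Optional

PRESET_MS = 0  # preset time value; 0 seconds in the embodiment above


def initial_video_expected_time(video_first_arrival_ms: int,
                                audio_first_arrival_ms: Optional[int]) -> int:
    """Expected playback time of the first video data frame."""
    if audio_first_arrival_ms is None:
        # Assumption: no audio seen yet, so the video frame is the earlier one.
        return PRESET_MS
    if video_first_arrival_ms <= audio_first_arrival_ms:
        # Video arrived first or at the same time: start at the preset value.
        return PRESET_MS
    # Video arrived later: keep the arrival gap as a fixed offset.
    return video_first_arrival_ms - audio_first_arrival_ms
```

If the first audio frame arrives at 1000 ms and the first video frame at 1250 ms, the video timeline starts at 250 ms, so the two streams keep that offset for every later frame.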

In order to better explain the constraint relationship between the expected playback time of the video, the timestamp number, the initial video timestamp number, and the expected playback time of the initial video, you can refer to the following constraint formula:

In the formula, x_start is the initial video timestamp sequence number, x_k is the timestamp sequence number of the k-th video data frame, t_x is the time interval between two video data frames, abs(x_start) is the initial expected video playback time, 0 is 0 seconds, and Δ is the time interval between the first video data frame and the first audio data frame.

When the video data frame is the first frame, equation (1) is used to calculate the expected video playback time, which also determines the initial video timestamp sequence number and the initial expected video playback time. When no jump occurs in the video data frame transmission, equation (2) is used to determine the expected video playback time. When a jump occurs, equation (3) is used to determine the expected video playback time, and the initial video timestamp sequence number and the initial expected video playback time are updated at the same time.
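The constraint formula is not reproduced in this text. One reading that is consistent with the symbol definitions and with the three cases described above is sketched below as an assumption, not as the original formula. The notation abs(x_k) for the expected playback time of the k-th frame extends the defined abs(x_start) and is hypothetical.

```latex
\begin{align}
\operatorname{abs}(x_{start}) &=
  \begin{cases}
    0, & \text{first video frame received no later than the first audio frame} \\
    \Delta, & \text{first video frame received later than the first audio frame}
  \end{cases} \tag{1} \\
\operatorname{abs}(x_k) &= \operatorname{abs}(x_{start}) + (x_k - x_{start})\, t_x \tag{2} \\
\operatorname{abs}(x_k) &= \operatorname{abs}(x_{k-1}) + t_x,
  \qquad x_{start} \leftarrow x_k,\quad
  \operatorname{abs}(x_{start}) \leftarrow \operatorname{abs}(x_k) \tag{3}
\end{align}
```

As a worked instance of (2): with x_start = 1000, abs(x_start) = 0 and t_x = 40 ms, the frame x_k = 1003 gets abs(x_k) = 0 + 3 × 40 = 120 ms.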

In some embodiments, the audio and video processing method also includes:

When no video data frame is received beyond the preset video frame filling time threshold, copy the last video data frame received in sequence;

The expected video playback time of the copied video data frame is determined by adding the first time interval to the expected video playback time corresponding to the last received video data frame, where the first time interval represents the time interval between two successively received video data frames.

During video data transmission, packet loss may leave a period with no video data frames. Waiting to determine the video sequence number jump value before filling frames would introduce a delay. In this case the video frame filling time threshold can be used directly: the previous frame is copied to make up for the missing video data frame. Each time the video frame filling time threshold elapses without a video data frame being received, the previous frame is copied again. Once video data frames are received normally again, the expected video playback time is determined from the received video data frame.
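The frame filling rule can be sketched as follows. The function name, the millisecond unit, and the choice to produce one copy per elapsed threshold period are assumptions; a real receiver would typically drive this from a timer rather than from a known silence duration.

```python
from typing import List, Tuple


def fill_missing_frames(last_frame: bytes, last_expected_ms: int,
                        silence_ms: int, fill_threshold_ms: int,
                        frame_interval_ms: int) -> List[Tuple[bytes, int]]:
    """Return (frame, expected_time) copies for a period with no new frames."""
    copies: List[Tuple[bytes, int]] = []
    # One copy for every full threshold period that passed without a frame.
    copy_count = silence_ms // fill_threshold_ms
    for i in range(1, copy_count + 1):
        # Each copy is scheduled one frame interval after the previous one.
        copies.append((last_frame, last_expected_ms + i * frame_interval_ms))
    return copies
```

With a last frame scheduled at 120 ms, a 40 ms threshold and a 40 ms interval, 100 ms of silence produces two copies scheduled at 160 ms and 200 ms, while 30 ms of silence produces none.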

As shown in Figure 3, which is a flow chart for determining the audio sequence number jump value provided by an embodiment of the present application, step S400 includes but is not limited to step S410, step S420 and step S430.

Step S410: Determine the third sequence number difference based on the timestamp sequence numbers of the two previously received audio data frames;

Step S420: Determine the fourth sequence number difference based on the timestamp sequence number of the audio data frame received this time and the timestamp sequence number of the audio data frame received last time;

Step S430: Calculate the audio sequence number jump value based on the third sequence number difference and the fourth sequence number difference.

The two preceding timestamp sequence numbers are introduced here to jointly judge the degree of the timestamp sequence number jump. The timestamp sequence numbers of three received audio data frames are used to determine the audio sequence number jump value, that is, the ratio between the third sequence number difference and the fourth sequence number difference. This value directly and effectively indicates the degree of the jump and facilitates the subsequent judgment of whether a timestamp sequence number jump has occurred. The constraint formula for calculating the audio sequence number jump value can refer to the constraint formula for calculating the video sequence number jump value.

In some embodiments, step S500 includes but is not limited to the following step: when the audio sequence number jump value is greater than a preset audio sequence number jump threshold, adding a second time interval to the expected audio playback time corresponding to the last received audio data frame to determine the expected audio playback time, thereby obtaining a first expected audio playback time, where the second time interval represents the time interval between two successively received audio data frames.

Because the timestamp sequence number changes by a large amount during a jump, the audio sequence number jump threshold also needs to be set large, so that a jump can be distinguished from the short-term loss of timestamp sequence numbers caused by packet loss. After the audio sequence number jump value of the audio data frame received this time is determined, it is compared with the audio sequence number jump threshold. When the jump value is greater than the threshold, a timestamp sequence number jump is determined to have occurred. In that case, using the timestamp sequence number directly to determine the expected audio playback time would easily make the expected audio playback time discontinuous. In some embodiments, considering that audio data frames are still sent normally when the timestamp sequence number jumps, the expected audio playback time of the audio data frame received this time can be determined from the second time interval: once a jump is detected, the second time interval is simply added to the expected audio playback time corresponding to the last received audio data frame. This expected audio playback time is also recorded as the first expected audio playback time, which is subsequently used to update the initial audio timestamp sequence number and the initial expected audio playback time.

In some embodiments, the audio and video processing method further includes: when the audio sequence number jump value is less than the audio sequence number jump threshold, determining the expected audio playback time according to the initial audio timestamp sequence number, the initial expected audio playback time, and the timestamp sequence number of the audio data frame, thereby obtaining a second expected audio playback time, where the initial audio timestamp sequence number is obtained from the timestamp sequence number of a received audio data frame, and the initial expected audio playback time is obtained from the expected audio playback time of that audio data frame.

When no timestamp sequence number jump occurs, the timestamp sequence number does not change much. During normal playback the timestamp sequence numbers are continuous; when network fluctuations cause packet loss and audio data frames go missing, only a few timestamp sequence numbers are missing. In these cases the timestamp sequence number of the audio data frame received this time can be used directly to determine the expected audio playback time. In some embodiments, if the audio sequence number jump value corresponding to the audio data frame received this time is less than the preset audio sequence number jump threshold, the frame was received either in the normal playback state or in a packet loss state, and no jump has occurred. The expected audio playback time of this frame can then be determined quickly from the initial audio timestamp sequence number, the initial expected audio playback time, and the timestamp sequence number of the audio data frame received this time. In some embodiments, the playback time interval between every two audio data frames is fixed, so it is only necessary to determine the difference between the timestamp sequence number of the audio data frame received this time and the initial audio timestamp sequence number, and then, starting from the initial expected audio playback time, use the playback time interval to determine the expected audio playback time of this audio data frame.

In some embodiments, the difference between the timestamp number of this audio data frame and the initial audio timestamp number gives the number of timestamp numbers in between. Multiplying this difference by the playback time interval between every two adjacent audio data frames gives the audio playback time difference relative to the initial audio expected playback time. Finally, adding the audio playback time difference to the initial audio expected playback time gives the expected audio playback time of the audio data frame received this time.
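The calculation described above can be illustrated with a minimal Python sketch. The function name, parameter names, and the use of seconds are assumptions made for illustration only, and the sketch assumes that the timestamp number increases by one per audio data frame.

```python
def expected_play_time(ts_number, initial_ts_number, initial_play_time, frame_interval):
    """Expected playback time of a frame when no timestamp-number jump occurred.

    All names are hypothetical. frame_interval is the fixed playback time
    interval between two adjacent frames, in the same unit as initial_play_time.
    """
    # Number of frame intervals between this frame and the initial frame.
    interval_count = ts_number - initial_ts_number
    # Playback time difference relative to the initial expected playback time.
    play_time_difference = interval_count * frame_interval
    return initial_play_time + play_time_difference
```

For instance, with an initial timestamp number of 100, an initial expected playback time of 0 seconds, and a 0.02 second frame interval, the frame numbered 105 lands five intervals after the initial frame. A frame lost in between does not shift this result, because the time depends only on the timestamp number difference.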

In some embodiments, the initial audio timestamp number and the initial audio expected playback time are obtained by the following steps: when the audio data frame is the first received audio data frame, the timestamp number of the first audio data frame is determined as the initial audio timestamp number, and the expected audio playback time corresponding to the first audio data frame is determined as the initial audio expected playback time.

Initialization begins once the first audio data frame has been received. The timestamp number of the first audio data frame is used to initialize the initial audio timestamp number, and the expected audio playback time of the first audio data frame is used to initialize the initial audio expected playback time.

In some embodiments, the audio and video processing method further includes:

When the audio sequence number jump value is greater than the audio sequence number jump threshold, the initial audio timestamp sequence number is updated according to the timestamp sequence number of the audio data frame received this time;

The initial audio expected play time is updated according to the first audio expected play time.

When the timestamp number of an audio data frame jumps, the timestamp numbers of all subsequently received audio data frames continue from the timestamp number after the jump. Therefore, the initial audio timestamp number and the initial audio expected playback time corresponding to the first audio data frame can no longer be used for subsequent calculations of the expected audio playback time, and after each jump the previous initial audio timestamp number and initial audio expected playback time cannot be used again. In some embodiments, when a timestamp number jump occurs, the timestamp number of the audio data frame received this time and the first audio expected playback time are determined first. These two values are then used directly to update the initial audio timestamp number and the initial audio expected playback time, and subsequent expected audio playback times are calculated from the updated values, thereby ensuring that the audio data frames are played accurately and smoothly along the whole timeline.
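As a hedged illustration of the two branches, the following Python sketch keeps the initial audio timestamp number and the initial audio expected playback time in a small state object and updates them when a jump is detected. The class and function names are hypothetical, the jump value is assumed to be computed elsewhere, and treating a jump value equal to the threshold as "no jump" is an assumption, since the text only covers the greater-than and less-than cases.

```python
from dataclasses import dataclass


@dataclass
class AudioTimeline:
    """Hypothetical per-stream state; times are in seconds."""
    initial_ts_number: int
    initial_play_time: float
    last_play_time: float


def on_audio_frame(tl, ts_number, jump_value, jump_threshold, frame_interval):
    """Return the expected playback time of the frame received this time."""
    if jump_value > jump_threshold:
        # Jump: continue from the last frame's expected playback time plus one
        # interval (the "first audio expected playback time") ...
        play_time = tl.last_play_time + frame_interval
        # ... and re-base the timeline on the post-jump timestamp number.
        tl.initial_ts_number = ts_number
        tl.initial_play_time = play_time
    else:
        # No jump: offset from the current base (the "second audio expected
        # playback time").
        offset = (ts_number - tl.initial_ts_number) * frame_interval
        play_time = tl.initial_play_time + offset
    tl.last_play_time = play_time
    return play_time
```

Under these assumptions, frames that arrive after a jump are timed relative to the re-based values, so the playback times keep advancing by one interval per frame even though the timestamp numbers restarted from a different value.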

In some embodiments, when the audio data frame is the first received audio data frame, the audio and video processing method further includes: when the time of receiving the first audio data frame is earlier than or equal to the time of receiving the first video data frame, setting the expected audio playback time to a preset time value.

When the initial audio timestamp number and the initial audio expected playback time are initialized, the initial audio timestamp number can be obtained directly from the first audio data frame, but the initial audio expected playback time cannot. In this case, a preset time value can be defined directly as the starting time, and this value is the initial audio expected playback time. In some embodiments, the preset time value can be set directly to 0 seconds.

In some embodiments, when the audio data frame is the first received audio data frame, the audio and video processing method further includes: when the time of receiving the first audio data frame is later than the time of receiving the first video data frame, determining the time interval between receiving the first audio data frame and receiving the first video data frame as the expected audio playback time.

In practice, for various reasons there may be a certain time interval between the first video data frame and the first audio data frame when they are first sent. In this case, to preserve the correspondence between audio data frames and video data frames, the two must keep that fixed time interval from the start. Therefore, when the first audio data frame arrives later than the first video data frame, the expected audio playback time cannot simply be set to 0 seconds; it must be delayed by that time interval.
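The choice of the initial audio expected playback time can be sketched as follows. The function and parameter names are hypothetical, arrival times are assumed to be measured in seconds on a common local clock, and the preset time value defaults to 0 seconds as in the embodiment mentioned above.

```python
def initial_audio_play_time(audio_first_arrival, video_first_arrival, preset_time=0.0):
    """Initial audio expected playback time for the first audio data frame.

    Names are hypothetical; both arrival times share one clock.
    """
    if audio_first_arrival <= video_first_arrival:
        # Audio arrived first (or simultaneously): start at the preset time value.
        return preset_time
    # Audio arrived later: keep the arrival gap as the initial delay.
    return audio_first_arrival - video_first_arrival
```

Applying the mirrored rule to the video stream means that whichever stream starts first is placed at the preset time value and the other stream is offset by the arrival gap.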

In some embodiments, the constraint relationship among the expected audio playback time, the timestamp number of the audio data frame, the initial audio timestamp number, and the initial audio expected playback time is analogous to the constraint relationship among the expected video playback time, the timestamp number of the video data frame, the initial video timestamp number, and the initial video expected playback time.

In some embodiments, the audio and video processing method also includes:

When no audio data frame has been received for longer than the preset audio frame filling time threshold, copy the last received audio data frame;

The expected audio playback time of the copied audio data frame is determined according to a second time interval and the expected audio playback time corresponding to the last received audio data frame, where the second time interval represents the time interval between two receptions of the audio data frame.

During audio data transmission, packet loss may leave a period in which no audio data frame arrives. If the audio sequence number jump value were determined before filling in frames, a certain delay would result. In this case, the audio frame filling time threshold can be used directly: the previous frame is copied to make up for the missing audio data frame. Each time the audio frame filling time threshold elapses without an audio data frame being received, the previous frame is copied again to fill the gap. Once audio data frames are received normally again, the received audio data frames are used to determine the expected audio playback time.
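A minimal sketch of this frame filling behaviour is given below. The function name and parameters are hypothetical, all times are assumed to be integer milliseconds, and the sketch assumes that each copied frame is placed one frame interval (the second time interval) after the previous expected playback time.

```python
def fill_missing_frames(last_payload, last_play_time, elapsed, fill_threshold, frame_interval):
    """Return (expected_play_time, payload) copies covering a gap in reception.

    elapsed is the time since the last audio data frame was received; one copy
    is produced each time another fill_threshold has been exceeded.
    """
    copies = []
    play_time = last_play_time
    waited = fill_threshold
    while waited < elapsed:  # "beyond" the threshold: strictly exceeded
        # Each copy is scheduled one frame interval after the previous frame.
        play_time += frame_interval
        copies.append((play_time, last_payload))
        waited += fill_threshold
    return copies
```

With a 30 ms filling threshold and a 20 ms frame interval, a 70 ms silence exceeds the threshold twice, so two copies of the last frame are scheduled at consecutive intervals.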

To explain the processing flow of the audio and video processing method provided by the embodiments of the present application more clearly, an example is described below.

The audio and video processing method includes the following steps:

Obtain the first audio data frame and the first video data frame, determine the initial video expected playback time and the initial video timestamp sequence number of the video data stream based on the first video data frame, and determine the initial audio expected playback time and the initial audio timestamp sequence number of the audio data stream based on the first audio data frame; at first initialization, the initial video timestamp sequence number and the initial audio timestamp sequence number usually remain consistent, and the time interval between the initial video expected playback time and the initial audio expected playback time is determined based on NTP (Network Time Protocol);

Continuously receive audio data frames and video data frames, and keep a record of the timestamp numbers of the two most recently received audio data frames and of the two most recently received video data frames. After the current video data frame or audio data frame is received, the two recorded timestamp numbers of the corresponding stream and the timestamp number of the current frame are used to determine the video sequence number jump value or the audio sequence number jump value;

By comparing the video sequence number jump value with the preset video sequence number jump threshold or comparing the audio sequence number jump value with the preset audio jump threshold, it can be determined whether a timestamp sequence number jump occurs;

When it is determined that no timestamp sequence number jump has occurred in the video data frame or audio data frame, the expected video playback time is determined directly from the initial video timestamp sequence number, the initial video expected playback time, and the timestamp sequence number of the video data frame, and the expected audio playback time is determined from the initial audio timestamp sequence number, the initial audio expected playback time, and the timestamp sequence number of the audio data frame;

When it is determined that a timestamp sequence number jump has occurred in the video data frame, a first time interval is added to the expected video playback time corresponding to the last received video data frame to determine the expected video playback time after the jump, giving the first video expected playback time; the initial video timestamp sequence number is then updated according to the timestamp sequence number of the video data frame received this time, and the initial video expected playback time is updated according to the first video expected playback time. An audio data frame is processed on the same principle: the expected audio playback time after the jump is obtained, and the initial audio timestamp sequence number and the initial audio expected playback time are updated;

In addition, when it is determined that network packet loss has occurred, the previous video data frame or audio data frame can be copied directly, and the expected video playback time or expected audio playback time of the copy is then set in the normal way, which ensures the integrity and smoothness of video and audio playback.
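The jump detection step in the flow above can be sketched as follows. The exact formula for the sequence number jump value is not given in this passage, so the absolute difference between the two consecutive sequence number differences is used here as an assumption. The class name, method name, and the choice to restart the history after a detected jump are also hypothetical.

```python
from collections import deque


class JumpDetector:
    """Hypothetical detector that keeps the two previous timestamp numbers."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.history = deque(maxlen=2)

    def observe(self, ts_number):
        """Record ts_number and return True if it is treated as a jump."""
        jumped = False
        if len(self.history) == 2:
            # Difference between the two previously received frames.
            first_diff = self.history[1] - self.history[0]
            # Difference between this frame and the last received frame.
            second_diff = ts_number - self.history[1]
            # Assumed jump value: how much the step size changed.
            jumped = abs(second_diff - first_diff) > self.threshold
        if jumped:
            # Assumption: restart the history so the frames that follow the
            # jump are compared only against post-jump timestamp numbers.
            self.history.clear()
        self.history.append(ts_number)
        return jumped
```

Under these assumptions a small gap caused by packet loss, such as receiving 105 after 102, changes the step size only slightly and stays below the threshold, whereas a restart of the numbering produces a large change and is flagged once.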

The audio and video processing method of this application directly uses the expected video playback time and the expected audio playback time to construct a timeline, and determines the corresponding expected video playback time and expected audio playback time for each video data frame and audio data frame, so that each video data frame and audio data frame occupies a unique position on the timeline and the entire media stream can achieve audio and video synchronization simply and clearly.

In addition, an embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to execute the above audio and video processing method. For example, when executed by a processor in the above embodiment of the audio and video processing device, the instructions cause the processor to execute the audio and video processing method in the above embodiments, for example, the method in Figure 1, the method in Figure 2 and the method in Figure 3 described above.

In addition, an embodiment of the present application also provides an audio and video processing device. The audio and video processing device includes: a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the above audio and video processing method is implemented.

The non-transitory software programs and instructions required to implement the audio and video processing method of the above embodiments are stored in the memory. When they are executed by the processor, the audio and video processing method in the above embodiments is executed, for example, the method in Figure 1, the method in Figure 2 and the method in Figure 3 described above.

Embodiments of the present application include: acquiring video data frames and audio data frames; determining the video sequence number jump value based on the timestamp number of the video data frame received this time and the timestamp numbers of the two previously received video data frames; determining the expected video playback time according to the video sequence number jump value, where the expected video playback time represents the playback time corresponding to the video data frame; determining the audio sequence number jump value based on the timestamp number of the audio data frame received this time and the timestamp numbers of the two previously received audio data frames; determining the expected audio playback time according to the audio sequence number jump value, where the expected audio playback time represents the playback time corresponding to the audio data frame and is consistent with the expected video playback time; and synchronizing the audio data frame and the video data frame according to the expected audio playback time and the expected video playback time. When a video data frame is received, its timestamp sequence number is used to determine the video sequence number jump value, which shows whether the video is continuous and whether a jump has occurred. The video sequence number jump value then serves as the basis for determining the expected video playback time of the video data frame received this time, which avoids prolonged repeated frames or dropped frames during playback of video data frames and also allows the video data frames to be played accurately according to the expected video playback time. Similarly, when an audio data frame is received, its timestamp sequence number is used to determine the audio sequence number jump value, which shows whether the audio is continuous and whether a jump has occurred. The audio sequence number jump value then serves as the basis for determining the expected audio playback time of the audio data frame received this time, which avoids prolonged repeated frames or dropped frames during playback of audio data frames and also allows the audio data frames to be played accurately according to the expected audio playback time. Finally, because the expected audio playback time and the expected video playback time are consistent, the received audio data frames and video data frames correspond accurately in time, achieving synchronization of audio data and video data.

Those of ordinary skill in the art can understand that all or some steps and systems in the methods disclosed above can be implemented as software, firmware, hardware, and appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disk (DVD) or other optical disk storage, magnetic cassettes, tapes, disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Additionally, it is known to those of ordinary skill in the art that communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Claims

An audio and video processing method, including:

Acquire a video data frame and an audio data frame;

Determine the video sequence number jump value based on the timestamp sequence number of the video data frame received this time and the timestamp sequence number of the video data frame received twice previously;

The expected video playback time is determined according to the video sequence number jump value, and the expected video playback time represents the playback time corresponding to the video data frame;

Determine the audio sequence number jump value according to the timestamp sequence number of the audio data frame received this time and the timestamp sequence number of the audio data frame received twice previously;

The expected audio playback time is determined according to the audio sequence number jump value, and the expected audio playback time represents the playback time corresponding to the audio data frame, wherein the expected audio playback time is consistent with the expected video playback time;

The audio data frame and the video data frame are synchronized according to the desired audio playback time and the desired video playback time.
The audio and video processing method according to claim 1, wherein determining the video sequence number jump value based on the timestamp sequence number of the video data frame received this time and the timestamp sequence number of the video data frame received twice previously includes:

Determine the first sequence number difference based on the timestamp sequence numbers of the previously received two video data frames;

Determine the second sequence number difference based on the timestamp sequence number of the video data frame received this time and the timestamp sequence number of the video data frame received last time;

The video sequence number jump value is calculated based on the second sequence number difference and the first sequence number difference.
The audio and video processing method according to claim 1, wherein the determining the expected video playback time according to the video sequence number jump value includes:

When the video sequence number jump value is greater than the preset video sequence number jump threshold, a first time interval is added to the expected video playback time corresponding to the last received video data frame to determine the expected video playback time and obtain the first video expected playback time, wherein the first time interval represents the time interval between two receptions of the video data frame.
The audio and video processing method according to claim 3, wherein the determining the expected video playback time according to the video sequence number jump value includes:

When the video sequence number jump value is less than the video sequence number jump threshold, the expected video playback time is determined based on the initial video timestamp sequence number, the initial video expected playback time and the timestamp sequence number of the video data frame, to obtain the second video expected playback time, wherein the initial video timestamp sequence number is obtained according to the timestamp sequence number of the video data frame received this time, and the initial video expected playback time is obtained according to the expected video playback time of the video data frame received this time.
The audio and video processing method according to claim 4, wherein the initial video timestamp serial number and the initial video expected playback time are obtained by the following steps:

When the video data frame is the first received video data frame, the timestamp sequence number of the first video data frame is determined as the initial video timestamp sequence number, and the expected video playback time corresponding to the first video data frame is determined as the initial video expected playback time.
The audio and video processing method according to claim 4 or 5, further comprising:

When the video sequence number jump value is greater than the video sequence number jump threshold, the initial video timestamp sequence number is updated according to the timestamp sequence number of the video data frame received this time;

The initial video expected play time is updated according to the first video expected play time.
The audio and video processing method according to claim 1, wherein when the video data frame is the received first video data frame, the audio and video processing method further includes:

When the time at which the first video data frame is received is earlier than or equal to the time at which the first audio data frame is received, the expected video playback time is set to a preset time value.
The audio and video processing method according to claim 1, wherein when the video data frame is the received first video data frame, the audio and video processing method further includes:

When the time of receiving the first video data frame is later than the time of receiving the first audio data frame, the time interval between receiving the first video data frame and receiving the first audio data frame is determined as the expected video playback time.
The audio and video processing method according to claim 1, further comprising:

When no video data frame has been received for longer than the preset video frame filling time threshold, copy the last received video data frame;

Determine the expected video playback time of the copied video data frame according to a first time interval and the expected video playback time corresponding to the last received video data frame, wherein the first time interval represents the time interval between two receptions of the video data frame.
The audio and video processing method according to claim 1, wherein determining the audio sequence number jump value based on the timestamp sequence number of the audio data frame received this time and the timestamp sequence number of the audio data frame received twice previously includes:

Determine the third sequence number difference based on the timestamp sequence numbers of the audio data frames received twice previously;

Determine a fourth sequence number difference based on the timestamp sequence number of the audio data frame received this time and the timestamp sequence number of the audio data frame received last time;

The audio sequence number jump value is calculated according to the third sequence number difference and the fourth sequence number difference.
The audio and video processing method according to claim 1, wherein determining the expected audio playback time according to the audio sequence number jump value includes:

When the audio sequence number jump value is greater than the preset audio sequence number jump threshold, a second time interval is added to the expected audio playback time corresponding to the last received audio data frame to determine the expected audio playback time and obtain the first audio expected playback time, wherein the second time interval represents the time interval between two receptions of the audio data frame.
The audio and video processing method according to claim 11, further comprising:

When the audio sequence number jump value is less than the audio sequence number jump threshold, the expected audio playback time is determined based on the initial audio timestamp sequence number, the initial audio expected playback time and the timestamp sequence number of the audio data frame, to obtain the second audio expected playback time, wherein the initial audio timestamp sequence number is obtained according to the timestamp sequence number of the audio data frame received this time, and the initial audio expected playback time is obtained according to the expected audio playback time of the audio data frame received this time.
The audio and video processing method according to claim 12, wherein the initial audio timestamp serial number and the initial audio expected playback time are obtained by the following steps:

When the audio data frame is the first received audio data frame, the timestamp sequence number of the first audio data frame is determined as the initial audio timestamp sequence number, and the expected audio playback time corresponding to the first audio data frame is determined as the initial audio expected playback time.
The audio and video processing method according to claim 12 or 13, further comprising:

When the audio sequence number jump value is greater than the audio sequence number jump threshold, update the initial audio timestamp sequence number according to the timestamp sequence number of the audio data frame received this time;

The initial audio expected play time is updated according to the first audio expected play time.
The audio and video processing method according to claim 1, wherein when the audio data frame is the first received audio data frame, the audio and video processing method further includes:

When the time of receiving the first frame of audio data frame is earlier than or equal to the time of receiving the first frame of video data frame, the expected audio playback time is set to a preset time value.
The audio and video processing method according to claim 1 or 15, wherein when the audio data frame is the first received audio data frame, the audio and video processing method further includes:

When the time of receiving the first audio data frame is later than the time of receiving the first video data frame, the time interval between receiving the first audio data frame and receiving the first video data frame is determined as the expected audio playback time.
The audio and video processing method according to claim 1, further comprising:

When no audio data frame has been received for longer than the preset audio frame filling time threshold, copy the last received audio data frame;

The expected audio playback time of the copied audio data frame is determined according to a second time interval and the expected audio playback time corresponding to the last received audio data frame, wherein the second time interval represents the time interval between two receptions of the audio data frame.
An audio and video processing device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the audio and video processing method according to any one of claims 1 to 17 is implemented.
A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute the audio and video processing method according to any one of claims 1 to 17.