CN116708892A - Sound and picture synchronous detection method, device, equipment and storage medium - Google Patents

Sound and picture synchronous detection method, device, equipment and storage medium

Info

Publication number
CN116708892A
CN116708892A (application CN202310813430.4A)
Authority
CN
China
Prior art keywords
video
key frame
audio data
audio
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310813430.4A
Other languages
Chinese (zh)
Inventor
戴智勇
接宏恩
王继成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Douyin Vision Co Ltd filed Critical Douyin Vision Co Ltd


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43072Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440281Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping

Abstract

The embodiments of the disclosure provide a sound and picture synchronization detection method, device, equipment and storage medium. The method comprises the following steps: acquiring a first video to be detected, wherein the first video is obtained by transcoding a second video whose audio and pictures are synchronized; extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame with the second key frame, and determining a target second key frame aligned with the first key frame; extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame; and matching the first audio data with the second audio data to determine the audio-video synchronization detection result of the first video. According to this technical scheme, audio-video synchronization can be detected automatically after transcoding, which improves the efficiency of audio-video synchronization detection while guaranteeing its accuracy.

Description

Sound and picture synchronous detection method, device, equipment and storage medium
Technical Field
The embodiment of the disclosure relates to computer technology, in particular to a sound and picture synchronization detection method, a sound and picture synchronization detection device, sound and picture synchronization detection equipment and a storage medium.
Background
With the rapid development of computer technology, videos often need to be transcoded. However, during transcoding it frequently happens that the audio and the video pictures fall out of synchronization, that is, the sound and the pictures no longer coincide accurately. For example, the video picture shows a person speaking but there is no corresponding sound, which greatly degrades the viewing experience. At present, whether transcoding has caused the audio and the video pictures to become desynchronized is judged manually, by watching the video. This manual detection is time-consuming and labor-intensive, and reduces the efficiency of sound and picture synchronization detection.
Disclosure of Invention
The present disclosure provides a method, an apparatus, a device, and a storage medium for detecting sound and picture synchronization, so as to automatically detect sound and picture synchronization of a transcoded video, thereby improving the detection efficiency of sound and picture synchronization and ensuring the accuracy of sound and picture synchronization detection.
In a first aspect, an embodiment of the present disclosure provides a method for detecting synchronization of audio and video, including:
acquiring a first video to be detected, wherein the first video is obtained by transcoding a second video whose audio and pictures are synchronized;
extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame with the second key frame, and determining a target second key frame aligned with the first key frame;
Extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
and matching the first audio data with the second audio data, and determining an audio-video synchronization detection result of the first video.
In a second aspect, an embodiment of the present disclosure further provides an audio and video synchronization detection apparatus, including:
the first video acquisition module is used for acquiring a first video to be detected, wherein the first video is obtained by transcoding a second video whose audio and pictures are synchronized;
the key frame alignment module is used for extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame with the second key frame and determining a target second key frame aligned with the first key frame;
the audio data extraction module is used for extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
and the audio-video synchronization detection module is used for matching the first audio data with the second audio data and determining an audio-video synchronization detection result of the first video.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for detecting synchronization of audio and video as described in any one of the embodiments of the present disclosure.
In a fourth aspect, the presently disclosed embodiments also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing the sound and picture synchronization detection method according to any of the presently disclosed embodiments.
According to the embodiments of the disclosure, the second video, whose sound and picture are synchronized before transcoding, is taken as a reference for detecting the transcoded first video; after key frame alignment, the first audio data corresponding to the first key frame is matched with the second audio data corresponding to the target second key frame, so that whether transcoding has caused sound-picture desynchronization in the first video can be determined accurately, automatic detection of sound-picture synchronization is realized, and the accuracy of sound-picture synchronization detection is ensured.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a schematic flow chart of a method for detecting synchronization of audio and video according to an embodiment of the disclosure;
FIG. 2 is a waveform matching diagram of first audio data and second audio data in accordance with an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating another method for detecting synchronization of audio and video according to an embodiment of the present disclosure;
FIG. 4 is an example of audio data extraction in accordance with embodiments of the present disclosure;
fig. 5 is a schematic structural diagram of an audio-video synchronization detection device according to an embodiment of the disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are intended to be illustrative rather than limiting; those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Fig. 1 is a schematic flow chart of a method for detecting synchronization of audio and video provided by an embodiment of the present disclosure, where the embodiment of the present disclosure is applicable to a situation of performing synchronization detection of audio and video on a transcoded video, the method may be performed by an apparatus for detecting synchronization of audio and video, and the apparatus may be implemented in a form of software and/or hardware, optionally, by an electronic device, where the electronic device may be a mobile terminal, a PC end, a server, or the like.
As shown in fig. 1, the audio-video synchronization detection method specifically includes the following steps:
s110, acquiring a first video to be detected, wherein the first video is obtained by transcoding a second video with synchronous audios and videos.
The first video refers to a video to be detected after transcoding. The second video refers to the video before transcoding. The audio and video pictures in the second video are synchronous, so that the second video can be used as a reference video to detect the audio and video synchronization of the first video.
Specifically, a user may make a video and publish it through a client. After receiving the video published by the client, the server needs to transcode the uploaded video, for example adjusting its resolution and code rate to improve picture quality, and then sends the transcoded video to other clients for playback. In this application scenario, the video uploaded to the server, whose sound and picture are synchronized, may be taken as the second video, and the transcoded video may be taken as the first video to be detected. By taking the second video before transcoding as the reference for sound-picture synchronization, whether transcoding has caused the first video to become desynchronized can be detected accurately.
S120, extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame with the second key frame, and determining a target second key frame aligned with the first key frame.
The key frame may refer to a video frame that plays a decisive role for video content, such as a video frame with scene switching, a video frame with motion change, a video frame in which a key action is located, and the like. The first keyframe may refer to a keyframe in the first video. The number of first key frames may be one or more. The second keyframe may refer to a keyframe in the second video. The number of second key frames may also be one or more. The number of first key frames may be the same as or different from the number of second key frames. The alignment refers to an operation of aligning key frames having the same picture. The target second key frame refers to a second key frame having the same picture as the first key frame.
Specifically, key frames of the first video and the second video are detected, and all first key frames in the first video and all second key frames in the second video are extracted. The first key frame and the second key frame may be aligned by detecting image similarity between the first key frame and the second key frame. For example, for each first key frame, an image similarity between the first key frame and each second key frame may be determined, and a second key frame having the highest image similarity and greater than or equal to a preset similarity threshold may be used as a target second key frame aligned with the first key frame, so as to obtain a target second key frame having the same picture as the first key frame.
The image similarity can be characterized by using any index capable of measuring the similarity of two images. For example, image similarity is characterized using a structural similarity SSIM (Structural Similarity) index. The value of the SSIM index ranges from 0 to 1, and a larger SSIM index indicates a higher image similarity.
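The SSIM comparison named above can be sketched as follows. This is a minimal, single-window (global) SSIM over two equal-sized grayscale images, using the standard stabilizing constants C1 = (0.01L)² and C2 = (0.03L)² for dynamic range L; the patent only names SSIM as one possible index, and a practical implementation would normally use a windowed SSIM from an image-processing library rather than this simplification.

```python
def ssim_global(img_a, img_b, dynamic_range=255.0):
    """Single-window SSIM over two equal-sized 2D grayscale images (lists of rows)."""
    xs = [float(p) for row in img_a for p in row]
    ys = [float(p) for row in img_b for p in row]
    assert len(xs) == len(ys) and xs, "images must be the same, non-empty size"
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    var_x = sum((x - mu_x) ** 2 for x in xs) / n
    var_y = sum((y - mu_y) ** 2 for y in ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    # standard SSIM stabilizing constants
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Identical frames score 1, so a first key frame and its target second key frame are accepted when this value meets the preset similarity threshold.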
It should be noted that if the target second key frame identical to a certain first key frame does not exist in all the second key frames, it indicates that the first key frame cannot be aligned, and at this time, the first key frame may be ignored, and only the aligned first key frame is used to perform subsequent audio-video synchronization detection.
S130, extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame.
Specifically, for each first key frame, first audio data corresponding to a preset audio duration may be extracted at a first key frame position in the first video. And extracting second audio data corresponding to the preset audio duration at a target second key frame position in the second video, so that the extracted first audio data and second audio data have the same audio duration. For example, the extracted first audio data is audio data with a preset audio duration by taking a play time stamp of the first key frame as a preset reference time. The extracted second audio data takes the playing time stamp of the target second key frame as a preset reference time, and has the preset audio time length. The preset reference time may refer to a start time, a center time, or an end time of the audio data.
It should be noted that the picture contents of the aligned first key frame and the target second key frame are the same, but the corresponding play time stamps may differ. When the extracted first audio data and second audio data have the same audio duration, the first audio data and the second audio data need to be extracted at the same position relative to the key frame, so as to ensure the accuracy of the subsequent sound-picture synchronization detection. For example, first audio data having a preset audio duration with the play time stamp of the first key frame as its start time is extracted from the first video, and second audio data having the preset audio duration with the play time stamp of the target second key frame as its start time is extracted from the second video.
And S140, matching the first audio data with the second audio data, and determining an audio-video synchronization detection result of the first video.
Specifically, fig. 2 shows a waveform matching diagram of the first audio data and the second audio data. As shown in fig. 2, the first audio data and the second audio data have the same audio duration, so the matching result between them can be determined based on the degree of similarity between the two audio waveforms in fig. 2. For example, the covariance between the first audio data and the second audio data may be used as the audio similarity between them; when the audio similarity is greater than or equal to a preset similarity threshold, it is determined that the first audio data and the second audio data are successfully matched, and when the audio similarity is smaller than the preset similarity threshold, it is determined that the matching fails. A successful match indicates that the audio and the video picture at the first key frame position are synchronized. A failed match indicates that the audio and the video picture at the first key frame position are not synchronized, i.e., that transcoding has caused the sound and picture to become desynchronized there. According to the matching result corresponding to each first key frame in the first video, the audio-video synchronization detection result of the first video can be determined. For example, if the matching results corresponding to all first key frames in the first video are successful, the sound-picture synchronization detection result of the first video is determined to be synchronized. If the matching result corresponding to at least one first key frame is a failure, the detection result of the first video is determined to be desynchronized, indicating that audio-video desynchronization caused by transcoding exists in the first video.
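The waveform comparison in this step can be sketched as follows. The patent names covariance as the audio similarity; because raw covariance is scale-dependent, this sketch uses the normalized covariance (Pearson correlation), which lies in [-1, 1] and can be compared against a fixed threshold. The normalization and the 0.9 threshold are assumptions for illustration, not values from the patent.

```python
def audio_similarity(a, b):
    """Normalized covariance (Pearson correlation) of two equal-length waveforms."""
    assert len(a) == len(b) and len(a) > 1
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    sd_a = (sum((x - mu_a) ** 2 for x in a) / n) ** 0.5
    sd_b = (sum((y - mu_b) ** 2 for y in b) / n) ** 0.5
    if sd_a == 0 or sd_b == 0:
        return 0.0  # a silent/constant segment carries no shape to match
    return cov / (sd_a * sd_b)

def frames_synchronized(first_audio, second_audio, threshold=0.9):
    """True when the two audio segments match, per the preset similarity threshold."""
    return audio_similarity(first_audio, second_audio) >= threshold
```

A per-key-frame matching result is then a simple boolean, and the whole first video is reported as synchronized only if every aligned first key frame matches.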
According to the above technical scheme, the second video, whose sound and picture are synchronized before transcoding, is used as a reference to detect the transcoded first video, and after key frame alignment the first audio data corresponding to the first key frame is matched with the second audio data corresponding to the target second key frame. In this way, whether transcoding has caused sound-picture desynchronization in the first video can be determined accurately, automatic detection of sound-picture synchronization is achieved, and the accuracy of sound-picture synchronization detection is guaranteed.
As an alternative embodiment, "aligning the first key frame with the second key frame, and determining the target second key frame aligned with the first key frame" in S120 may include: acquiring an optional second key frame sequence for the first key frame; sequentially acquiring a current second key frame according to the order of the second key frame sequence, and determining the image similarity between the first key frame and the current second key frame; and if the image similarity is greater than or equal to a preset similarity threshold, determining the current second key frame as the target second key frame aligned with the first key frame.
The image similarity can be characterized by using any index capable of measuring the similarity of two images. For example, image similarity is characterized using a structural similarity SSIM (Structural Similarity) index. The value of the SSIM index ranges from 0 to 1, and a larger SSIM index indicates a higher image similarity.
Specifically, all the extracted first key frames may be ordered according to the video playing order to obtain a first key frame sequence, such as {A1, A2, …, An}. Similarly, all the extracted second key frames may be ordered according to the video playing order to obtain a second key frame sequence {B1, B2, …, Bm}. The target second key frame aligned with each first key frame may then be determined from the second key frame sequence in turn, following the order of the first key frame sequence.
For example, the optional second key frame sequence of the first key frame A1 is {B1, B2, …, Bm}. Following the order of the second key frame sequence, the second key frame B1 may first be taken as the current second key frame, and it is detected whether the image similarity between the first key frame A1 and the second key frame B1 is greater than or equal to the preset similarity threshold. If so, the first key frame A1 is similar to the second key frame B1, and B1 can be directly used as the target second key frame aligned with A1. If not, the next second key frame B2 is taken as the current second key frame and the detection is repeated, and so on until the second key frame matching the first key frame A1 is determined, completing the alignment of A1. Assuming the target second key frame aligned with A1 is B2, then when the second first key frame A2 is aligned, since the target second key frame aligned with A2 can only appear among the key frames following B2, the second key frames after B2 may be taken as the optional second key frame sequence of A2, i.e., {B3, B4, …, Bm}, and the target second key frame aligned with A2 may be determined from this sequence using the same detection procedure as described above. Proceeding in this manner, the target second key frame aligned with each first key frame can be obtained more quickly, which improves the key frame alignment speed and further improves the efficiency of sound and picture synchronization detection.
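The single-forward-pass alignment just described can be sketched as follows: each first key frame only searches the second key frames that follow the previous match. The `similarity` callable and the 0.8 threshold stand in for the SSIM comparison named in the text and are assumptions for illustration.

```python
def align_key_frames(first_frames, second_frames, similarity, threshold=0.8):
    """Map each first key frame index to the index of its aligned target second key frame."""
    alignments = {}
    start = 0  # candidates before the previous match are never revisited
    for i, frame_a in enumerate(first_frames):
        for j in range(start, len(second_frames)):
            if similarity(frame_a, second_frames[j]) >= threshold:
                alignments[i] = j
                start = j + 1  # the next first key frame searches after this match
                break
        # first key frames with no match are simply skipped, as the text notes
    return alignments
```

Because `start` only advances, the two sequences are walked in linear time overall, which is where the claimed speed-up over comparing every pair comes from.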
Fig. 3 is a flowchart of another audio-video synchronization detection method provided by an embodiment of the present disclosure. On the basis of the above embodiments, the audio duration of the extracted first audio data is longer than the audio duration of the second audio data, and on this basis the step of "matching the first audio data with the second audio data to determine the audio-video synchronization detection result of the first video" is refined. Terms that are the same as or correspond to those of the above embodiments are not explained in detail here.
As shown in fig. 3, the audio-video synchronization detection method specifically includes the following steps:
s310, acquiring a first video to be detected, wherein the first video is obtained by transcoding a second video with synchronous audios and videos.
S320, extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame with the second key frame, and determining a target second key frame aligned with the first key frame.
S330, extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame.
The audio time length corresponding to the first audio data is a first preset time length, and the audio time length corresponding to the second audio data is a second preset time length. The first preset time period is longer than the second preset time period. For example, the first preset duration is 500ms and the second preset duration is 20ms.
Specifically, for each first key frame, first audio data corresponding to a first preset duration may be extracted at a first key frame position in the first video. And extracting second audio data corresponding to a second preset duration at a target second key frame position in the second video, so that the audio time length of the extracted first audio data is longer than the audio time length of the second audio data, and the second audio data can be matched in the first audio data later.
Illustratively, the extraction of the first audio data requires extraction within a range of the first key frame location for accurate audio matching to follow. For example, the first audio data may be audio data having a first preset duration centered on a play time stamp of the first key frame. Fig. 4 gives an example of the first audio data extraction. Referring to fig. 4, first audio data having a first preset duration of 500ms centering on a play time stamp of each first key frame is extracted from audio of a first video.
For example, there may be a plurality of extraction ways for the second audio data as a reference, which may be extracted to the left, or to the left and right at the target second key frame position. For example, the second audio data may be audio data having a second preset duration with a play time stamp of the target second key frame as a preset reference time. The preset reference time may be one of the following: start time, center time and end time. Fig. 4 also gives an example of a second audio data extraction. Referring to fig. 4, second audio data having a second preset duration of 20ms with a play time stamp of each target second key frame as a start time is extracted from audio of the second video.
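The two extraction windows of fig. 4 can be sketched as sample-index arithmetic on a PCM buffer: 500 ms of first-video audio centered on the first key frame's play timestamp, and 20 ms of second-video audio starting at the target second key frame's play timestamp. The durations come from the text; treating the audio as a flat list of samples at a known sample rate is an assumption for illustration.

```python
def window(samples, sample_rate, start_ms, duration_ms):
    """Slice duration_ms of audio beginning at start_ms (clamped at the buffer start)."""
    lo = int(start_ms * sample_rate / 1000)
    hi = lo + int(duration_ms * sample_rate / 1000)
    return samples[max(lo, 0):hi]

def extract_windows(first_pcm, second_pcm, sample_rate,
                    first_key_ts_ms, target_key_ts_ms,
                    first_dur_ms=500, second_dur_ms=20):
    # first audio data: centered on the first key frame's play timestamp
    first = window(first_pcm, sample_rate,
                   first_key_ts_ms - first_dur_ms / 2, first_dur_ms)
    # second audio data: starts at the target second key frame's play timestamp
    second = window(second_pcm, sample_rate, target_key_ts_ms, second_dur_ms)
    return first, second
```

The longer first window gives the later sliding search room to find the second audio even when transcoding has shifted it.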
S340, determining target audio data which has a second preset duration and is matched with the second audio data from the first audio data.
The target audio data may refer to audio data that is successfully matched with the second audio data in the first audio data. That is, the target audio data and the second audio data may be approximately the same data.
Specifically, since the time length of the first audio data is longer than the time length of the second audio data, the first audio data can be traversed, and a plurality of pieces of audio data to be selected with a second preset time length are extracted, for example, one piece of audio data to be selected with the second preset time length is extracted every 1 ms. The audio similarity between each piece of the audio data to be selected and the second audio data can be determined, and the audio data to be selected with the largest audio similarity is used as target audio data matched with the second audio data.
Illustratively, S340 may include: sliding the first audio data based on the preset sliding time length and the second preset time length to obtain current audio data with the second preset time length corresponding to the current sliding; determining an audio similarity between the current audio data and the second audio data; and determining target audio data matched with the second audio data based on the audio similarity corresponding to the current audio data.
Specifically, the first audio data may be slid over successively based on a preset sliding duration to obtain current audio data having the second preset duration. For example, when the preset sliding duration is 1ms, referring to fig. 4, the current audio data obtained by the first slide is the audio data from 0ms to 20ms in the first audio data, the current audio data obtained by the second slide is the audio data from 1ms to 21ms, and so on. After each slide, the covariance between the current audio data obtained by that slide and the second audio data may be determined and used as the audio similarity corresponding to the current audio data. If the audio similarity is greater than or equal to a preset similarity threshold, the current audio data is determined to match the second audio data and is taken as the target audio data. If the audio similarity is smaller than the preset similarity threshold, the current audio data is determined not to match the second audio data; in this case the current audio data is updated by sliding, and the current audio data obtained by the next slide is checked against the second audio data, continuing until the matched target audio data is determined, at which point sliding stops. With this sliding matching approach, not all audio data of the second preset duration within the first audio data need to be traversed; the target audio data matched with the second audio data can be obtained more quickly while the synchronization detection accuracy is guaranteed, further improving the efficiency of sound-picture synchronization detection.
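The sliding search above can be sketched as follows: step through the first audio one slide at a time, compare each slice of the second preset duration against the second audio, and stop at the first slice that clears the threshold. The `similarity` callable stands in for the covariance comparison in the text, and the early return on the first match mirrors the "stop sliding" behavior described above.

```python
def sliding_match(first_audio, second_audio, similarity,
                  step_samples, threshold=0.9):
    """Return the start offset (in samples) of the first matching slice, or None."""
    win = len(second_audio)
    for start in range(0, len(first_audio) - win + 1, step_samples):
        candidate = first_audio[start:start + win]
        if similarity(candidate, second_audio) >= threshold:
            return start  # stop sliding as soon as a match is found
    return None  # no slice matched: sound and picture desynchronized at this key frame
```

At a 1 ms slide over a 500 ms window this is at most ~480 comparisons per key frame, and usually far fewer since the search stops at the first match.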
S350, determining the sound-picture offset between the target audio data and the first key frame by taking the second audio data and the target second key frame as sound-picture synchronization references.
The sound-picture offset may refer to the time difference between the sound of the target audio data and the picture of the first key frame.
Specifically, since the second video is sound-picture synchronous, the second audio data and the target second key frame are themselves in sync, so the first time difference between the second audio data and the target second key frame can be used as the reference time difference for sound-picture synchronization. The target time difference between the target audio data and the first key frame is then detected. If the target time difference equals the reference time difference, the target audio data and the first key frame are sound-picture synchronous, and the sound-picture offset is 0. If the target time difference does not equal the reference time difference, the specific sound-picture offset can be determined based on the difference between the target time difference and the reference time difference.
Illustratively, S350 may include: determining a reference time difference when the audio and the picture are synchronized based on the start time stamp of the second audio data and the play time stamp of the target second key frame; determining a target time difference corresponding to the first key frame based on the start time stamp of the target audio data and the play time stamp of the first key frame; and determining the sound-picture offset corresponding to the first key frame based on the target time difference and the reference time difference.
Specifically, the time difference obtained by subtracting the play time stamp of the target second key frame from the start time stamp of the second audio data may be determined as the reference time difference for sound-picture synchronization. For the extraction manner of the second audio data in fig. 4, the start time stamp of the second audio data is the play time stamp of the target second key frame, so the reference time difference at sound-picture synchronization is 0. The play time stamp of the first key frame is then subtracted from the start time stamp of the target audio data, and the resulting time difference is determined as the target time difference corresponding to the first key frame. Subtracting the reference time difference from the target time difference yields the sound-picture offset corresponding to the first key frame. Referring to fig. 4, the reference time difference is 0, so the target time difference is itself the sound-picture offset. If the sound-picture offset is greater than 0, the audio leads the video picture; if the sound-picture offset is less than 0, the audio lags the video picture.
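The offset arithmetic of S350 can be written out in a few lines (timestamps in ms; the function and parameter names are illustrative, not from the patent, and the sign convention follows the description above):

```python
def av_offset(second_audio_start, target_second_kf_pts,
              target_audio_start, first_kf_pts):
    """Return the sound-picture offset for the first key frame.

    reference_diff: start of the second audio data minus the play
    timestamp of the aligned target second key frame (0 when the audio
    window starts exactly at that key frame, as in fig. 4).
    target_diff: start of the matched target audio data minus the play
    timestamp of the first key frame.
    Per the text, offset > 0 indicates audio leading the picture and
    offset < 0 indicates audio lagging it."""
    reference_diff = second_audio_start - target_second_kf_pts
    target_diff = target_audio_start - first_kf_pts
    return target_diff - reference_diff
```

When the reference time difference is 0 (fig. 4's extraction manner), the returned value reduces to the target time difference alone, matching the description.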
S360, determining an audio-video synchronization detection result of the first video based on the audio-video offset corresponding to the first key frame.
Specifically, based on the audio-visual offset corresponding to each first key frame in the first video, an audio-visual synchronization detection result of the first video as a whole can be determined.
Illustratively, S360 may include: if the sound-picture offset corresponding to each first key frame in the first video is within a preset allowable range, determining that the sound-picture synchronization detection result of the first video is that the sound and picture are synchronized; and if the sound-picture offset corresponding to at least one first key frame in the first video is not within the preset allowable range, determining that the sound-picture synchronization detection result of the first video is that the sound and picture are not synchronized.
The preset allowable range may be a range of permissible sound-picture offsets set in advance based on a detection standard. For example, a preset allowable range of [-185, 90] indicates that, relative to the video picture, an audio lag of up to 185ms or an audio lead of up to 90ms is acceptable and is still considered sound-picture synchronized.
Specifically, whether the sound-picture offset corresponding to each first key frame in the first video is within the preset allowable range is detected. If the sound-picture offsets corresponding to all first key frames are within the preset allowable range, the sound-picture synchronization detection result of the first video is determined to be synchronized; that is, the transcoded first video is in sync, and the transcoding has not caused the sound and picture to drift apart. If the sound-picture offset corresponding to at least one first key frame is not within the preset allowable range, the sound-picture synchronization detection result of the first video is determined to be out of sync; that is, transcoding has desynchronized the sound and picture at the positions of those first key frames. In this case, the sound-picture offsets corresponding to the out-of-range first key frames may be output, to alert the user to the specific positions and the degree of the desynchronization so that it can be handled quickly.
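The decision step of S360 can be sketched as a small check over the per-key-frame offsets (a hypothetical helper, using the example range from the text; names are illustrative):

```python
ALLOWED_RANGE = (-185, 90)  # example from the text: (max audio lag, max audio lead), in ms

def detect_sync(offsets_ms, allowed=ALLOWED_RANGE):
    """Return (in_sync, out_of_range), where out_of_range maps the index
    of each offending key frame to its offset, so that the specific
    positions and degree of desynchronization can be reported."""
    lo, hi = allowed
    out_of_range = {i: off for i, off in enumerate(offsets_ms)
                    if not lo <= off <= hi}
    return (len(out_of_range) == 0, out_of_range)
```

Returning the offending offsets alongside the boolean verdict mirrors the description above, where the out-of-range offsets are output to the user for quick processing.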
According to the technical scheme, the first audio data with longer audio duration is extracted, the target audio data matched with the second audio data is extracted again from the first audio data, the second audio data and the target second key frame are used as sound and picture synchronization references, the sound and picture offset between the target audio data and the first key frame is determined, and the sound and picture synchronization detection result of the first video can be determined more accurately based on the sound and picture offset corresponding to the first key frame, so that the accuracy of sound and picture synchronization detection is further improved.
Fig. 5 is a schematic structural diagram of an audio-video synchronization detection device according to an embodiment of the present disclosure, as shown in fig. 5, where the device specifically includes: a first video acquisition module 510, a key frame alignment module 520, an audio data extraction module 530, and a sound and picture synchronization detection module 540.
The first video obtaining module 510 is configured to obtain a first video to be detected, where the first video is a video obtained by transcoding a second video with synchronous audio and video; a key frame alignment module 520, configured to extract a first key frame in the first video and a second key frame in the second video, align the first key frame and the second key frame, and determine a target second key frame aligned with the first key frame; an audio data extraction module 530, configured to extract first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame; and the audio-video synchronization detection module 540 is configured to match the first audio data with the second audio data, and determine an audio-video synchronization detection result of the first video.
According to the technical scheme provided by the embodiment of the disclosure, the transcoded first video is detected by taking the transcoded second video with the synchronous audio and video as the reference, and the first audio data corresponding to the first key frame after the key frame is aligned and the second audio data corresponding to the target second key frame are matched, so that whether the first video is asynchronous with the audio and video caused by transcoding can be accurately determined, automatic detection of audio and video synchronization is realized, and accuracy of audio and video synchronization detection is ensured.
Based on the above technical solution, the key frame alignment module 520 is specifically configured to:
acquiring a sequence of candidate second key frames for the first key frame; sequentially acquiring a current second key frame in the order of the second key frame sequence, and determining the image similarity between the first key frame and the current second key frame; and if the image similarity is greater than or equal to a preset similarity threshold, determining the current second key frame as the target second key frame aligned with the first key frame.
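The alignment loop performed by the key frame alignment module can be sketched as follows (a hypothetical illustration: frames are opaque objects compared by a pluggable similarity function, and the first candidate reaching the threshold wins):

```python
def align_key_frame(first_kf, candidate_second_kfs, similarity, threshold):
    """Scan the candidate second key frames in sequence order and return
    the first one whose image similarity with first_kf is greater than
    or equal to `threshold`, or None if no candidate aligns."""
    for candidate in candidate_second_kfs:
        if similarity(first_kf, candidate) >= threshold:
            return candidate
    return None
```

Any image similarity measure normalized to a comparable scale (e.g. a histogram or perceptual-hash similarity) could be plugged in as `similarity`; the source does not prescribe one here.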
On the basis of the above technical solutions, the audio duration corresponding to the first audio data is a first preset duration, and the audio duration corresponding to the second audio data is a second preset duration, where the first preset duration is longer than the second preset duration.
On the basis of the above technical solutions, the first audio data is audio data with a first preset duration and takes a play time stamp of the first key frame as a center time;
the second audio data takes the playing time stamp of the target second key frame as a preset reference time and has a second preset duration; wherein the preset reference time is one of the following:
start time, center time and end time.
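The two audio windows described above can be computed from the key frame timestamps as follows (a minimal sketch; durations and timestamps are in ms, and the function names are illustrative):

```python
def first_audio_window(first_kf_pts, first_duration):
    """Window of the first preset duration, centred on the play
    timestamp of the first key frame."""
    half = first_duration / 2
    return (first_kf_pts - half, first_kf_pts + half)

def second_audio_window(target_kf_pts, second_duration, anchor="start"):
    """Window of the second preset duration, with the play timestamp of
    the target second key frame used as its start, center, or end time."""
    if anchor == "start":
        return (target_kf_pts, target_kf_pts + second_duration)
    if anchor == "center":
        return (target_kf_pts - second_duration / 2,
                target_kf_pts + second_duration / 2)
    if anchor == "end":
        return (target_kf_pts - second_duration, target_kf_pts)
    raise ValueError("anchor must be 'start', 'center' or 'end'")
```

Because the first duration exceeds the second, the wider, centred first window is guaranteed to contain any second window anchored at the aligned key frame, which is what makes the sliding match possible.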
Based on the above technical solutions, the audio-visual synchronization detection module 540 includes:
a target audio data determining unit, configured to determine target audio data that has the second preset duration and matches the second audio data from the first audio data;
the sound-picture offset determining unit is used for determining the sound-picture offset between the target audio data and the first key frame by taking the second audio data and the target second key frame as sound-picture synchronous references;
and the sound and picture synchronization detection unit is used for determining a sound and picture synchronization detection result of the first video based on the sound and picture offset corresponding to the first key frame.
Based on the above technical solutions, the target audio data determining unit is specifically configured to:
Sliding the first audio data based on a preset sliding time length and the second preset time length to obtain current audio data with the second preset time length corresponding to current sliding; determining an audio similarity between the current audio data and the second audio data; and determining target audio data matched with the second audio data based on the audio similarity corresponding to the current audio data.
On the basis of the above technical solutions, the audio-visual offset determining unit is specifically configured to:
determining a reference time difference when the audio and the video are synchronized based on the start time stamp of the second audio data and the play time stamp of the target second key frame; determining a target time difference corresponding to the first key frame based on the start time stamp of the target audio data and the play time stamp of the first key frame; and determining the sound-picture offset corresponding to the first key frame based on the target time difference and the reference time difference.
On the basis of the technical schemes, the audio-video synchronous detection unit is specifically used for:
if the sound-picture offset corresponding to each first key frame in the first video is within a preset allowable range, determining that the sound-picture synchronization detection result of the first video is that the sound and picture are synchronized; and if the sound-picture offset corresponding to at least one first key frame in the first video is not within the preset allowable range, determining that the sound-picture synchronization detection result of the first video is that the sound and picture are not synchronized.
The sound and picture synchronization detection device provided by the embodiment of the disclosure can execute the sound and picture synchronization detection method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of executing the sound and picture synchronization detection method.
It should be noted that each unit and module included in the above apparatus are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for convenience of distinguishing from each other, and are not used to limit the protection scope of the embodiments of the present disclosure.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. Referring now to fig. 6, a schematic diagram of an electronic device (e.g., a terminal device or server in fig. 6) 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 6 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 6, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage means 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device 500. The processing means 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The electronic device provided by the embodiment of the present disclosure and the audio-video synchronization detection method provided by the foregoing embodiment belong to the same inventive concept, and technical details not described in detail in the present embodiment may be referred to the foregoing embodiment, and the present embodiment has the same beneficial effects as the foregoing embodiment.
The embodiment of the present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the sound-and-picture synchronization detection method provided by the above embodiment.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a first video to be detected, wherein the first video is obtained by transcoding a second video with synchronous audios and pictures; extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame with the second key frame, and determining a target second key frame aligned with the first key frame; extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame; and matching the first audio data with the second audio data, and determining an audio-video synchronization detection result of the first video.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a sound and picture synchronization detection method, including:
acquiring a first video to be detected, wherein the first video is obtained by transcoding a second video with synchronous audios and pictures;
extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame with the second key frame, and determining a target second key frame aligned with the first key frame;
extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
and matching the first audio data with the second audio data, and determining an audio-video synchronization detection result of the first video.
According to one or more embodiments of the present disclosure, there is provided a sound-picture synchronization detection method, further including:
optionally, aligning the first key frame and the second key frame, determining a target second key frame aligned with the first key frame, including:
acquiring a second key frame sequence which is selectable by the first key frame;
sequentially acquiring current second key frames according to the sequence of the second key frames, and determining the image similarity between the first key frames and the current second key frames;
And if the image similarity is greater than or equal to a preset similarity threshold, determining the current second key frame as a target second key frame aligned with the first key frame.
According to one or more embodiments of the present disclosure, there is provided a sound and picture synchronization detection method, further including:
optionally, the audio duration corresponding to the first audio data is a first preset duration, and the audio duration corresponding to the second audio data is a second preset duration, where the first preset duration is longer than the second preset duration.
According to one or more embodiments of the present disclosure, there is provided a sound and picture synchronization detection method, further including:
optionally, the first audio data is audio data with a first preset duration and takes a playing time stamp of the first key frame as a center time;
the second audio data takes the playing time stamp of the target second key frame as a preset reference time and has a second preset duration; wherein the preset reference time is one of the following:
start time, center time and end time.
According to one or more embodiments of the present disclosure, there is provided a sound and picture synchronization detection method, further including:
Optionally, the matching the first audio data with the second audio data, and determining the result of detecting the synchronization of the audio and the video of the first video includes:
determining target audio data which has the second preset duration and is matched with the second audio data from the first audio data;
determining the sound-picture offset between the target audio data and the first key frame by taking the second audio data and the target second key frame as sound-picture synchronization references;
and determining an audio-video synchronization detection result of the first video based on the audio-video offset corresponding to the first key frame.
According to one or more embodiments of the present disclosure, there is provided a sound and picture synchronization detection method, further including:
optionally, the determining, from the first audio data, the target audio data having the second preset duration and matching the second audio data includes:
sliding the first audio data based on a preset sliding time length and the second preset time length to obtain current audio data with the second preset time length corresponding to current sliding;
determining an audio similarity between the current audio data and the second audio data;
And determining target audio data matched with the second audio data based on the audio similarity corresponding to the current audio data.
According to one or more embodiments of the present disclosure, there is provided a sound-picture synchronization detection method, further including:
optionally, the determining, based on the second audio data and the target second key frame as a sound-to-picture synchronization reference, a sound-to-picture offset between the target audio data and the first key frame includes:
determining a reference time difference when the audio and the video are synchronized based on the start time stamp of the second audio data and the play time stamp of the target second key frame;
determining a target time difference corresponding to the first key frame based on the start time stamp of the target audio data and the play time stamp of the first key frame;
and determining the sound-picture offset corresponding to the first key frame based on the target time difference and the reference time difference.
According to one or more embodiments of the present disclosure, there is provided a sound and picture synchronization detection method, further including:
optionally, the determining, based on the audio-visual offset corresponding to the first key frame, an audio-visual synchronization detection result of the first video includes:
If the sound-picture offset corresponding to each first key frame in the first video is within a preset allowable range, determining that the sound-picture synchronization detection result of the first video is that the sound and picture are synchronized;

if the sound-picture offset corresponding to at least one first key frame in the first video is not within the preset allowable range, determining that the sound-picture synchronization detection result of the first video is that the sound and picture are not synchronized.
According to one or more embodiments of the present disclosure, there is provided an audio-visual synchronization detection apparatus, including:
the first video acquisition module is used for acquiring a first video to be detected, wherein the first video is obtained by transcoding a second video with synchronous audios and pictures;
the key frame alignment module is used for extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame with the second key frame and determining a target second key frame aligned with the first key frame;
the audio data extraction module is used for extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
and the audio-video synchronization detection module is used for matching the first audio data with the second audio data and determining an audio-video synchronization detection result of the first video.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to herein is not limited to the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (11)

1. A sound and picture synchronization detection method, characterized by comprising:
acquiring a first video to be detected, wherein the first video is obtained by transcoding a second video in which sound and picture are synchronized;
extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame with the second key frame, and determining a target second key frame aligned with the first key frame;
extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
and matching the first audio data with the second audio data, and determining an audio-video synchronization detection result of the first video.
2. The method of claim 1, wherein aligning the first key frame with the second key frame, determining a target second key frame aligned with the first key frame, comprises:
acquiring a sequence of candidate second key frames for the first key frame;
taking each second key frame in the sequence in turn as the current second key frame, and determining the image similarity between the first key frame and the current second key frame;
and if the image similarity is greater than or equal to a preset similarity threshold, determining the current second key frame as the target second key frame aligned with the first key frame.
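The alignment step of claim 2 can be sketched as follows. The patent does not fix an image-similarity metric or a threshold value; the normalized mean-absolute-difference metric and the 0.9 default below are illustrative assumptions only.

```python
import numpy as np

def image_similarity(a, b):
    # One illustrative metric: 1 minus the normalized mean absolute
    # pixel difference (1.0 for identical 8-bit frames).
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - np.mean(np.abs(a - b)) / 255.0

def align_key_frame(first_frame, candidate_second_frames, threshold=0.9):
    # Walk the candidate second key frames in order and return the index
    # of the first one whose similarity meets the threshold (claim 2),
    # or None if no candidate aligns.
    for idx, candidate in enumerate(candidate_second_frames):
        if image_similarity(first_frame, candidate) >= threshold:
            return idx
    return None
```

In practice a perceptual metric such as SSIM or a frame-hash comparison could be substituted for `image_similarity` without changing the alignment loop.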
3. The method for detecting synchronization of audio and video according to claim 1, wherein the audio duration corresponding to the first audio data is a first preset duration, and the audio duration corresponding to the second audio data is a second preset duration, and wherein the first preset duration is longer than the second preset duration.
4. The sound and picture synchronization detection method according to claim 3, wherein the first audio data is audio data of the first preset duration centered on the play timestamp of the first key frame;
the second audio data is audio data of the second preset duration positioned relative to the play timestamp of the target second key frame, which serves as a preset reference time, wherein the preset reference time is one of the following:
a start time, a center time, and an end time of the second audio data.
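The window extraction of claims 3 and 4 amounts to slicing a span of audio samples relative to a key frame's play timestamp. A minimal sketch, assuming PCM samples at a known sample rate (the patent does not specify the audio representation):

```python
def extract_window(samples, sample_rate, timestamp, duration, anchor="center"):
    # Slice `duration` seconds of audio relative to `timestamp` (seconds).
    # `anchor` selects how the timestamp relates to the window: its
    # start, center, or end time (the preset reference time of claim 4).
    n = int(duration * sample_rate)          # window length in samples
    t = int(timestamp * sample_rate)         # key frame position in samples
    if anchor == "start":
        lo = t
    elif anchor == "center":
        lo = t - n // 2
    elif anchor == "end":
        lo = t - n
    else:
        raise ValueError(f"unknown anchor: {anchor}")
    lo = max(lo, 0)                          # clamp at stream start
    return samples[lo:lo + n]
```

Per claim 3, the first audio data would use a longer `duration` (centered, per claim 4) than the second audio data, so that the shorter window can later be matched at several positions inside the longer one.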
5. The method of claim 3, wherein the matching the first audio data with the second audio data to determine the result of the audio-video synchronization detection of the first video comprises:
determining target audio data which has the second preset duration and is matched with the second audio data from the first audio data;
determining the sound-picture offset between the target audio data and the first key frame by taking the second audio data and the target second key frame as sound-picture synchronization references;
and determining an audio-video synchronization detection result of the first video based on the audio-video offset corresponding to the first key frame.
6. The sound and picture synchronization detection method according to claim 5, wherein the determining, from the first audio data, target audio data having the second preset duration and matching the second audio data includes:
sliding a window over the first audio data based on a preset sliding duration and the second preset duration, to obtain, for the current sliding position, current audio data having the second preset duration;
Determining an audio similarity between the current audio data and the second audio data;
and determining target audio data matched with the second audio data based on the audio similarity corresponding to the current audio data.
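The sliding-window matching of claim 6 can be sketched as below. The audio-similarity metric is left open by the patent; the negative mean-squared-difference used here is an illustrative stand-in (cross-correlation or spectral-fingerprint similarity would be common alternatives).

```python
def find_matching_window(first_audio, second_audio, step):
    # Slide a window of len(second_audio) samples over first_audio in
    # `step`-sample increments (the preset sliding duration of claim 6),
    # score each candidate window against second_audio, and return the
    # (offset, similarity) of the best match.
    n = len(second_audio)
    best_offset, best_sim = None, float("-inf")
    for lo in range(0, len(first_audio) - n + 1, step):
        window = first_audio[lo:lo + n]
        # Negative mean squared difference: 0.0 for a perfect match.
        sim = -sum((a - b) ** 2 for a, b in zip(window, second_audio)) / n
        if sim > best_sim:
            best_offset, best_sim = lo, sim
    return best_offset, best_sim
```

The returned offset locates the target audio data inside the first audio data; its start timestamp then feeds the offset computation of claim 7.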
7. The method for detecting synchronization of audio and video according to claim 5, wherein the determining the offset of audio and video between the target audio data and the first key frame using the second audio data and the target second key frame as audio and video synchronization references comprises:
determining a reference time difference when the audio and the video are synchronized based on the start time stamp of the second audio data and the play time stamp of the target second key frame;
determining a target time difference corresponding to the first key frame based on the start time stamp of the target audio data and the play time stamp of the first key frame;
and determining the sound-picture offset corresponding to the first key frame based on the target time difference and the reference time difference.
8. The method for detecting synchronization of audio and video according to claim 5, wherein the determining the result of synchronization of audio and video of the first video based on the audio and video offset corresponding to the first key frame includes:
if the sound-picture offset corresponding to each first key frame in the first video is within a preset allowable range, determining that the sound-picture synchronization detection result of the first video is that sound and picture are synchronized;
if the sound-picture offset corresponding to at least one first key frame in the first video is not within the preset allowable range, determining that the sound-picture synchronization detection result of the first video is that sound and picture are not synchronized.
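The decision rule of claim 8 is a simple all-or-nothing check over the per-key-frame offsets. The 45 ms tolerance below is an illustrative default (broadcast practice often uses tolerances of this order), not a value given by the patent:

```python
def detect_sync(offsets, tolerance=0.045):
    # Claim 8: the first video is judged synchronized only if every key
    # frame's sound-picture offset (seconds) lies within the preset
    # allowable range; a single out-of-range offset means out of sync.
    return all(abs(offset) <= tolerance for offset in offsets)
```

A deployment might instead report which key frames exceeded the tolerance, to localize where in the transcoded video the drift occurs.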
9. A sound and picture synchronization detection device, comprising:
the first video acquisition module is used for acquiring a first video to be detected, wherein the first video is obtained by transcoding a second video in which sound and picture are synchronized;
the key frame alignment module is used for extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame with the second key frame and determining a target second key frame aligned with the first key frame;
the audio data extraction module is used for extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
and the audio-video synchronization detection module is used for matching the first audio data with the second audio data and determining an audio-video synchronization detection result of the first video.
10. An electronic device, the electronic device comprising:
one or more processors;
storage means for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the sound and picture synchronization detection method as claimed in any one of claims 1 to 8.
11. A storage medium containing computer executable instructions which, when executed by a computer processor, are for performing the sound and picture synchronization detection method according to any one of claims 1-8.
CN202310813430.4A 2023-07-04 2023-07-04 Sound and picture synchronous detection method, device, equipment and storage medium Pending CN116708892A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310813430.4A CN116708892A (en) 2023-07-04 2023-07-04 Sound and picture synchronous detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310813430.4A CN116708892A (en) 2023-07-04 2023-07-04 Sound and picture synchronous detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116708892A 2023-09-05

Family

ID=87839105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310813430.4A Pending CN116708892A (en) 2023-07-04 2023-07-04 Sound and picture synchronous detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116708892A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958331A (en) * 2023-09-20 2023-10-27 四川蜀天信息技术有限公司 Sound and picture synchronization adjusting method and device and electronic equipment
CN116958331B (en) * 2023-09-20 2024-01-19 四川蜀天信息技术有限公司 Sound and picture synchronization adjusting method and device and electronic equipment

Similar Documents

Publication Publication Date Title
US9961398B2 (en) Method and device for switching video streams
CN111064987B (en) Information display method and device and electronic equipment
CN113411642A (en) Screen projection method and device, electronic equipment and storage medium
CN110418183B (en) Audio and video synchronization method and device, electronic equipment and readable medium
CN116708892A (en) Sound and picture synchronous detection method, device, equipment and storage medium
US20240121349A1 (en) Video shooting method and apparatus, electronic device and storage medium
CN114095671A (en) Cloud conference live broadcast system, method, device, equipment and medium
CN113992926B (en) Interface display method, device, electronic equipment and storage medium
CN113259729B (en) Data switching method, server, system and storage medium
CN113144620A (en) Detection method, device, platform, readable medium and equipment for frame synchronization game
CN111669625A (en) Processing method, device and equipment for shot file and storage medium
WO2023098576A1 (en) Image processing method and apparatus, device, and medium
CN116033199A (en) Multi-device audio and video synchronization method and device, electronic device and storage medium
CN114584822B (en) Synchronous playing method and device, terminal equipment and storage medium
CN114125358A (en) Cloud conference subtitle display method, system, device, electronic equipment and storage medium
CN113891057A (en) Video processing method and device, electronic equipment and storage medium
CN113891108A (en) Subtitle optimization method and device, electronic equipment and storage medium
CN115209215A (en) Video processing method, device and equipment
CN112437289B (en) Switching time delay obtaining method
CN117544814B (en) Data processing method, device, equipment and readable medium based on law enforcement recorder
CN113518158B (en) Video splicing method and device, electronic equipment and readable storage medium
CN116249004A (en) Video acquisition control method, device, equipment and storage medium
CN113810680A (en) Audio synchronization detection method and device, computer readable medium and electronic equipment
CN116204103A (en) Information generation method, information display device, information generation apparatus, information display apparatus, and storage medium
CN117768722A (en) Method, device, electronic equipment and storage medium for processing audio and video live stream

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination