CN111093108B - Sound and picture synchronization judgment method and device, terminal and computer readable storage medium - Google Patents


Info

Publication number
CN111093108B
CN111093108B
Authority
CN
China
Prior art keywords
fingerprint information
audio data
sound
multimedia data
video frame
Prior art date
Legal status
Active
Application number
CN201911311808.0A
Other languages
Chinese (zh)
Other versions
CN111093108A (en)
Inventor
梁衍鹏
Current Assignee
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd
Priority to CN201911311808.0A
Publication of CN111093108A
Application granted
Publication of CN111093108B

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/4302: Content synchronisation processes, e.g. decoder synchronisation
    • H04N 21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85: Assembly of content; Generation of multimedia applications
    • H04N 21/854: Content authoring
    • H04N 21/8547: Content authoring involving timestamps for synchronizing content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40: Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/43: Querying
    • G06F 16/432: Query formulation
    • G06F 16/433: Query formulation using audio data
    • G06F 16/48: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/489: Retrieval characterised by using time information

Abstract

The disclosure provides a sound and picture synchronization judgment method, a device, a terminal and a computer readable storage medium, belonging to the technical field of audio and video processing. The method comprises the following steps: obtaining first audio data corresponding to a target video frame of first multimedia data received from a server, determining first fingerprint information of the first audio data, comparing the first fingerprint information with at least one piece of second fingerprint information, and, if the first fingerprint information is the same as any piece of second fingerprint information, determining that the sound and picture of the first multimedia data are synchronized. A fingerprint is unique and can therefore serve as an identifier of audio data. With this method, the fingerprint of the audio data is extracted and sound-picture synchronization is judged from the fingerprint, so manual observation is not needed; this improves the efficiency of sound-picture synchronization judgment, avoids the errors introduced by manual judgment, and improves judgment accuracy.

Description

Sound and picture synchronization judgment method and device, terminal and computer readable storage medium
Technical Field
The present disclosure relates to the field of audio and video processing technologies, and in particular, to a method, an apparatus, a terminal, and a computer-readable storage medium for determining audio and video synchronization.
Background
With the continuous development of computer technology, the computer-based entertainment industry has also flourished; the live streaming industry in particular has gained widespread acceptance, expanding from the earliest game live broadcasts to entertainment live broadcasts, outdoor live broadcasts, and the like. During audio and video live broadcasting, the synchronization of sound playback and picture display, that is, sound-picture synchronization, is very important. If sound and picture fall out of sync, the user experience is seriously affected, and viewers may mistakenly believe that the anchor is faking the broadcast, which seriously damages the anchor's reputation. At present, sound-picture synchronization is mainly judged manually: a person watches the played picture and listens to the played sound and decides whether they are synchronized. Because a person can only inspect a limited amount of content in a given time, this judgment is inefficient; and because the judgment is highly subjective, errors occur easily, so the accuracy of sound-picture synchronization judgment is also low.
Disclosure of Invention
The embodiment of the disclosure provides a method, a device, a terminal and a computer readable storage medium for judging sound and picture synchronization, which can solve the problems of low efficiency and low accuracy of sound and picture synchronization judgment in the related art.
The technical scheme is as follows:
In one aspect, a sound and picture synchronization judgment method is provided, the method comprising:
acquiring first audio data corresponding to a target video frame in first multimedia data based on the first multimedia data received from a server;
determining first fingerprint information of the first audio data based on the first audio data;
comparing the first fingerprint information with at least one second fingerprint information, wherein one second fingerprint information is used for representing second audio data corresponding to the target video frame in second multimedia data received by the server, and the second multimedia data and the first multimedia data correspond to the same online playing process;
and if the first fingerprint information is the same as any one of the second fingerprint information, determining that the sound and the picture of the first multimedia data are synchronous.
In a possible implementation, before the comparing the first fingerprint information with the at least one second fingerprint information, the method further comprises:
and acquiring a message carrying the target video frame, and acquiring at least one piece of second fingerprint information from the acquired message.
In one possible implementation, the at least one second fingerprint information includes:
fingerprint information of second audio data in the second multimedia data whose timestamp is the same as that of the target video frame, and fingerprint information of second audio data in the second multimedia data whose timestamp is separated from that of the target video frame by less than a target duration.
In one possible implementation, before determining the first fingerprint information of the first audio data based on the first audio data, the method further includes:
the first audio data is down-sampled.
In a possible implementation, after comparing the first fingerprint information with at least one second fingerprint information, the method further includes:
and if the first fingerprint information is different from every piece of the at least one piece of second fingerprint information, determining that the sound and picture of the first multimedia data are not synchronized.
In another aspect, a sound and picture synchronization judgment method is provided, the method comprising:
in the online playing process, at least one second audio data corresponding to a target video frame in second multimedia data is obtained;
determining at least one second fingerprint information based on the at least one second audio data;
adding the at least one piece of second fingerprint information to a message carrying the target video frame;
and sending the message to a server.
In a possible implementation, before the determining at least one second fingerprint information based on the at least one second audio data, the method further includes:
the at least one second audio data is down-sampled.
In one aspect, a sound and picture synchronization judging device is provided, the device comprising:
the data acquisition module is used for acquiring first audio data corresponding to a target video frame in the first multimedia data based on the first multimedia data received from the server;
the fingerprint information determining module is used for determining first fingerprint information of the first audio data based on the first audio data;
the comparison module is used for comparing the first fingerprint information with at least one piece of second fingerprint information, wherein one piece of second fingerprint information is used for representing second audio data corresponding to the target video frame in second multimedia data received by the server, and the second multimedia data and the first multimedia data correspond to the same online playing process;
and the determining module is used for determining the sound and picture synchronization of the first multimedia data if the first fingerprint information is the same as any one of the second fingerprint information.
In one possible implementation, the apparatus further includes:
the message acquisition module is used for acquiring a message for bearing the target video frame;
and the fingerprint information acquisition module is used for acquiring at least one piece of second fingerprint information from the acquired message.
In one possible implementation, the apparatus further includes:
and the down-sampling processing module is used for carrying out down-sampling processing on the first audio data.
In a possible implementation manner, the determining module is further configured to determine that the sound and the picture of the first multimedia data are not synchronous if the first fingerprint information is different from the at least one second fingerprint information.
In one aspect, a sound and picture synchronization judging device is provided, the device comprising:
the data acquisition module is used for acquiring at least one second audio data corresponding to a target video frame in the second multimedia data in the online playing process;
a fingerprint information determination module for determining at least one second fingerprint information based on the at least one second audio data;
the adding module is used for adding the at least one piece of second fingerprint information into a message bearing the target video frame;
and the sending module is used for sending the message to the server.
In one possible implementation, the apparatus further includes:
and the down-sampling processing module is used for performing down-sampling processing on the at least one second audio data.
In one aspect, a terminal is provided and includes one or more processors and one or more memories, where at least one program code is stored in the one or more memories, and the program code is loaded and executed by the one or more processors to implement the operations performed by the sound and picture synchronization determining method.
In one aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the program code is loaded and executed by a processor to implement the operations performed by the sound and picture synchronization determining method.
The method comprises the steps of obtaining first audio data corresponding to a target video frame of first multimedia data received from a server, determining first fingerprint information of the first audio data, comparing the first fingerprint information with at least one piece of second fingerprint information, and determining that the sound and picture of the first multimedia data are synchronized if the first fingerprint information is the same as any piece of second fingerprint information. Because a fingerprint is unique, it can serve as an identifier of audio data. By extracting the fingerprint of the audio data mechanically and judging sound-picture synchronization from the fingerprint, manual observation is not needed; this improves the efficiency of the synchronization judgment, avoids the errors introduced by manual judgment, and improves judgment accuracy.
Drawings
To describe the technical solutions in the embodiments of the present disclosure more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present disclosure; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an implementation environment of a sound-picture synchronization determination method provided in an embodiment of the present disclosure;
fig. 2 is a flowchart of a method for determining synchronization between sound and picture according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a method for determining synchronization between sound and picture according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a method for determining synchronization between sound and picture according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of a method for processing anchor multimedia data according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram illustrating a method for processing multimedia data at a user end according to an embodiment of the disclosure;
fig. 7 is a structural diagram of a sound-picture synchronization determination apparatus according to an embodiment of the present disclosure;
fig. 8 is a structural diagram of a sound-picture synchronization determination apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a terminal provided in an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Video frames of video data can be divided into I frames, P frames, and B frames. An I frame, also called an intra picture or intra-coded frame, is a key frame, usually the first frame in a video sequence; it is an independent frame that carries all of its own information and retains a complete picture, so it can be decoded on its own without reference to other pictures. A P frame, also called a forward predicted picture, retains only the difference between the current frame and the preceding frame; during decoding, the difference defined by the frame is superimposed on the previously cached picture to generate the final picture. A B frame, also called a bi-predictive picture, retains the differences between the current frame and both the preceding and following frames; to decode a B frame, both the previously cached picture and the picture of the following frame are needed, and the final picture is obtained by superimposing the preceding and following pictures onto the current frame's data.
Audio fingerprint: a characteristic value of a piece of audio that distinguishes it from other audio data.
Fig. 1 is a schematic diagram of an implementation environment of a method for determining synchronization between sound and pictures provided in an embodiment of the present disclosure, and referring to fig. 1, the implementation environment includes: a first terminal 101, a server 102 and a second terminal 103.
The first terminal 101 may be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, and the like. The first terminal 101 may be a viewer terminal, on which related applications, such as live streaming software, may be installed and running. The first terminal 101 may be connected to the server 102 through a wired network or a wireless network to receive the multimedia data transmitted by the server 102. The first terminal 101 may also calculate a fingerprint of the received audio data and then judge whether the sound and the picture are synchronized.
The server 102 is at least one of a single server, multiple servers, a cloud computing platform, and a virtualization center. The server 102 may be connected to the first terminal 101 and the second terminal 103 through a wireless network or a wired network, receive the multimedia data sent by the second terminal 103, and send multimedia data to the first terminal 101. The server 102 may further calculate a fingerprint of the audio data according to the received multimedia data and add the fingerprint to a message of the multimedia data. Of course, the background server 102 of the application may also include other functional servers to provide more comprehensive and diversified services.
The second terminal 103 may be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an MP3 player, an MP4 player, a laptop portable computer, and the like. The second terminal 103 may be an anchor terminal, on which related applications, such as live streaming software and stream pushing software, may be installed and running. The second terminal 103 may collect video data through a camera assembly and audio data through a microphone assembly, thereby obtaining multimedia data for transmission. The second terminal 103 may be connected to the server 102 through a wired network or a wireless network to transmit multimedia data to the server 102. Before sending the multimedia data, the second terminal 103 may calculate the fingerprint of the audio data, encapsulate the audio data, the fingerprint of the audio data, and the video data in messages, further encapsulate the messages in data packets, and transmit the audio data and the video data through the data packets.
Those skilled in the art will appreciate that the number of first terminals 101 and second terminals 103 may be greater or smaller. For example, there may be only a few first terminals and second terminals, or there may be dozens, hundreds, or more; the embodiments of the present disclosure do not limit the number or device types of the first and second terminals. In some possible implementations, the first terminal and the second terminal may be the same terminal; for example, the anchor and the viewer may log in to different clients on the same terminal, with the anchor streaming live on a first client of the terminal and the viewer watching the live stream on a second client of the terminal.
Fig. 2 is a flowchart of a sound-picture synchronization determining method provided by an embodiment of the present disclosure, and referring to fig. 2, the method is applied to a first terminal, for example, a viewer terminal, and the method includes:
201. Based on first multimedia data received from a server, obtain first audio data corresponding to a target video frame in the first multimedia data.
202. Based on the first audio data, determine first fingerprint information of the first audio data.
203. Compare the first fingerprint information with at least one piece of second fingerprint information, where one piece of second fingerprint information represents second audio data corresponding to the target video frame in second multimedia data received by the server, and the second multimedia data and the first multimedia data correspond to the same online playing process.
204. If the first fingerprint information is the same as any piece of the second fingerprint information, determine that the sound and picture of the first multimedia data are synchronized.
According to this scheme, first audio data corresponding to a target video frame of first multimedia data received from the server is obtained, first fingerprint information of the first audio data is determined, the first fingerprint information is compared with at least one piece of second fingerprint information, and if the first fingerprint information is the same as any piece of second fingerprint information, it is determined that the sound and picture of the first multimedia data are synchronized. Because a fingerprint is unique, it can serve as an identifier of audio data. By extracting the fingerprint of the audio data mechanically and judging sound-picture synchronization from the fingerprint, manual observation is not needed; this improves the efficiency of the synchronization judgment, avoids the errors introduced by manual judgment, and improves judgment accuracy.
In a possible implementation, before the comparing the first fingerprint information with the at least one second fingerprint information, the method further comprises:
and acquiring a message carrying the target video frame, and acquiring at least one piece of second fingerprint information from the acquired message.
In one possible implementation, the at least one second fingerprint information includes:
fingerprint information of second audio data in the second multimedia data whose timestamp is the same as that of the target video frame, and fingerprint information of second audio data in the second multimedia data whose timestamp is separated from that of the target video frame by less than a target duration.
In one possible implementation, before determining the first fingerprint information of the first audio data based on the first audio data, the method further includes:
the first audio data is down-sampled.
In a possible implementation, after comparing the first fingerprint information with at least one second fingerprint information, the method further includes:
and if the first fingerprint information is different from every piece of the at least one piece of second fingerprint information, determining that the sound and picture of the first multimedia data are not synchronized.
Fig. 3 is a flowchart of a sound-picture synchronization determining method provided in an embodiment of the present disclosure, and referring to fig. 3, the method is applied to a second terminal, for example, an anchor terminal, and the method includes:
301. In the online playing process, obtain at least one piece of second audio data corresponding to a target video frame in second multimedia data.
302. Based on the at least one piece of second audio data, determine at least one piece of second fingerprint information.
303. Add the at least one piece of second fingerprint information to a message carrying the target video frame.
304. Send the message to a server.
According to this scheme, at least one piece of second audio data corresponding to a target video frame of second multimedia data is obtained, at least one piece of second fingerprint information is determined, the at least one piece of second fingerprint information is added to the message carrying the target video frame, and the message is sent to the server. A fingerprint is unique and can serve as an identifier of audio data. The fingerprint of the audio data is extracted mechanically, added to the message carrying the multimedia data, and sent to the server, which forwards it to the first terminal; the first terminal then judges sound-picture synchronization from the fingerprint. No manual observation is needed, so the efficiency of the synchronization judgment is improved, errors introduced by manual judgment are avoided, and judgment accuracy is improved.
In a possible implementation, before the determining at least one second fingerprint information based on the at least one second audio data, the method further includes:
the at least one second audio data is down-sampled.
Fig. 4 is a flowchart of a method for determining synchronization of sound and picture provided by an embodiment of the present disclosure, where the method takes an interaction between a first terminal, a server, and a second terminal as an example, and referring to fig. 4, the method includes:
401. In the online playing process, the second terminal acquires second multimedia data based on a received operation instruction.
It should be noted that the second terminal may be an anchor terminal.
In a possible implementation manner, the anchor may trigger a live broadcast instruction on the second terminal. When the second terminal detects the live broadcast instruction, it may collect video data through the camera assembly and audio data through the microphone assembly, and stamp the obtained video data and audio data with timestamps corresponding to the time at which they were obtained; this video data and audio data may serve as the second multimedia data.
The camera assembly and the audio circuit may be built in the second terminal, or may be an external assembly connected to the second terminal, which is not limited in the embodiments of the present disclosure.
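For concreteness in the examples that follow, the sketch below shows one way the captured, timestamped video and audio could be represented; it is an illustrative assumption, and the type and field names (MediaFrame, pts_ms, payload) are not from the patent.

```python
# Minimal, assumed representation of a captured media frame; the patent does
# not prescribe a data structure, so this is purely illustrative.
import time
from dataclasses import dataclass

@dataclass
class MediaFrame:
    kind: str       # "video" or "audio"
    pts_ms: int     # timestamp in milliseconds, stamped when the data is obtained
    payload: bytes  # raw or encoded media data

def stamp(kind: str, payload: bytes, session_start_ms: int) -> MediaFrame:
    # Stamp the frame with a timestamp relative to the start of the session.
    return MediaFrame(kind, int(time.time() * 1000) - session_start_ms, payload)
```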
402. And the second terminal acquires at least one second audio data corresponding to the target video frame in the second multimedia data.
It should be noted that the second terminal may take any one of the acquired video frames as the target video frame, obtaining a single target video frame, or may select one target video frame every fixed number of frames, obtaining a plurality of target video frames. When selecting the target video frame, since an I frame carries the most complete information, an I frame may be used as the target video frame; optionally, the second terminal may also use another type of video frame as the target video frame, which is not limited in this disclosure.
In a possible implementation manner, the second terminal may take the first acquired video frame as the target video frame and, according to the timestamp of that video frame, query the audio queue for the at least one piece of second audio data whose timestamps are closest to the timestamp of the target video frame. For example, when the second terminal uses an I frame as the target video frame, the 4 pieces of second audio data closest in timestamp may be queried from the obtained second multimedia data according to the timestamp of the I frame. Optionally, the second terminal may obtain more or fewer pieces of second audio data corresponding to the target video frame; the embodiments of the present disclosure do not limit this number.
In another possible implementation manner, the second terminal may further select a target video frame every fixed number of frames starting from the position of the first target video frame, and then acquire at least one piece of second audio data corresponding to each of the resulting target video frames; the embodiments of the present disclosure do not limit the fixed number of frames between target video frames.
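As a hedged illustration of the nearest-timestamp lookup described above, the following sketch selects the K pieces of audio data whose timestamps are closest to that of the target video frame; K = 4 mirrors the example in the text, and the records are assumed to expose a pts_ms attribute as in the earlier MediaFrame sketch.

```python
def nearest_audio(audio_queue, video_pts_ms: int, k: int = 4):
    # Sort the audio queue by the distance between each audio timestamp and
    # the target video frame's timestamp, and keep the k closest pieces.
    return sorted(audio_queue, key=lambda a: abs(a.pts_ms - video_pts_ms))[:k]
```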
403. The second terminal down-samples the at least one second audio data.
In a possible implementation manner, the second terminal passes the at least one piece of second audio data through a low-pass filter one piece at a time and samples the output of the low-pass filter. For example, the output may be decimated by a factor of M, that is, one sample is kept out of every M samples (M-1 samples are skipped), where M is a positive integer greater than or equal to 1; the specific value of M is not limited in the embodiments of the present disclosure.
It should be noted that the low-pass filter removes the high-frequency components of the second audio signal to prevent aliasing, and the down-sampling process reduces the amount of data, which simplifies subsequent processing.
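A possible realization of this step is sketched below. The patent names no library; scipy.signal.decimate is an assumed choice that applies an anti-aliasing low-pass filter and then keeps one sample out of every M, matching the description above.

```python
import numpy as np
from scipy import signal

def downsample(pcm: np.ndarray, m: int) -> np.ndarray:
    # pcm: mono PCM samples as floats; m: the decimation factor M above.
    # decimate() low-pass filters first (an FIR filter here), then keeps every
    # M-th sample, i.e. it skips M-1 points between retained points.
    if m <= 1:
        return pcm
    return signal.decimate(pcm, m, ftype="fir", zero_phase=True)
```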
404. The second terminal determines at least one second fingerprint information based on the at least one second audio data after the down-sampling processing.
It should be noted that the fingerprint can be used as a unique identifier of a piece of audio, representing a unique digital characteristic of the audio.
In a possible implementation manner, the second terminal applies a window to each piece of down-sampled second audio data, performs a fast Fourier transform on the windowed audio data, divides the resulting spectrum into frequency bands, and takes the peak signal in each band as the signature of that band, thereby constructing the fingerprint information of each piece of second audio data as the second fingerprint information.
The second terminal may use a Hann (hanning) window when windowing the down-sampled second audio data; optionally, another sliding window may be selected for the windowing.
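The following is a minimal, non-authoritative sketch of the fingerprint construction just described: Hann window, fast Fourier transform, spectrum divided into bands, and the peak of each band kept as that band's signature. The band edges, the 1024-sample frame length, and the packing of peak indices into a single integer are illustrative assumptions, since the patent does not fix them.

```python
import numpy as np

# Assumed band edges over the rfft bins of a 1024-sample frame.
BANDS = [(0, 10), (10, 20), (20, 40), (40, 80), (80, 160), (160, 512)]

def fingerprint(frame: np.ndarray) -> int:
    # frame: 1024 down-sampled PCM samples.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    fp = 0
    for lo, hi in BANDS:
        peak = lo + int(np.argmax(spectrum[lo:hi]))  # peak bin = band signature
        fp = (fp << 10) | peak                       # pack 10 bits per band
    return fp
```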
405. And the second terminal adds the at least one piece of second fingerprint information to a message carrying the target video frame.
It should be noted that a message is a unit of network transmission; during transmission it is successively encapsulated into packets and frames by adding additional information fields. A reserved field can be preset in the message to store the fingerprint information.
In a possible implementation manner, the second terminal may encode the second multimedia data, encapsulate it in messages, and add the at least one piece of second fingerprint information to the reserved field of the custom message corresponding to the target video frame for subsequent transmission.
Referring to fig. 5, fig. 5 is a schematic diagram of a processing method for multimedia data at the anchor terminal according to an embodiment of the present disclosure, using as an example the fingerprint information of the piece of second audio data closest to an I frame being calculated as the second fingerprint information and packed into the video I frame message; it visualizes steps 403 to 405 above. The second terminal may find the audio data closest to the I frame in the audio queue according to the timestamps of each video frame and each audio frame, down-sample that audio data, calculate its fingerprint, and add the fingerprint to the I frame message for use in the subsequent sound-picture synchronization judgment.
It should be noted that, when the second terminal adds the at least one piece of second fingerprint information to the custom message corresponding to the target video frame, the timestamp information of the second audio data corresponding to each fingerprint may also be added to the reserved field of the message.
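The patent does not specify a byte layout for the reserved field, so the following packing is purely hypothetical: a count byte followed by (timestamp, fingerprint) pairs, reflecting the fact that both the fingerprints and their audio timestamps are carried in the reserved field.

```python
import struct

def pack_reserved_field(entries) -> bytes:
    # entries: list of (audio_pts_ms, fingerprint) pairs for one target frame.
    # Hypothetical layout: 1 count byte, then little-endian int64/uint64 pairs.
    buf = struct.pack("<B", len(entries))
    for pts_ms, fp in entries:
        buf += struct.pack("<qQ", pts_ms, fp & (2**64 - 1))
    return buf
```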
406. And the second terminal sends the message to the server.
In a possible implementation manner, the second terminal may send the encapsulated second multimedia data, in the form of messages, to the server in the server cluster closest to the second terminal through the stream pushing software installed on the second terminal.
407. The server processes the second multimedia data in the received message to obtain first multimedia data, and sends the message bearing the first multimedia data to the first terminal.
It should be noted that the first terminal may be a viewer terminal. A viewer may select, on the first terminal, an anchor using a second terminal and trigger the watch-live button corresponding to that anchor to watch the anchor's live broadcast. The first terminal may send a multimedia data acquisition request to the server according to the user's trigger operation, and the server sends the message carrying the first multimedia data to the first terminal in response to the multimedia data acquisition request.
In a possible implementation manner, after receiving the second multimedia data sent by the second terminal, the server may decode it. When any server in the server cluster receives a multimedia data acquisition request from the first terminal, it may, according to the multimedia data identifier carried in the request, obtain the decoded second multimedia data from the server closest to the second terminal, encode the decoded multimedia data to obtain the first multimedia data, encapsulate the first multimedia data into messages, and send the messages carrying the first multimedia data to the first terminal.
408. The first terminal obtains first audio data corresponding to a target video frame in the first multimedia data based on the received first multimedia data, wherein the first multimedia data and the second multimedia data correspond to the same online playing process.
In a possible implementation manner, after receiving the messages carrying the first multimedia data sent by the server, the first terminal may decode them to obtain the first multimedia data, where the first multimedia data and the second multimedia data correspond to the same online playing process. The first terminal may inspect each video frame in the received first multimedia data; when it detects that the reserved field of the message of some video frame contains fingerprint information, it may determine that this video frame is the target video frame. The first terminal may also stamp the received video and audio with corresponding timestamps and, based on the timestamp of the target video frame, obtain the first audio data whose timestamp is separated from it by the smallest interval, that is, the first audio data corresponding to the target video frame.
409. The first terminal performs down-sampling processing on the first audio data.
In a possible implementation manner, the first terminal passes the first audio data through a low-pass filter and samples the output of the low-pass filter. For example, the output may be decimated by a factor of N, that is, one sample is kept out of every N samples (N-1 samples are skipped), where N is a positive integer greater than or equal to 1; the specific value of N is not limited in the embodiments of the present disclosure.
It should be noted that the low-pass filter removes the high-frequency components of the first audio signal to prevent aliasing, and the down-sampling process reduces the amount of data, which simplifies subsequent processing.
410. The first terminal determines first fingerprint information of the first audio data based on the first audio data after the down-sampling processing.
In a possible implementation manner, the first terminal applies a window to the down-sampled first audio data, performs a fast Fourier transform on the windowed audio data, divides the resulting spectrum into frequency bands, and takes the peak signal in each band as the signature of that band, thereby constructing the fingerprint information of the first audio data as the first fingerprint information.
The first terminal may use a Hann (hanning) window when windowing the down-sampled first audio data; optionally, another sliding window may be selected for the windowing.
411. The first terminal acquires the message carrying the target video frame and acquires the at least one piece of second fingerprint information from the acquired message.
In a possible implementation manner, the first terminal may parse the messages carrying the first multimedia data, obtain the parsed message corresponding to the target video frame, and extract the at least one piece of second fingerprint information from the reserved field of that message.
It should be noted that, while acquiring the at least one piece of second fingerprint information from the message, the first terminal may also acquire the timestamp information of the second audio data corresponding to each piece of second fingerprint information.
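Under the same hypothetical layout as the anchor-side packing sketch, the first terminal could read the fingerprints and their timestamps back out of the reserved field as follows.

```python
import struct

def unpack_reserved_field(buf: bytes):
    # Inverse of the hypothetical pack_reserved_field() sketch above.
    (count,) = struct.unpack_from("<B", buf, 0)
    entries, offset = [], 1
    for _ in range(count):
        pts_ms, fp = struct.unpack_from("<qQ", buf, offset)
        entries.append((pts_ms, fp))
        offset += struct.calcsize("<qQ")
    return entries
```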
412. The first terminal compares the first fingerprint information with the at least one piece of second fingerprint information, where one piece of second fingerprint information represents second audio data corresponding to the target video frame in the second multimedia data received by the server.
In a possible implementation manner, the first terminal may compare the first fingerprint information with any one of the at least one second fingerprint information to determine a sound-picture synchronization condition.
In another possible implementation manner, the first terminal may, according to the acquired timestamp information of each piece of second fingerprint information and the time indicated by the timestamp of the first audio data corresponding to the first fingerprint information, sort the pieces of second fingerprint information in increasing order of the time interval between the time indicated by their timestamp information and the time indicated by the timestamp corresponding to the first fingerprint information, and then compare each piece of second fingerprint information with the first fingerprint information in that order to determine the sound-picture synchronization condition. The embodiments of the present disclosure do not limit which of these manners is adopted.
413. If the first fingerprint information is the same as any piece of the second fingerprint information, the first terminal determines that the sound and picture of the first multimedia data are synchronized; if the first fingerprint information is different from every piece of the at least one piece of second fingerprint information, the first terminal determines that the sound and picture of the first multimedia data are not synchronized.
It should be noted that the first fingerprint information is compared with the acquired at least one piece of second fingerprint information: when the first fingerprint information is the same as one of the pieces of second fingerprint information, it may be determined that the sound and picture of the first multimedia data are synchronized, and when the first fingerprint information differs from every piece of second fingerprint information, it may be determined that they are not synchronized. Comparing the first fingerprint information with a plurality of pieces of second fingerprint information tolerates a small sound-picture offset; for example, the first fingerprint information may be compared with the 4 nearest pieces of second fingerprint information. Because the comparison is limited to the 4 nearest pieces, any remaining offset is too small for the human eye to perceive, so the user's viewing experience is still guaranteed.
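Putting steps 412 and 413 together, a minimal sketch of the decision, assuming the (timestamp, fingerprint) entries recovered from the message, sorts the candidates by timestamp distance and reports synchronization as soon as any candidate matches.

```python
def is_av_synchronized(first_fp: int, first_pts_ms: int, second_entries) -> bool:
    # second_entries: (pts_ms, fingerprint) pairs recovered from the message.
    # Sort by how close each audio timestamp is to that of the first audio
    # data, then declare sound and picture synchronized on any exact match.
    ordered = sorted(second_entries, key=lambda e: abs(e[0] - first_pts_ms))
    return any(fp == first_fp for _, fp in ordered)
```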
In other possible implementation manners, the first terminal may, in step 408, obtain a plurality of pieces of first audio data corresponding to the target video frame of the first multimedia data, apply a processing procedure similar to steps 409 to 412 to them, and then judge sound-picture synchronization: when one of the plurality of pieces of first fingerprint information is the same as any piece of the at least one piece of second fingerprint information, it may be determined that the sound and picture are synchronized, and when every piece of first fingerprint information differs from the at least one piece of second fingerprint information, it may be determined that the sound and picture are not synchronized.
Referring to fig. 6, fig. 6 is a schematic diagram of a method for processing multimedia data at the viewer terminal according to an embodiment of the present disclosure, using as an example the comparison between the first fingerprint information and the closest piece of second fingerprint information; it visualizes steps 409 to 412 above. The first terminal may identify the audio data packet closest to the I frame in the audio queue, down-sample it, and thereby calculate the first fingerprint information corresponding to that data packet; it may then extract the second fingerprint information from the message of the I frame in the video queue and compare the first fingerprint information with the second fingerprint information. When the two are the same, it may be determined that the sound and picture of the first multimedia data are synchronized; when they differ, it may be determined that they are not synchronized.
It should be noted that the above process is described by taking a live broadcast as an example; in other possible implementation manners, the scheme may also be applied to other online playing processes, which the embodiments of the present disclosure do not limit.
With this scheme, during online playing, the anchor terminal can take the target video frame as a reference point, down-sample the audio packet closest to it (or packets close to it), calculate the fingerprint of that audio, and place the calculated fingerprint in the custom message following the target video frame. The viewer terminal finds the audio packets closest to (or close to) the target video frame, calculates the audio fingerprint by the same method as the anchor terminal, and compares the calculated fingerprint with the one carried in the custom message accompanying the target video frame, thereby judging the sound-picture synchronization condition. Because a fingerprint is unique, it can serve as an identifier of audio data. The fingerprint of the audio data is extracted mechanically, the multimedia data including the fingerprint is sent to the server and forwarded by the server to the first terminal, and the first terminal judges sound-picture synchronization from the fingerprint without manual observation; this improves the efficiency of the synchronization judgment, avoids the errors introduced by manual judgment, and improves judgment accuracy.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
Fig. 7 is a structural diagram of a device for determining synchronization between sound and picture according to an embodiment of the present disclosure, and referring to fig. 7, the device includes:
a data obtaining module 701, configured to obtain first audio data corresponding to a target video frame in first multimedia data based on the first multimedia data received from a server;
a fingerprint information determining module 702, configured to determine first fingerprint information of the first audio data based on the first audio data;
a comparing module 703, configured to compare the first fingerprint information with at least one second fingerprint information, where one second fingerprint information is used to represent second audio data corresponding to the target video frame in second multimedia data received by the server, and the second multimedia data and the first multimedia data correspond to a same online playing process;
a determining module 704, configured to determine that the audio and video of the first multimedia data are synchronized if the first fingerprint information is the same as any of the second fingerprint information.
In one possible implementation, the apparatus further includes:
the message acquisition module is used for acquiring a message for bearing the target video frame;
and the fingerprint information acquisition module is used for acquiring at least one piece of second fingerprint information from the acquired message.
In one possible implementation, the apparatus further includes:
and the down-sampling processing module is used for carrying out down-sampling processing on the first audio data.
In a possible implementation manner, the determining module 704 is further configured to determine that the sound and picture of the first multimedia data are not synchronized if the first fingerprint information is different from every piece of the at least one piece of second fingerprint information.
By acquiring the first audio data corresponding to the target video frame of the first multimedia data received from the server, the device determines the first fingerprint information of the first audio data, compares the first fingerprint information with at least one piece of second fingerprint information, and determines that the sound and picture of the first multimedia data are synchronized if the first fingerprint information is the same as any piece of second fingerprint information. Because a fingerprint is unique, it can serve as an identifier of audio data. By extracting the fingerprint of the audio data mechanically and judging sound-picture synchronization from the fingerprint, manual observation is not needed; this improves the efficiency of the synchronization judgment, avoids the errors introduced by manual judgment, and improves judgment accuracy.
It should be noted that: the sound-picture synchronization determining apparatus provided in the above embodiment is only illustrated by dividing the functional modules when determining sound-picture synchronization, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the terminal is divided into different functional modules to complete all or part of the functions described above. In addition, the sound and picture synchronization determination apparatus and the sound and picture synchronization determination method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Fig. 8 is a structural diagram of a device for judging synchronization between sound and picture provided by an embodiment of the present disclosure, and referring to fig. 8, the device includes:
a data obtaining module 801, configured to obtain at least one second audio data corresponding to a target video frame in second multimedia data in an online playing process;
a fingerprint information determining module 802 for determining at least one second fingerprint information based on the at least one second audio data;
an adding module 803, configured to add the at least one piece of second fingerprint information to a packet carrying the target video frame;
a sending module 804, configured to send the message to a server.
By acquiring at least one piece of second audio data corresponding to the target video frame of the second multimedia data, the device determines at least one piece of second fingerprint information, adds the at least one piece of second fingerprint information to the message carrying the target video frame, and sends the message to the server. A fingerprint is unique and can serve as an identifier of audio data. The fingerprint of the audio data is extracted mechanically and added to the message carrying the multimedia data; the message is sent to the server and then forwarded by the server to the first terminal, which judges sound-picture synchronization from the fingerprint. No manual observation is needed, so the efficiency of the synchronization judgment is improved, errors introduced by manual judgment are avoided, and judgment accuracy is improved.
In one possible implementation, the apparatus further includes:
and the down-sampling processing module is used for performing down-sampling processing on the at least one second audio data.
It should be noted that: the sound-picture synchronization determining apparatus provided in the above embodiment is only illustrated by dividing the functional modules when determining sound-picture synchronization, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the terminal is divided into different functional modules to complete all or part of the functions described above. In addition, the sound and picture synchronization determination apparatus and the sound and picture synchronization determination method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present disclosure. The terminal 900 may be: a smartphone, a tablet, an MP3 player, an MP4 player, a laptop, or a desktop computer. Terminal 900 may also be referred to by other names such as user equipment, portable terminals, laptop terminals, desktop terminals, and the like.
In general, terminal 900 includes: one or more processors 901 and one or more memories 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 902 is used for storing at least one program code for execution by the processor 901 to implement the sound and picture synchronization determination method provided by the method embodiments in the present disclosure.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 904, display screen 905, camera 906, audio circuitry 907, positioning component 908, and power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The radio frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which are not limited by this disclosure.
The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display screen 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 905, provided on the front panel of the terminal 900; in other embodiments, there may be at least two display screens 905, each disposed on a different surface of the terminal 900 or in a foldable design; in still other embodiments, the display screen 905 may be a flexible display disposed on a curved surface or a folded surface of the terminal 900. The display screen 905 may even be arranged as a non-rectangular irregular figure, that is, an irregularly shaped screen. The display screen 905 may be an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or the like.
The camera assembly 906 is used to capture images or video. Optionally, the camera assembly 906 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 906 may also include a flash, which can be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash combines a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 907 may include a microphone and a speaker. The microphone collects sound waves of the user and the environment, converts them into electrical signals, and inputs them to the processor 901 for processing, or to the radio frequency circuit 904 for voice communication. For stereo acquisition or noise reduction, multiple microphones may be disposed at different locations of the terminal 900; the microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker converts electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a traditional membrane speaker or a piezoelectric ceramic speaker; a piezoelectric ceramic speaker can convert an electrical signal not only into sound waves audible to humans, but also into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to locate the current geographic location of the terminal 900 for navigation or LBS (Location Based Service). The positioning component 908 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 909 is used to supply power to the various components in the terminal 900. The power supply 909 may be an alternating current source, a direct current source, a disposable battery, or a rechargeable battery. When the power supply 909 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging, and may also support fast charging technology.
In some embodiments, terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.
The acceleration sensor 911 can detect the magnitudes of acceleration on the three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 901 can control the display screen 905 to display the user interface in a landscape or portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used to collect motion data of a game or of the user.
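For illustration only, the landscape/portrait decision might look like the following Python sketch; the axis convention, names, and decision rule are assumptions rather than anything prescribed by the disclosure:

def choose_orientation(gx, gy, gz):
    # gx, gy, gz: gravitational acceleration components on the three
    # axes of the terminal's coordinate system (gz, the screen-normal
    # component, is ignored in this simplified rule).
    # Gravity acting mostly along the short horizontal axis suggests
    # the terminal is held sideways, so a landscape UI fits better.
    return "landscape" if abs(gx) > abs(gy) else "portrait"

# Example: choose_orientation(9.8, 0.3, 0.1) returns "landscape".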
The gyro sensor 912 may detect the body direction and rotation angle of the terminal 900 and may cooperate with the acceleration sensor 911 to capture the user's 3D motion on the terminal 900. Based on the data collected by the gyro sensor 912, the processor 901 can implement functions such as motion sensing (for example, changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 913 may be disposed on a side bezel of the terminal 900 and/or beneath the display screen 905. When disposed on the side bezel of the terminal 900, the pressure sensor 913 can detect the user's grip signal on the terminal 900, and the processor 901 performs left-right hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 913. When disposed beneath the display screen 905, the pressure sensor 913 allows the processor 901 to control operable controls on the UI according to the user's pressure operations on the display screen 905. The operable controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 914 collects the user's fingerprint, and either the processor 901 or the fingerprint sensor 914 itself identifies the user according to the collected fingerprint. Upon recognizing the user's identity as trusted, the processor 901 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 914 may be disposed on the front, back, or side of the terminal 900. When a physical key or vendor Logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or vendor Logo.
The optical sensor 915 collects the ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 based on the ambient light intensity collected by the optical sensor 915: when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when it is low, the display brightness is reduced. In another embodiment, the processor 901 may also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
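For illustration only, a minimal sketch of this brightness adjustment in Python, assuming a simple linear mapping from ambient lux to a relative brightness level (the function name and the max_lux limit are assumptions, not part of the disclosure):

def display_brightness(ambient_lux, max_lux=1000.0):
    # Map ambient light intensity to a relative brightness in [0, 1]:
    # brighter surroundings raise the display brightness, dimmer
    # surroundings lower it.
    return max(0.0, min(ambient_lux / max_lux, 1.0))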
The proximity sensor 916, also called a distance sensor, is typically disposed on the front panel of the terminal 900 and collects the distance between the user and the front face of the terminal 900. In one embodiment, when the proximity sensor 916 detects that this distance gradually decreases, the processor 901 controls the display screen 905 to switch from the bright screen state to the dark screen state; when the proximity sensor 916 detects that the distance gradually increases, the processor 901 controls the display screen 905 to switch from the dark screen state back to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in Fig. 9 does not constitute a limitation of the terminal 900, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
Fig. 10 is a schematic structural diagram of a server provided in an embodiment of the present disclosure. The server 1000 may vary considerably in configuration or performance and may include one or more processors (CPUs) 1001 and one or more memories 1002, where the one or more memories 1002 store at least one piece of program code that is loaded and executed by the one or more processors 1001 to implement the methods provided by the foregoing method embodiments. Of course, the server 1000 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory including program code, is also provided; the program code is executable by a processor to perform the sound and picture synchronization judgment method in the above-described embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by program code instructing the relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The foregoing is considered as illustrative of the embodiments of the disclosure and is not to be construed as limiting thereof, and any modifications, equivalents, improvements and the like made within the spirit and principle of the disclosure are intended to be included within the scope of the disclosure.

Claims (15)

1. A sound and picture synchronization judging method is characterized by comprising the following steps:
acquiring first audio data corresponding to a target video frame in first multimedia data based on the first multimedia data received from a server;
determining first fingerprint information of the first audio data based on the first audio data;
comparing the first fingerprint information with at least one piece of second fingerprint information, wherein one piece of second fingerprint information is used for representing second audio data corresponding to the target video frame in second multimedia data received by the server, and the second multimedia data and the first multimedia data correspond to the same online playing process;
and if the first fingerprint information is the same as any one of the second fingerprint information, determining that the sound and the picture of the first multimedia data are synchronous.
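(Illustrative sketch, not part of the claims.) The receiver-side logic of claim 1 might be rendered in Python as follows; the fingerprint() helper is a placeholder assumption, since the claims do not prescribe a fingerprint algorithm, only that matching audio segments yield matching fingerprint information:

import hashlib

def fingerprint(audio_bytes):
    # Hypothetical fingerprint: a hash over the raw audio bytes, so
    # identical audio segments yield identical fingerprint information.
    return hashlib.md5(audio_bytes).hexdigest()

def is_sound_picture_synchronized(first_audio_data, second_fingerprints):
    # first_audio_data: bytes of the audio segment corresponding to the
    # target video frame in the first multimedia data received from the server.
    # second_fingerprints: the at least one piece of second fingerprint
    # information carried with the target video frame.
    first_fp = fingerprint(first_audio_data)
    # Synchronous if the first fingerprint equals any one of the second
    # fingerprints (claim 1); otherwise not synchronous (claim 5).
    return any(first_fp == fp for fp in second_fingerprints)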
2. The method of claim 1, wherein prior to comparing the first fingerprint information to at least one second fingerprint information, the method further comprises:
and acquiring a message carrying the target video frame, and acquiring at least one piece of second fingerprint information from the acquired message.
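(Illustrative sketch, not part of the claims.) Claim 2 could be read as a simple field lookup, assuming the message carrying the target video frame is modeled as a dictionary with a hypothetical "fingerprints" field; the claims do not fix a message format:

def extract_second_fingerprints(message):
    # message: the message carrying the target video frame, assumed to
    # look like {"video_frame": b"...", "timestamp": 120.0,
    # "fingerprints": ["..."]}.
    return message.get("fingerprints", [])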
3. The method of claim 1, wherein the at least one second fingerprint information comprises:
fingerprint information of second audio data in the second multimedia data having the same timestamp as the target video frame, and fingerprint information of second audio data in the second multimedia data whose timestamp interval from the target video frame is smaller than a target duration.
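(Illustrative sketch, not part of the claims.) Under a plain reading of claim 3, the qualifying second audio data are the segment sharing the target video frame's timestamp plus every segment whose timestamp lies within the target duration of it; all names below are hypothetical:

def select_second_audio(audio_segments, frame_timestamp, target_duration):
    # audio_segments: (timestamp, audio_bytes) pairs from the second
    # multimedia data. An interval of zero covers the same-timestamp
    # segment; any interval below target_duration also qualifies.
    return [audio for ts, audio in audio_segments
            if abs(ts - frame_timestamp) < target_duration]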
4. The method of claim 1, wherein prior to determining the first fingerprint information for the first audio data based on the first audio data, the method further comprises:
and performing down-sampling processing on the first audio data.
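(Illustrative sketch, not part of the claims.) Down-sampling per claim 4 shrinks the audio over which the fingerprint is computed. A naive sketch keeping every n-th sample; the factor is an assumption, and a production implementation would low-pass filter first to limit aliasing:

def downsample(samples, factor=4):
    # Keep every factor-th sample, e.g. 44.1 kHz becomes roughly
    # 11 kHz when factor == 4, so fingerprinting touches less data.
    return samples[::factor]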
5. The method of claim 1, wherein after comparing the first fingerprint information to at least one second fingerprint information, the method further comprises:
and if the first fingerprint information is different from the at least one second fingerprint information, determining that the sound and the picture of the first multimedia data are not synchronous.
6. A sound and picture synchronization judging method is characterized by comprising the following steps:
acquiring, in the online playing process, at least one second audio data corresponding to a target video frame in second multimedia data;
determining at least one second fingerprint information based on the at least one second audio data;
adding the at least one piece of second fingerprint information to a message carrying the target video frame;
and sending the message to a server, wherein the message is used for the server to compare first fingerprint information with the at least one piece of second fingerprint information, and, if the first fingerprint information is the same as any one of the second fingerprint information, to determine that the sound and the picture of first multimedia data are synchronous, wherein the first fingerprint information is fingerprint information determined from first audio data corresponding to the target video frame in the first multimedia data.
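(Illustrative sketch, not part of the claims.) The sender side of claim 6 might look as follows, reusing the hypothetical fingerprint() helper from the sketch after claim 1; send_to_server() and the message layout are likewise assumptions:

def send_to_server(message):
    # Stub transport; a real sender would push the message into the
    # online-playing stream towards the server.
    pass

def publish_frame(video_frame, frame_timestamp, second_audio_segments):
    # second_audio_segments: the at least one second audio data
    # corresponding to the target video frame.
    fingerprints = [fingerprint(seg) for seg in second_audio_segments]
    message = {
        "video_frame": video_frame,
        "timestamp": frame_timestamp,
        "fingerprints": fingerprints,  # the at least one piece of second fingerprint information
    }
    send_to_server(message)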
7. The method of claim 6, wherein before determining at least one second fingerprint information based on the at least one second audio data, the method further comprises:
down-sampling the at least one second audio data.
8. A sound and picture synchronization judging device, characterized in that the device comprises:
the data acquisition module is used for acquiring first audio data corresponding to a target video frame in first multimedia data based on the first multimedia data received from a server;
a fingerprint information determination module for determining first fingerprint information of the first audio data based on the first audio data;
the comparison module is used for comparing the first fingerprint information with at least one piece of second fingerprint information, wherein one piece of second fingerprint information is used for representing second audio data corresponding to the target video frame in second multimedia data received by the server, and the second multimedia data and the first multimedia data correspond to the same online playing process;
and the determining module is used for determining that the sound and the picture of the first multimedia data are synchronous if the first fingerprint information is the same as any one of the second fingerprint information.
9. The apparatus of claim 8, further comprising:
the message acquisition module is used for acquiring a message for bearing the target video frame;
and the fingerprint information acquisition module is used for acquiring at least one piece of second fingerprint information from the acquired message.
10. The apparatus of claim 8, further comprising:
and the down-sampling processing module is used for performing down-sampling processing on the first audio data.
11. The apparatus of claim 8, wherein the determining module is further configured to determine that the sound and the picture of the first multimedia data are not synchronous if the first fingerprint information is different from the at least one second fingerprint information.
12. A sound and picture synchronization judging device, characterized in that the device comprises:
the data acquisition module is used for acquiring at least one second audio data corresponding to a target video frame in the second multimedia data in the online playing process;
a fingerprint information determination module for determining at least one second fingerprint information based on the at least one second audio data;
the adding module is used for adding the at least one piece of second fingerprint information into a message bearing the target video frame;
and the sending module is used for sending the message to a server, wherein the message is used for the server to compare first fingerprint information with the at least one piece of second fingerprint information, and, if the first fingerprint information is the same as any one of the second fingerprint information, to determine that the sound and the picture of the first multimedia data are synchronous, wherein the first fingerprint information is fingerprint information determined from first audio data corresponding to the target video frame in the first multimedia data.
13. The apparatus of claim 12, further comprising:
and the down-sampling processing module is used for performing down-sampling processing on the at least one second audio data.
14. A terminal, characterized in that the terminal comprises one or more processors and one or more memories, wherein at least one piece of program code is stored in the one or more memories, and the program code is loaded and executed by the one or more processors to implement the operations performed by the sound and picture synchronization judging method according to any one of claims 1 to 5, or the operations performed by the sound and picture synchronization judging method according to any one of claims 6 to 7.
15. A computer-readable storage medium, characterized in that at least one piece of program code is stored in the computer-readable storage medium, and the program code is loaded and executed by a processor to implement the operations performed by the sound and picture synchronization judging method according to any one of claims 1 to 5, or the operations performed by the sound and picture synchronization judging method according to any one of claims 6 to 7.
CN201911311808.0A 2019-12-18 2019-12-18 Sound and picture synchronization judgment method and device, terminal and computer readable storage medium Active CN111093108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911311808.0A CN111093108B (en) 2019-12-18 2019-12-18 Sound and picture synchronization judgment method and device, terminal and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911311808.0A CN111093108B (en) 2019-12-18 2019-12-18 Sound and picture synchronization judgment method and device, terminal and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111093108A CN111093108A (en) 2020-05-01
CN111093108B true CN111093108B (en) 2021-12-03

Family

ID=70395726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911311808.0A Active CN111093108B (en) 2019-12-18 2019-12-18 Sound and picture synchronization judgment method and device, terminal and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111093108B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111888765B (en) * 2020-07-24 2021-12-03 腾讯科技(深圳)有限公司 Multimedia file processing method, device, equipment and medium
CN112272327B (en) * 2020-10-26 2021-10-15 腾讯科技(深圳)有限公司 Data processing method, device, storage medium and equipment
CN114554277B (en) * 2020-11-24 2024-02-09 腾讯科技(深圳)有限公司 Multimedia processing method, device, server and computer readable storage medium
CN113099283B (en) * 2021-03-30 2023-02-14 深圳市冠标科技发展有限公司 Method for synchronizing monitoring picture and sound and related equipment
CN112995708A (en) * 2021-04-21 2021-06-18 湖南快乐阳光互动娱乐传媒有限公司 Multi-video synchronization method and device
CN114866827B (en) * 2022-01-07 2024-04-16 广州繁星互娱信息科技有限公司 Audio and video synchronization detection method and device, storage medium and electronic equipment
CN114679623A (en) * 2022-03-16 2022-06-28 山东浪潮超高清视频产业有限公司 Video checking and playing method based on video gene technology
CN115174960B (en) * 2022-06-21 2023-08-15 咪咕文化科技有限公司 Audio and video synchronization method and device, computing equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7499104B2 (en) * 2003-05-16 2009-03-03 Pixel Instruments Corporation Method and apparatus for determining relative timing of image and associated information
EP2876890A1 (en) * 2013-11-21 2015-05-27 Thomson Licensing Method and apparatus for frame accurate synchronization of video streams

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102056026A (en) * 2009-11-06 2011-05-11 中国移动通信集团设计院有限公司 Audio/video synchronization detection method and system, and voice detection method and system
CN103051921A (en) * 2013-01-05 2013-04-17 北京中科大洋科技发展股份有限公司 Method for precisely detecting video and audio synchronous errors of video and audio processing system

Also Published As

Publication number Publication date
CN111093108A (en) 2020-05-01

Similar Documents

Publication Publication Date Title
CN111093108B (en) Sound and picture synchronization judgment method and device, terminal and computer readable storage medium
CN109348247B (en) Method and device for determining audio and video playing time stamp and storage medium
CN108093268B (en) Live broadcast method and device
CN108966008B (en) Live video playback method and device
CN109874043B (en) Video stream sending method, video stream playing method and video stream playing device
CN111107389B (en) Method, device and system for determining live broadcast watching time length
CN111586431B (en) Method, device and equipment for live broadcast processing and storage medium
CN110324689B (en) Audio and video synchronous playing method, device, terminal and storage medium
CN108769738B (en) Video processing method, video processing device, computer equipment and storage medium
CN110139143B (en) Virtual article display method, device, computer equipment and storage medium
CN112929654B (en) Method, device and equipment for detecting sound and picture synchronization and storage medium
CN109451248B (en) Video data processing method and device, terminal and storage medium
CN110750734A (en) Weather display method and device, computer equipment and computer-readable storage medium
CN112104648A (en) Data processing method, device, terminal, server and storage medium
CN111586444B (en) Video processing method and device, electronic equipment and storage medium
CN111092991B (en) Lyric display method and device and computer storage medium
CN107888975B (en) Video playing method, device and storage medium
CN111010588B (en) Live broadcast processing method and device, storage medium and equipment
CN113141541B (en) Code rate switching method, device, equipment and storage medium
CN112616082A (en) Video preview method, device, terminal and storage medium
US20220174356A1 (en) Method for determining bandwidth, terminal, and storage medium
CN112492331B (en) Live broadcast method, device, system and storage medium
CN111478914B (en) Timestamp processing method, device, terminal and storage medium
CN111083162B (en) Multimedia stream pause detection method and device
CN110277105B (en) Method, device and system for eliminating background audio data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant